[pdf-devel] Comments about the HTTP filesystem implementation

Aleksander Morgado Tue, 31 May 2011 12:19:47 -0700

Hi all,

I've been checking what we already have in trunk for the HTTP fsys
implementation, and I'm seeing some issues which would need some
discussion:


1) file_open() on a file in the HTTP filesystem starts downloading the
whole remote file to a local temp file. If the remote file is 4GB, it
will start downloading all 4GB into a temp file, even if no read()
operation is ever issued. That is not desirable.

2) If you open() the file and right away try to read, let's say, 1MB of
data into a buffer in memory, it won't work. If you ask to read 1MB of
data, read() should really wait until the whole megabyte is read. A
read() operation reporting read_bytes < requested_bytes_to_read should
only happen on EOF.

3) Files are downloaded to temp files using an additional 'worker'
thread. This thread will select() on a set of FDs computed by curl for
each HTTP-GET request we issue. As already said, the whole download
process and the possible read() requests are right now completely
disconnected.

4) RIA shouldn't need the file pre-downloaded. In general, we should
never need to fully download the file and then fall back to the disk
filesystem to read from that temp file.
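
To make the semantics asked for in 2) concrete, here is a minimal
sketch in C of a read that only returns short on EOF. It is not the
code in trunk: read_fully(), chunk_read_fn and the in-memory
mem_source are hypothetical names used for illustration, and the
in-memory source merely stands in for data arriving from the network
in small pieces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical low-level source: stores up to `max` bytes in `dst` and
 * returns how many it delivered, 0 on EOF, or -1 on error. It may
 * deliver fewer bytes than asked for. */
typedef long (*chunk_read_fn)(void *ctx, char *dst, size_t max);

/* Keep asking the source until `wanted` bytes are in `buf`.
 * Returns the number of bytes read (less than `wanted` only on EOF),
 * or -1 on error. */
long read_fully(chunk_read_fn source, void *ctx, char *buf, size_t wanted)
{
    size_t done = 0;

    assert(source != NULL && buf != NULL);
    while (done < wanted) {
        long n = source(ctx, buf + done, wanted - done);
        if (n < 0)
            return -1;          /* propagate the error */
        if (n == 0)
            break;              /* EOF: the only reason for a short read */
        done += (size_t)n;
    }
    return (long)done;
}

/* Illustrative in-memory source that hands out at most `max_chunk`
 * bytes per call, imitating partial deliveries. */
struct mem_source {
    const char *data;
    size_t len;
    size_t pos;
    size_t max_chunk;
};

long mem_source_read(void *ctx, char *dst, size_t max)
{
    struct mem_source *src = ctx;
    size_t left = src->len - src->pos;
    size_t n = max;

    if (n > src->max_chunk)
        n = src->max_chunk;
    if (n > left)
        n = left;
    memcpy(dst, src->data + src->pos, n);
    src->pos += n;
    return (long)n;
}
```

With a 10-byte source that delivers 3 bytes at a time, asking for 8
bytes returns 8, and only the next request, which hits EOF, comes
back short.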


My view of how the HTTP filesystem should work is as follows:

 * file_open() should just do an HTTP-HEAD request, and store the
information we get back, especially the Accept-Ranges and Content-Length
values (when available). Note that it is ok if we don't get a
Content-Length value, not a big deal.

 * file_read() should:
    a) if the HTTP server replied "Accept-Ranges: bytes" to the
HTTP-HEAD request of the open(), we should set up an HTTP-GET request
requesting *only* the byte range we want (a chunk of bytes starting
at the current file offset and with the size of the read()
operation). See [1].
    b) if the HTTP server doesn't support range requests, we should
probably return an error. We really don't want to try to read 100 bytes
from a file and then end up fully downloading a 4GB file. Thus, it makes
sense to support only HTTP 1.1 servers with byte serving capabilities.

 * totally avoid the temp file on disk.

 * RIA should be implemented just as a file_read(), with the difference
being that the HTTP-GET request sent for RIA really does go into a
different thread. A later read() request on a previously-RIA-requested
chunk would then wait until the RIA request is finished, if not already
done, and return what the RIA request got. In fact, from an
implementation point of view, file_read() could just rely internally on
a RIA request and block until the RIA request is finished.

 * I would like to make file_read() fully synchronous and blocking, and
let application programmers using this library perform the
file_read() operation themselves in a different thread if they wish to
avoid blocking the main thread during a file_read(). As I said before,
it makes sense to force file_read() to return read_bytes <
requested_bytes only on EOF, which means that file_read() really needs
to block until all requested bytes are read.

 * At some point in the future, when this library is used along with
other libraries implementing event loops, it will make sense to export
the FDs that need to be polled, so that polling is done by the outer
event loop and we avoid the extra threads ourselves. We should keep
this in mind.
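
As a rough illustration of the open()/read() split proposed above,
the sketch below records the two header values of interest from a
HEAD reply and builds the inclusive byte range for one read. All the
names in it (struct http_file_info, note_head_line(),
format_byte_range()) are hypothetical and not existing library API.
The "first-last" string is the form that libcurl's CURLOPT_RANGE
option accepts; on the wire it becomes "Range: bytes=first-last".

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* What open() would remember from the HEAD reply (hypothetical). */
struct http_file_info {
    int accepts_byte_ranges;    /* server sent "Accept-Ranges: bytes" */
    int has_length;             /* a Content-Length value was seen */
    unsigned long long length;  /* valid only if has_length is set */
};

/* If `line` starts with header `name` followed by ':', return a
 * pointer to its value (leading blanks skipped), else NULL.
 * Header names are compared case-insensitively. */
static const char *header_value(const char *line, const char *name)
{
    while (*name != '\0') {
        if (tolower((unsigned char)*line) != tolower((unsigned char)*name))
            return NULL;
        line++;
        name++;
    }
    if (*line != ':')
        return NULL;
    line++;
    while (*line == ' ' || *line == '\t')
        line++;
    return line;
}

/* Feed one response header line; unknown headers are ignored. */
void note_head_line(struct http_file_info *info, const char *line)
{
    const char *v;

    assert(info != NULL && line != NULL);
    if ((v = header_value(line, "Accept-Ranges")) != NULL) {
        /* Simplification: only the exact lowercase unit is accepted. */
        info->accepts_byte_ranges = (strncmp(v, "bytes", 5) == 0);
    } else if ((v = header_value(line, "Content-Length")) != NULL) {
        char *end;
        unsigned long long n = strtoull(v, &end, 10);
        if (end != v) {
            info->has_length = 1;
            info->length = n;
        }
    }
}

/* Write "first-last" for a read of `size` bytes at `offset`.
 * HTTP byte ranges are inclusive, hence the -1.
 * Returns 0 on success, -1 if size is 0 or `out` is too small. */
int format_byte_range(char *out, size_t out_size,
                      unsigned long long offset, unsigned long long size)
{
    int n;

    if (size == 0)
        return -1;
    n = snprintf(out, out_size, "%llu-%llu", offset, offset + size - 1);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

A read of 100 bytes at offset 1000 would thus ask for the range
"1000-1099", and a file_read() on a server that never set
accepts_byte_ranges would return an error instead of downloading
everything.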

Comments?

Cheers,

[1] https://secure.wikimedia.org/wikipedia/en/wiki/Byte_serving

-- 
Aleksander
