Re: [pdf-devel] Comments about the HTTP filesystem implementation

William Demchick Tue, 31 May 2011 18:27:26 -0700

Hi Aleksander,

Many thanks for your thoughts.


Some comments:

>  * file_open() should just do an HTTP-HEAD request, and store the
> information we get back, especially Accept-Ranges and Content-Length
> values (when available). Note that it is ok if we don't get a
> Content-Length value, not a big deal.

I agree entirely.

I could imagine, if ranges are allowed, we might want to immediately
download X bytes (a small number) since the front-end of the PDF will
almost always be read.  But I agree that downloading the entire file
immediately is undesirable.
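
To make the idea concrete, here is a rough sketch of what file_open()
could keep from the HEAD reply, plus the optional small prefetch.  It is
written in Python only for brevity and is not code from the library; the
names RemoteFileInfo, info_from_head and initial_prefetch_range, and the
4096-byte default, are made up for this example.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass
class RemoteFileInfo:
    """What file_open() would remember from the HEAD reply (hypothetical)."""
    accepts_ranges: bool
    content_length: Optional[int]  # None when the server did not send one


def info_from_head(headers: Mapping[str, str]) -> RemoteFileInfo:
    # HTTP header names are case-insensitive, so normalise them first.
    h = {k.lower(): v.strip() for k, v in headers.items()}
    # Accept-Ranges is a comma-separated list of range units, or "none".
    units = [u.strip().lower() for u in h.get("accept-ranges", "").split(",")]
    raw = h.get("content-length")
    # A missing or malformed Content-Length is tolerated, as the thread suggests.
    length = int(raw) if raw is not None and raw.isdigit() else None
    return RemoteFileInfo("bytes" in units, length)


def initial_prefetch_range(info: RemoteFileInfo,
                           prefetch_bytes: int = 4096) -> Optional[Tuple[int, int]]:
    """Inclusive (first, last) byte range to fetch right after open, or None."""
    if not info.accepts_ranges or prefetch_bytes <= 0:
        return None  # no ranges: never fall back to downloading everything
    last = prefetch_bytes - 1
    if info.content_length is not None:
        if info.content_length == 0:
            return None  # empty file, nothing to prefetch
        last = min(last, info.content_length - 1)
    return (0, last)
```

The prefetch size is arbitrary here; the point is only that it is
bounded and that nothing is fetched when ranges are not advertised.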

>  * file_read() should:
>    a) if the HTTP server replied "Accept-Ranges: bytes" in the
> HTTP-HEAD request of the open(), we should set up an HTTP-GET request
> requesting *only* to retrieve the byte range we want (a chunk of bytes
> starting at the current file offset and with the size of the read()
> operation). See [1].
>    b) if the HTTP server doesn't support range requests, we should
> probably return an error. We really don't want to try to read 100 bytes
> from a file and then end up fully downloading a 4GB file. Thus, it makes
> sense to support only HTTP 1.1 servers with byte serving capabilities.

This has some merit.  I guess most modern web servers would support
ranges for static files (although some PDFs might be generated
on-the-fly).
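
As a small illustration of (a) and (b), the sketch below maps a
file_read() at a given offset and size to the HTTP Range header, and
refuses when the server did not advertise byte ranges.  Again this is
Python for brevity, not library code; RangesNotSupported,
range_header_for_read and body_is_requested_range are hypothetical
names.

```python
class RangesNotSupported(Exception):
    """Raised when the server did not advertise "Accept-Ranges: bytes"."""


def range_header_for_read(offset: int, size: int, accepts_ranges: bool) -> dict:
    """Build the request header for reading `size` bytes starting at `offset`."""
    if not accepts_ranges:
        # Case (b): return an error instead of downloading the whole file.
        raise RangesNotSupported("server does not support byte ranges")
    if offset < 0 or size <= 0:
        raise ValueError("offset must be >= 0 and size must be > 0")
    # Byte ranges are inclusive on both ends: bytes=first-last.
    last = offset + size - 1
    return {"Range": f"bytes={offset}-{last}"}


def body_is_requested_range(status: int) -> bool:
    # 206 Partial Content means the body is the range we asked for.
    # A plain 200 means the server ignored Range and is sending the full file.
    return status == 206
```

Checking the status code matters for generated PDFs: a server may
answer 200 with the full body even though we sent a Range header.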

Also, if a bunch of small file_read() calls are made (e.g. 200 bytes,
then 400 bytes, then 300 bytes...), a lot of connections would be
opened, with an undesirable increase in overhead, unless we are smart
with recycling connections where possible.  Even if we recycle
connections, too many small requests would likely cause a bit of a
slowdown.

I presume the PDF parser will only file_read() a given range of the
file once.  If this is not the case, we should probably consider
caching the data we download.  I suppose RIA data should also be
cached.  (I could imagine someone RIAing a large range, but then only
reading a small chunk from that range in any one file_read().  Maybe
this is the solution to the problem of too many requests for small
ranges.)
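
Here is a minimal sketch of that caching idea, assuming a fetch
callback that performs one HTTP range request per call.  It is Python
for brevity and not library code; RangeCache, read_ahead and the fetch
signature are invented for the example, and the cache is unbounded,
which a real implementation would not want.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List, Tuple

# Hypothetical callback: (offset, size) -> bytes, one range request per call.
Fetch = Callable[[int, int], bytes]


class RangeCache:
    """Serve small reads out of larger ranges that were already requested."""

    def __init__(self, fetch: Fetch) -> None:
        self._fetch = fetch
        self._pool = ThreadPoolExecutor(max_workers=1)
        # Each entry is (offset, size, future holding the downloaded bytes).
        self._chunks: List[Tuple[int, int, Future]] = []

    def read_ahead(self, offset: int, size: int) -> None:
        """RIA: start downloading [offset, offset + size) in a worker thread."""
        fut = self._pool.submit(self._fetch, offset, size)
        self._chunks.append((offset, size, fut))

    def read(self, offset: int, size: int) -> bytes:
        """Blocking read; reuses a cached or in-flight range when one covers it."""
        for start, length, fut in self._chunks:
            if start <= offset and offset + size <= start + length:
                data = fut.result()  # waits if the RIA request is still running
                rel = offset - start
                return data[rel:rel + size]
        # Cache miss: issue one request for exactly this range and remember it.
        fut = self._pool.submit(self._fetch, offset, size)
        self._chunks.append((offset, size, fut))
        return fut.result()

    def close(self) -> None:
        self._pool.shutdown()
```

With this shape, a single read_ahead() over a large range followed by
reads of 200, 400 and 300 bytes inside it results in one request rather
than three.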

Thanks.

William Demchick

On Wed, Jun 1, 2011 at 6:10 AM, Aleksander Morgado <aleksan...@gnu.org> wrote:
> Hi all,
>
> I've been checking what we already have in trunk for the HTTP fsys
> implementation, and I'm seeing some issues which would need some
> discussion:
>
> 1) file_open() on a file in the HTTP filesystem starts downloading the
> whole remote file to a local temp file. If the remote file is 4GB, it
> will start to download all 4GB into a temp file, even if no read()
> operation is then done. That's not something desired.
>
> 2) If you open() the file and right away try to read, let's say, 1 MB
> of data into a buffer in memory, it won't work. If you ask to read 1 MB
> of data, read() should really wait until the whole megabyte is read. A
> read() operation reporting read_bytes < requested_bytes_to_read should
> only happen on EOF.
>
> 3) Files are downloaded to temp files using an additional 'worker'
> thread. This thread will select() on a set of FDs computed by curl for
> each HTTP-GET request we issue. As already said, the whole download
> process and the possible read() requests are right now completely
> disconnected.
>
> 4) RIA shouldn't need the file pre-downloaded. In general, we should
> never need to fully download the file and then fall back to the disk
> filesystem to read from that temp file.
>
>
> My view of how the HTTP filesystem should work is as follows:
>
>  * file_open() should just do an HTTP-HEAD request, and store the
> information we get back, especially Accept-Ranges and Content-Length
> values (when available). Note that it is ok if we don't get a
> Content-Length value, not a big deal.
>
>  * file_read() should:
>    a) if the HTTP server replied "Accept-Ranges: bytes" in the
> HTTP-HEAD request of the open(), we should set up an HTTP-GET request
> requesting *only* to retrieve the byte range we want (a chunk of bytes
> starting at the current file offset and with the size of the read()
> operation). See [1].
>    b) if the HTTP server doesn't support range requests, we should
> probably return an error. We really don't want to try to read 100 bytes
> from a file and then end up fully downloading a 4GB file. Thus, it makes
> sense to support only HTTP 1.1 servers with byte serving capabilities.
>
>  * totally avoid the temp file on disk.
>
>  * RIA should be implemented just as a file_read(), with the difference
> being that the HTTP-GET request sent for RIA really does go into a
> different thread. A next read() request on a previously-RIA-requested
> chunk would then wait until the RIA request is finished if not already
> done, and return what the RIA request got. In fact, from an
> implementation point of view, file_read() could just rely internally on
> a RIA request and block until the RIA request is finished.
>
>  * I would like to make file_read() fully synchronous and blocking, and
> let application programmers using this library perform the
> file_read() operation themselves in a different thread if they wish to
> avoid blocking the main thread during a file_read(). As I said before,
> it makes sense to force file_read() to return read_bytes <
> requested_bytes only on EOF; which means that file_read() really needs to
> block until all requested bytes are read.
>
>  * In some future, when this library is used along with other libraries
> implementing event loops, it will make sense to export the FDs that need
> to be polled so that polling is done by the outer event loop, and avoid
> the extra threads ourselves. We should keep this in mind.
>
> Comments?
>
> Cheers,
>
> [1] https://secure.wikimedia.org/wikipedia/en/wiki/Byte_serving
>
> --
> Aleksander
>
