Hi Aleksander,

Many thanks for your thoughts.
Some comments:

> * file_open() should just do an HTTP-HEAD request, and store the
> information we get back, specially Accept-Ranges and Content-Length
> values (when available). Note that it is ok if we don't get a
> Content-Length value, not a big deal.

I agree entirely. If ranges are allowed, we might want to immediately
download a small number of bytes, since the front of the PDF will almost
always be read. But I agree that downloading the entire file immediately
is undesirable.

> * file_read() should:
>   a) if the HTTP server replied "Accept-Ranges: bytes" in the
> HTTP-HEAD request of the open(), we should setup an HTTP-GET request
> requesting *only* to retrieve the byte range we want (a chunk of bytes
> starting in current file offset and with the size of the read()
> operation. See [1].
>   b) if the HTTP server doesn't support ranges request, we should
> probably return an error. We really don't want to try to read 100 bytes
> from a file and then end up fully downloading a 4GB file. Thus, it makes
> sense to support only HTTP 1.1 servers with byte serving capabilities.

This has some merit. I guess most modern web servers support ranges for
static files (although some PDFs might be generated on the fly). Also,
if many small file_read() calls are made (e.g. 200 bytes, then 400
bytes, then 300 bytes...), a lot of connections would be opened, with an
undesirable increase in overhead, unless we are smart about reusing
connections where possible. Even if we reuse connections, too many small
requests would likely cause a slowdown.

I presume the PDF parser will only file_read() a given range of the file
once. If this is not the case, we should probably consider caching the
data we download. I suppose RIA data should also be cached. (I could
imagine someone requesting RIA on a large range, but then only reading a
small chunk from that range in any one file_read().
Maybe this is the solution to the problem of too many requests for small
ranges.)

Thanks.

William Demchick

On Wed, Jun 1, 2011 at 6:10 AM, Aleksander Morgado <aleksan...@gnu.org> wrote:
> Hi all,
>
> I've been checking what we already have in trunk for the HTTP fsys
> implementation, and I'm seeing some issues which would need some
> discussion:
>
> 1) file_open() on a file in the HTTP filesystem starts downloading the
> whole remote file to a local temp file. If the remote file is 4GB, it
> will start to download all 4GB into a temp file, even if no read()
> operation is then done. That's not something desired.
>
> 2) If you open() the file and right away try to read let's say 1Mbyte of
> data into a buffer in memory, it won't work. if you ask to read 1MByte
> of data, read() should really wait until the whole megabyte is read. A
> read() operation reporting read_bytes < requested_bytes_to_read should
> only happen on EOF.
>
> 3) Files are downloaded to temp files using an additional 'worker'
> thread. This thread will select() on a set of FDs computed by curl for
> each http-get request we request. As already said, the whole download
> process and the possible read() requests are right now completely
> disconnected.
>
> 4) RIA shouldn't need the file pre-downloaded. In general, we should
> never need the to fully download the file and then fallback to the disk
> filesystem to read from that temp file.
>
>
> My view of how the HTTP filesystem should work is as follows:
>
> * file_open() should just do an HTTP-HEAD request, and store the
> information we get back, specially Accept-Ranges and Content-Length
> values (when available). Note that it is ok if we don't get a
> Content-Length value, not a big deal.
>
> * file_read() should:
>   a) if the HTTP server replied "Accept-Ranges: bytes" in the
> HTTP-HEAD request of the open(), we should setup an HTTP-GET request
> requesting *only* to retrieve the byte range we want (a chunk of bytes
> starting in current file offset and with the size of the read()
> operation. See [1].
>   b) if the HTTP server doesn't support ranges request, we should
> probably return an error. We really don't want to try to read 100 bytes
> from a file and then end up fully downloading a 4GB file. Thus, it makes
> sense to support only HTTP 1.1 servers with byte serving capabilities.
>
> * totally avoid the temp file in disk.
>
> * RIA should be implemented just as a file_read(), with the difference
> being that the HTTP-GET request sent for RIA really does go into a
> different thread. A next read() request on a previously-RIA-requested
> chunk would then wait until the RIA request is finished if not already
> done, and return what the RIA request got. In fact, from an
> implementation point of view, file_read() could just rely internally in
> a RIA request and block until the RIA request is finished.
>
> * I would like to make file_read() fully synchronous and blocking, and
> let application programmers using this library to perform the
> file_read() operation themselves in a different thread if they wish to
> avoid blocking the main thread during a file_read(). As I said before,
> it makes sense to force file_read() only return read_bytes <
> requested_bytes on EOF; which means that file_read() really needs to
> block until all requested bytes are read.
>
> * In some future, when this library is used along with other libraries
> implementing event loops, it will make sense to export the FDs that need
> to be polled so that polling is done by the outer event loop, and avoid
> the extra threads ourselves. We should keep this in mind.
>
> Comments?
>
> Cheers,
>
> [1] https://secure.wikimedia.org/wikipedia/en/wiki/Byte_serving
>
> --
> Aleksander
>
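
P.S. To make the small-reads concern above a bit more concrete, here is
a rough sketch (mine, not code from trunk) of rounding each file_read()
up to whole blocks before issuing the range request, and serving later
reads from the cached chunk. HTTP_BLOCK_SIZE, aligned_range(),
format_range_header(), chunk_cache_t and cache_read() are all
hypothetical names, and treating a content length of 0 as "unknown" is
only an assumption for the example. The one protocol detail it relies on
is that HTTP byte ranges are inclusive at both ends.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical block size for range requests (not a value from the thread). */
#define HTTP_BLOCK_SIZE 65536u

/* Inclusive byte range, matching the semantics of the HTTP Range header. */
typedef struct {
  uint64_t first;
  uint64_t last;
} byte_range_t;

/* Expand a read of `size` bytes at `offset` to whole blocks.
   Assumption: content_length == 0 means "unknown"; otherwise the range
   is clamped to the last byte of the file. The caller is expected to
   have handled reads starting at or past EOF already. */
byte_range_t
aligned_range (uint64_t offset, uint64_t size, uint64_t content_length)
{
  byte_range_t r;

  assert (size > 0);
  r.first = (offset / HTTP_BLOCK_SIZE) * HTTP_BLOCK_SIZE;
  r.last = ((offset + size - 1) / HTTP_BLOCK_SIZE + 1) * HTTP_BLOCK_SIZE - 1;
  if (content_length > 0 && r.last >= content_length)
    r.last = content_length - 1;
  return r;
}

/* Write "Range: bytes=first-last" into buf.
   Returns the string length, or -1 if buf is too small. */
int
format_range_header (char *buf, size_t buflen, byte_range_t r)
{
  int n = snprintf (buf, buflen, "Range: bytes=%llu-%llu",
                    (unsigned long long) r.first,
                    (unsigned long long) r.last);
  return (n < 0 || (size_t) n >= buflen) ? -1 : n;
}

/* One downloaded chunk kept in memory (could equally hold RIA data). */
typedef struct {
  byte_range_t range;          /* bytes currently held in `data` */
  const unsigned char *data;   /* range.last - range.first + 1 bytes */
  int valid;
} chunk_cache_t;

/* Serve a read from the cached chunk if it is fully covered.
   Returns 1 on a hit (bytes copied to `out`), 0 on a miss, in which
   case the caller would issue a new range request. */
int
cache_read (const chunk_cache_t *c, uint64_t offset, size_t size,
            unsigned char *out)
{
  if (!c->valid || size == 0)
    return 0;
  if (offset < c->range.first || offset + size - 1 > c->range.last)
    return 0;
  memcpy (out, c->data + (offset - c->range.first), size);
  return 1;
}
```

With a 64 KiB block, sequential reads of 200, 400 and 300 bytes from the
start of the file would all fall inside the first block, so only one
range request would need to be sent.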