Hi all,

I've been checking what we already have in trunk for the HTTP fsys implementation, and I'm seeing some issues that need some discussion:

1) file_open() on a file in the HTTP filesystem starts downloading the whole remote file to a local temp file. If the remote file is 4GB, it will start downloading all 4GB into a temp file, even if no read() operation is ever done. That's not desirable.
2) If you open() the file and right away try to read, let's say, 1MB of data into a buffer in memory, it won't work. If you ask to read 1MB of data, read() should really wait until the whole megabyte is read. A read() operation reporting read_bytes < requested_bytes_to_read should only happen on EOF.
3) Files are downloaded to temp files using an additional 'worker' thread. This thread will select() on a set of FDs computed by curl for each HTTP-GET request we issue. As already said, the whole download process and the possible read() requests are right now completely disconnected.
4) RIA shouldn't need the file pre-downloaded. In general, we should never need to fully download the file and then fall back to the disk filesystem to read from that temp file.

My view of how the HTTP filesystem should work is as follows:

* file_open() should just do an HTTP-HEAD request and store the information we get back, especially the Accept-Ranges and Content-Length values (when available). Note that it is OK if we don't get a Content-Length value, not a big deal.
* file_read() should:
  a) If the HTTP server replied "Accept-Ranges: bytes" to the HTTP-HEAD request of the open(), set up an HTTP-GET request asking to retrieve *only* the byte range we want (a chunk of bytes starting at the current file offset, with the size of the read() operation). See [1].
  b) If the HTTP server doesn't support range requests, we should probably return an error. We really don't want to try to read 100 bytes from a file and end up fully downloading a 4GB file. Thus, it makes sense to support only HTTP 1.1 servers with byte-serving capabilities.
* Totally avoid the temp file on disk.
* RIA should be implemented just as a file_read(), with the difference being that the HTTP-GET request sent for RIA really does go into a different thread. A subsequent read() request on a chunk previously requested via RIA would then wait until the RIA request is finished, if it isn't already, and return what the RIA request got. In fact, from an implementation point of view, file_read() could just rely internally on a RIA request and block until the RIA request is finished.
* I would like to make file_read() fully synchronous and blocking, and let application programmers using this library perform the file_read() operation themselves in a different thread if they wish to avoid blocking the main thread during a file_read(). As I said before, it makes sense to force file_read() to return read_bytes < requested_bytes only on EOF, which means that file_read() really needs to block until all requested bytes are read.
* At some point in the future, when this library is used along with other libraries implementing event loops, it will make sense to export the FDs that need to be polled, so that polling is done by the outer event loop and we avoid the extra threads ourselves. We should keep this in mind.

Comments?

Cheers,

[1] https://secure.wikimedia.org/wikipedia/en/wiki/Byte_serving

--
Aleksander
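
To make the file_read() proposal above a bit more concrete, here is a rough C sketch of two pieces of it: building the byte-range value for a read of a given size at the current offset, and a blocking read loop that only returns fewer bytes than requested on EOF. None of this is the trunk code. All names (format_range, blocking_read, range_fetch_fn, mem_source, mem_fetch) are hypothetical, and the in-memory mem_fetch() is only a stand-in for a real ranged HTTP-GET so that the sketch runs without a network. The one protocol fact assumed is that HTTP byte ranges are written as "bytes=first-last" with both ends inclusive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback: deliver up to `len` bytes starting at `offset`
 * (in a real backend, one ranged HTTP-GET). Returns the number of bytes
 * delivered, 0 on EOF, or -1 on error. */
typedef long (*range_fetch_fn)(void *ctx, uint64_t offset, void *buf, size_t len);

/* Format the Range header value for a read of `len` bytes at `offset`.
 * Byte ranges are inclusive, so the last byte is offset + len - 1.
 * Returns 0 on success, -1 if `out` is too small. */
int format_range(char *out, size_t out_size, uint64_t offset, size_t len)
{
    assert(len > 0);
    uint64_t last = offset + (uint64_t)len - 1;
    int n = snprintf(out, out_size, "bytes=%llu-%llu",
                     (unsigned long long)offset, (unsigned long long)last);
    return (n > 0 && (size_t)n < out_size) ? 0 : -1;
}

/* Blocking read: keep fetching until `len` bytes have arrived or the
 * source reports EOF. A short count is therefore only returned on EOF.
 * Advances *offset by the number of bytes read; returns -1 on error. */
long blocking_read(range_fetch_fn fetch, void *ctx, uint64_t *offset,
                   void *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        long got = fetch(ctx, *offset + done, (char *)buf + done, len - done);
        if (got < 0)
            return -1;          /* transport error */
        if (got == 0)
            break;              /* EOF: the only reason for a short read */
        done += (size_t)got;
    }
    *offset += done;
    return (long)done;
}

/* Stand-in for the HTTP backend (assumption, for illustration only):
 * serves bytes from memory, at most `max_chunk` per call, to mimic
 * transfers that arrive in pieces. */
struct mem_source {
    const char *data;
    size_t size;
    size_t max_chunk;
};

long mem_fetch(void *ctx, uint64_t offset, void *buf, size_t len)
{
    const struct mem_source *src = ctx;
    if (offset >= src->size)
        return 0;               /* past the end: EOF */
    size_t avail = src->size - (size_t)offset;
    size_t n = len < avail ? len : avail;
    if (n > src->max_chunk)
        n = src->max_chunk;
    memcpy(buf, src->data + offset, n);
    return (long)n;
}
```

With a real backend, the fetch callback would presumably send the string produced by format_range() in the Range header of each HTTP-GET. The caller-visible contract stays the same either way: blocking_read() returns the full requested size unless the end of the file is reached first.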