Hi,

> Right. I'm generally very careful with adding new APIs, especially such that
> aren't strictly transfer-related and I would say MD5 isn't about transfers.
>
> All new functions take their share of added maintenance and work.

Well, I was just pointing to the abstraction layers already implemented in curl. In essence I was hoping for the symbols to be exported, so that the internal functions would become part of the API. But I see your point that this is not really related to curl's job.

> Well yes, but those two are in the library and in the tool, pretty much for
> the same reason you bring up here!

Exactly. So if the hashing functions were part of the curl API, the tool would just use them instead of doing the same stuff all over again.

> >While looking into this I also noticed that the metalink code does the
> >verification _after_ the download, which Daniel also mentions [0]. In the
> >mentioned RFCs about the headers and XML format I found no mention of the
> >time of the hash processing. Why not do it while downloading?
>
> I don't think there's any good reason other than it hasn't been done.
> Possibly because nobody has cared enough to actually do the work.

So, it turns out things are not as simple as I thought. For complicated stuff like resuming a download, one would need to take special care when "hashing while downloading", since the hash of the first chunk of the file, the part curl is about to extend, has not been computed yet.

Also, I was told that the hashing functions inside Apt are not that slow; the problem is rather the fact that Apt applies multiple hash functions one after another to each received chunk. Since Apt is single-threaded, this really slows down the download.

When resuming a download for which a hash shall be computed, as in the Metalink case, one can either use the existing code and hash at the end, or launch a new thread which hashes the existing file up to the current position while the download starts. Once the thread reaches the current position (which may have moved past the position at which curl was started, due to the continued downloading), it would either be joined or be fed the incoming data on-the-fly from then on. Unfortunately this introduces new complexity.

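To make the simpler, single-threaded variant of the resume case concrete, here is a rough Python sketch (not curl or Apt code). It first hashes whatever is already on disk and only then appends new data, hashing each chunk as it arrives. The function names, the `chunks` iterable standing in for the network transfer, and the choice of SHA-256 are all assumptions made for illustration; the threaded catch-up described above is deliberately left out.

```python
import hashlib
import os


def hash_prefix(path, upto, algo="sha256", bufsize=64 * 1024):
    """Hash the first `upto` bytes of `path` (the part left by the
    interrupted download) and return the live hash object."""
    h = hashlib.new(algo)
    remaining = upto
    with open(path, "rb") as f:
        while remaining > 0:
            block = f.read(min(bufsize, remaining))
            if not block:  # file is shorter than expected
                break
            h.update(block)
            remaining -= len(block)
    return h


def resume_and_hash(path, chunks, algo="sha256"):
    """Append `chunks` to `path` and return the hex digest of the
    whole file. `chunks` is a hypothetical stand-in for the data a
    transfer callback would deliver."""
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    # Catch up on the existing data before any new byte is written.
    h = hash_prefix(path, offset, algo) if offset else hashlib.new(algo)
    with open(path, "ab") as out:
        for chunk in chunks:
            out.write(chunk)
            h.update(chunk)  # hash on-the-fly from here on
    return h.hexdigest()
```

Because the prefix is hashed before the transfer continues, there is no concurrent reading and appending on the same file, at the price of delaying the resumed download by the time it takes to read the existing part once.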
It is not really nice to start reading a file from offset 0 while appending to it at the same time. In today's real world this would probably not cause problems, since the writeback buffer in main memory acts as a cache first and the data then goes to an SSD with good random seek performance. In the classic hard disk model, however, such a strategy would trigger competing seeks to the start and end of the file for reading and writing, decreasing the lifetime of the poor disks and, depending on the speed of the wire, maybe even slowing down the download.

Instead of relying on the filesystem to cache incoming data in main memory while the hashing is still catching up, one could of course let curl allocate dynamic memory repeatedly (again like the example from the curl sources) to store the incoming data until the hashing is finished. Once the hashing is done, we fall back to our normal algorithm of hashing on-the-fly. As you can see this is rather unpleasant, so I would vote for a normal "hashing at the end" when resuming a download.

Also, depending on processing performance, hashing while downloading comes with a computational penalty which *should* fit nicely into the gaps of "waiting for I/O". However, as we saw with the Apt example, once the computation takes too long, it will take longer than the wire and slow down the download. In the Apt case it seems reasonable to go multi-threaded, as all the hash functions work on the same input data. The OS could then figure out the best time slices to interleave with I/O, or possibly move the computational workload to another core.

What do you think (about possible changes for the metalink download stuff)?

> I would prefer to have the entire VTLS part of libcurl turned into a library
> of its own that libcurl could use (although it hasn't happen because there's
> just not enough desire from anywhere to drive such a change). I don't think
> it is libcurl's job to offer neither crypto nor hashing functionality
> outside of transfers.

I see your point about not exposing internal crypto functions to the outside. I also like the idea of an independent abstraction library. From my understanding the separation of VTLS in curl is already pretty decent, so most of the work is done. Of course, when abstracting there is always the conflict between unifying the feature set and keeping the functionality of a specific library available, but my impression is that VTLS definitely provides what most people will need.

> The Metalink code is not in libcurl.

Sorry, I was merely referring to the glue in curl, as I saw similar or even identical MD5 code in the two places (both the library and the tool).

Last but not least, unfortunately there seems to be little ambition among the Apt folks to change anything; the Debian bug tracker is full of bugs regarding dependency resolution or the HTTP handling. To speed up updating a system, many people work around this with tools like apt-fast or apt-metalink, which ask apt for the raw URLs and download them by other means. The metalink folks even use multiple mirrors and feed the metalink file to aria2, which then downloads from multiple servers at the same time.

In any case, since the HTTPS handling in Apt already uses curl, it makes a lot of sense to also use curl for plain HTTP. Since Apt has other download methods which are probably also hand-written, it would also make sense to use curl for /everything/.

I think for our objective we would stick with HTTP and HTTPS using curl, and do the hashing while downloading in concurrent threads. We would then try to upstream this, but I am not too hopeful about seeing these changes upstream anytime soon (also, Apt is currently not multi-threaded due to some bugs ten years ago).

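As a rough illustration of what "hashing while downloading in concurrent threads" could look like, here is a small Python sketch (again not Apt or curl code). Each hash function gets its own worker thread and a bounded queue, and the download loop hands every received chunk to all of them. The names `multi_hash` and `_worker`, the queue depth, and the set of algorithms are assumptions for the example; `chunks` stands in for whatever the transfer delivers.

```python
import hashlib
import queue
import threading


def _worker(hasher, q):
    """Consume chunks from `q` until the None sentinel arrives."""
    while True:
        chunk = q.get()
        if chunk is None:
            return
        hasher.update(chunk)


def multi_hash(chunks, algos=("sha1", "sha256", "sha512"), depth=16):
    """Feed every chunk to one worker thread per hash algorithm and
    return a dict mapping algorithm name to hex digest."""
    hashers = [hashlib.new(a) for a in algos]
    # Bounded queues: a slow hash eventually blocks the producer
    # instead of buffering the whole download in memory.
    queues = [queue.Queue(maxsize=depth) for _ in algos]
    threads = [
        threading.Thread(target=_worker, args=(h, q))
        for h, q in zip(hashers, queues)
    ]
    for t in threads:
        t.start()
    for chunk in chunks:
        for q in queues:
            q.put(chunk)  # bytes are immutable, so sharing is safe
    for q in queues:
        q.put(None)  # tell each worker to finish
    for t in threads:
        t.join()
    return {a: h.hexdigest() for a, h in zip(algos, hashers)}
```

With bounded queues the download is only throttled once the slowest hash falls `depth` chunks behind, rather than on every chunk as in the sequential case. How much real parallelism this buys depends on the runtime; in C with pthreads the same structure would let the OS schedule each hash on its own core.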
Best regards,
Leon

-------------------------------------------------------------------
List admin: http://cool.haxx.se/list/listinfo/curl-library
Etiquette: http://curl.haxx.se/mail/etiquette.html
