On 06/07/2010 08:41 AM, Tony Lewis wrote:
> Micah Cowan wrote:
>
>> Yeah, that was the original thinking. But I still hate it. For one
>> thing, there are no longer any guarantees that recurse-able HTML files
>> end in ".html"
>
> There are a bunch of suffixes that are actively used for HTML plus there is
> no reason that one has to include a suffix at all. Furthermore, the
> existence of a .html suffix is no guarantee that the file really contains
> HTML.

Exactly.

>> It's better to let you explicitly specify what files to download
>
> I think an option that says "spider the site and save any PDF files that you
> find" is useful. It's a matter of figuring out a meaningful way to implement
> "spider the site" for this scenario.

Of course it's useful. It just shouldn't be the only possible mode of
operating. That's exactly why I said we should split off the
accept/reject and "download this, but only to parse it" bits, because
right now the "download/parse" part is hardwired to always happen for
".htm/.html" files, and only for those files, which is nearing
uselessness, for exactly the reasons you state in the first quote-block
above.

> I wonder if it would make more sense to look at the Content-Type header and
> only parse "text/html" files. By using HEAD, you can quickly ignore files
> that don't need to be parsed.

For some value of "quickly". This obviously necessitates extra
round-trips to the server. Can still be useful, but still perhaps not as
useful as doing URL-matching properly. In particular, it would work best
when _combined_ with proper URL-matching, so that you could dictate
which files shouldn't even be bothered with a HEAD (why bother to see if
a *.pdf file has content-type text/html?).

It's made even less useful by the fact that so many servers botch HEAD
completely. Returning errors on HEAD is one problem, but the bigger
problem is servers that provide _erroneous_ responses to HEAD requests.

But there are enough servers that get it right to make this a worthwhile
feature, so long as we document the fact that it takes extra round-trips
(too bad there's no If-Content-Type header in HTTP/1.1 :) ).

--
Micah J. Cowan
http://micah.cowan.name/
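
For illustration only, here is a rough Python sketch of the "URL-matching
first, HEAD second" idea discussed above. It is not how wget does or would
implement this. The names SKIP_SUFFIXES, head_content_type and should_parse
are hypothetical, the suffix list is an arbitrary example, and the sketch
simply trusts whatever the server says in reply to HEAD, so it does nothing
about servers that answer HEAD incorrectly.

```python
import urllib.request
from urllib.parse import urlparse

# Hypothetical example list: URLs matching these are never sent a HEAD.
SKIP_SUFFIXES = (".pdf", ".png", ".jpg", ".gif", ".zip")


def head_content_type(url, timeout=10):
    """Send a HEAD request; return the bare media type, or None on error."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            # Strips parameters such as "; charset=utf-8" and lowercases.
            return resp.headers.get_content_type()
    except OSError:
        # Covers URLError/HTTPError, e.g. servers that reject HEAD outright.
        return None


def should_parse(url, head=head_content_type):
    """Decide whether a URL is worth downloading in order to parse links."""
    path = urlparse(url).path.lower()
    if path.endswith(SKIP_SUFFIXES):
        # URL matching already rules it out: no extra round-trip spent.
        return False
    # Otherwise pay for one extra round-trip and look at Content-Type.
    return head(url) == "text/html"
```

The HEAD step is passed in as a parameter so that it can be swapped out,
for instance for a version that falls back to GET when HEAD fails.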
