I'd suggest taking a look at the current development snapshot; several of the things you mention are bugs that have been fixed since 3.2.0b3.
-Geoff

At 12:54 AM -0700 9/30/01, Andrew Daviel wrote:
>The TODO html with htdig-3.2.0b3 suggests that gzip
>and compress decoders are now implemented. Perhaps I misunderstood or
>am missing some config items. E.g. my test file
>http://andrew.triumf.ca/test/latex.ps.gz returns
> Content-type: application/postscript
> Content-encoding: x-gzip

It's a slightly different issue. Many servers would send:

	Content-type: application/x-gzip

In this case, you could use external converters to ungzip the postscript.
At the moment, I don't think the 3.2 code considers the Content-encoding
header at all, which, as you point out, is a bug. What you describe
shouldn't be too hard to do with external converters, since the most
difficult issue with receiving a MIME type of application/x-gzip is
figuring out what the underlying MIME type is. (There's a rough converter
sketch at the end of this message.)

>becomes unreachable during a run it seems to wait a timeout period for
>each URL. As for example where someone powered down a server while I was

This is a bug in 3.2.0b3. It should be fixed now. The issue is if the
server dies while you still have a pile of URLs on the fetch queue.

>Grace period for dead links/servers

<shrug> Perhaps not a bad idea.

>that it wasn't even checking modification times but found it was sending
>If-Modified-Since in localtime which Apache doesn't understand.

This is an unfortunate bug in the 3.2 code, which should be fixed now.
(Though I'll point out I've seen plenty of Apache servers sending out
Last-Modified: headers in localtime...) There's a short illustration of
the GMT format HTTP expects at the end of this message.

>Revisit interval

This wouldn't be too difficult to implement--the database already stores
the date visited in the DocAccessed field of a DocumentRef. (A rough
sketch of the test is at the end of this message too.)

>others I could visit in the meantime. I think that in fact htdig is doing
>this, except that it does max_connection_requests on an HTTP/1.1 server.

It does do this, unless you adjust any of a number of configuration
attributes, such as max_connection_requests, etc. (It might be a bit
difficult to figure out the best performance balance with HTTP/1.1, since
you'd rather not break the HTTP/1.1 pipelining...)

>the list of new URLs to be visited, but I gave up. I think it ought
>to be possible but it was making my head spin.

At present, we'd rather not multithread. We have plenty of other,
higher-priority projects that need to be done. (Unicode springs to mind.)

>Harvesting Arbitrary Metadata
>
>I was interested in harvesting metadata in documents, e.g. Author or
>Subject from HTML META or PDF pdfmark blocks, or Dublin Core Creator, Date

The 3.2 code includes a variety of new word tagging, including Author and
Subject. At the moment, the HTML parser doesn't do all of this, though the
bigger issue is that htsearch doesn't allow restricting searches to
specific metadata.

>rev=made mailto, or the last mailto found on the page (for in-house use,
>email harvesting is acceptable and it's often the page author or
>maintainer)

I don't really see this as a feature needed in ht://Dig itself, since you
can do this with a script running on top of the htdig output using the -s
flag. See:

<http://www.htdig.org/files/contrib/scripts/showdead.pl>
<http://www.htdig.org/files/contrib/scripts/report_missing_pages.pl>

>Indexing non-parsable objects (images & multimedia, etc.)

Sure. Actually this wouldn't be so hard--you'd just need to make sure
htpurge won't delete the URLs without excerpts, etc.
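For the gzipped postscript case, here's roughly the kind of external
converter I had in mind. This is only a sketch under a few assumptions of
mine: the external_parsers "source/type->target/type command" converter
syntax, the argument order passed to the converter, the script path, and
the use of ps2ascii from ghostscript should all be checked against
attrs.html and your installation before relying on it.

#!/usr/bin/env python
# Hypothetical converter for gzipped PostScript -- a sketch only.
# Assumed config line (verify the attribute name and syntax in attrs.html):
#   external_parsers: application/x-gzip->text/plain /usr/local/bin/gunzip_ps.py
# Assumes htdig passes the path of the fetched document as the first
# argument, and that ps2ascii (from ghostscript) is on the PATH.
import gzip
import subprocess
import sys

def main():
    infile = sys.argv[1]          # temporary file htdig fetched

    # Decompress the gzipped PostScript in memory.
    ps_data = gzip.open(infile, 'rb').read()

    # Feed the PostScript to ps2ascii, which writes plain text to stdout;
    # htdig would then index that output as text/plain.
    proc = subprocess.Popen(['ps2ascii'], stdin=subprocess.PIPE)
    proc.communicate(ps_data)
    sys.exit(proc.returncode)

if __name__ == '__main__':
    main()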
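On the If-Modified-Since point, what HTTP wants is an RFC 1123 date in
GMT, not localtime. A quick illustration of the format (not htdig code,
just a few lines of Python to show what the header should look like):

import time

# RFC 1123 date in GMT, the form HTTP expects for If-Modified-Since
# and Last-Modified headers.
stamp = time.time()
print('If-Modified-Since: ' +
      time.strftime('%a, %d %b %Y %H:%M:%S GMT', time.gmtime(stamp)))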
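And for the revisit interval, a minimal sketch of the test I have in mind,
assuming a hypothetical revisit_interval attribute (in days) on top of the
DocAccessed time we already store for each DocumentRef:

import time

def needs_revisit(doc_accessed, revisit_interval_days):
    """Return True if the document is due for another fetch.

    doc_accessed is the Unix time stored in DocAccessed;
    revisit_interval_days is a hypothetical config attribute."""
    return (time.time() - doc_accessed) >= revisit_interval_days * 86400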
