I'd suggest taking a look at the current development snapshot since 
several things you mentioned are bugs that have been fixed since 
3.2.0b3.

-Geoff

At 12:54 AM -0700 9/30/01, Andrew Daviel wrote:
>The TODO html with htdig-3.2.0b3 suggests that gzip
>and compress decoders are now implemented. Perhaps I misunderstood or
>am missing some config items. E.g. my test file
>http://andrew.triumf.ca/test/latex.ps.gz returns
>   Content-type: application/postscript
>   Content-encoding: x-gzip

It's a slightly different issue. Many servers would send:
Content-type: application/x-gzip

In this case, you could use external converters to ungzip the 
postscript. At the moment, I don't think the 3.2 code considers the 
Content-encoding header in the least, which as you point out is a 
bug. What you describe shouldn't be too hard to do with external 
converters, since the most difficult issue with receiving a MIME type 
of application/x-gzip is figuring out what the core MIME type is.
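
To make that concrete, here's a rough sketch (in Python, just for 
brevity) of the kind of external converter I mean: it gunzips 
whatever file it's handed and writes the result to stdout. The 
script and its calling convention are assumptions on my part, so 
check the external_parsers documentation for the exact interface 
before wiring anything in:

   #!/usr/bin/env python
   # Rough sketch of an external converter: gunzip the temporary file
   # we're handed (assumed to be the first argument) and write the
   # decompressed document to stdout for the next parser in the chain.
   import gzip
   import shutil
   import sys

   with gzip.open(sys.argv[1], "rb") as src:
       shutil.copyfileobj(src, sys.stdout.buffer)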

>becomes unreachable during a run it seems to wait a timeout period for
>each URL. As for example where someone powered down a server while I was

This is a bug in 3.2.0b3. It should be fixed now. The issue is if the 
server dies when you still have a pile of URLs on the fetch queue.
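
The gist of the fix is simply to remember that a server has stopped 
answering and to skip the rest of its queued URLs rather than eating 
a full timeout on each one. Very roughly (a toy sketch in Python, 
not the actual Retriever code):

   # Toy sketch, not the actual Retriever code: after the first failure
   # from a host, treat it as dead and skip its remaining queued URLs.
   from urllib.parse import urlsplit
   from urllib.request import urlopen

   def fetch_all(queue, timeout=30):
       dead_servers = set()
       for url in queue:
           host = urlsplit(url).netloc
           if host in dead_servers:
               continue                  # don't wait out another timeout
           try:
               urlopen(url, timeout=timeout).read()
           except OSError:               # connection errors and timeouts
               dead_servers.add(host)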

>Grace period for dead links/servers

<shrug> Perhaps not a bad idea.

>that it wasn't even checking modification times but found it was sending
>If-Modified-Since in localtime which Apache doesn't understand.

This is an unfortunate bug in the 3.2 code, which should be fixed now.
(Though I'll point out I've seen plenty of Apache servers sending out 
Last-Modified: headers in localtime...)
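
For the record, If-Modified-Since needs to be an RFC 1123 date in 
GMT, not localtime. Just to illustrate the format (this isn't the 
ht://Dig code):

   # Illustration only, not the ht://Dig code: build an If-Modified-Since
   # header as an RFC 1123 date in GMT.
   import time
   from email.utils import formatdate

   # usegmt=True forces the "GMT" zone designator instead of a numeric
   # local offset.
   header = "If-Modified-Since: " + formatdate(time.time(), usegmt=True)
   # e.g. "If-Modified-Since: Sun, 30 Sep 2001 07:54:00 GMT"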

>Revisit interval

This wouldn't be too difficult to implement--the database already 
stores the date visited in the DocAccessed field of a DocumentRef.
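
The check itself would just compare that stored date against a 
configured interval before re-queueing the URL. Something along 
these lines, though the attribute name and helper here are made up 
for the sake of the example:

   # Sketch only: skip documents visited more recently than a
   # (hypothetical) revisit interval, using the stored last-access time.
   import time

   REVISIT_INTERVAL = 7 * 24 * 3600      # made-up setting, in seconds

   def needs_revisit(doc_accessed, now=None):
       """doc_accessed: Unix timestamp of the last visit (DocAccessed)."""
       if now is None:
           now = time.time()
       return now - doc_accessed >= REVISIT_INTERVAL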

>others I could visit in the meantime. I think that in fact htdig is doing
>this, except that it does max_connection_requests on an HTTP/1.1 server.

It does this, unless you adjust configuration attributes such as 
max_connection_requests. (It might be a bit 
difficult to figure out the best performance balance in HTTP/1.1 
since you'd rather not break the HTTP/1.1 pipelining...)
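
The balance I mean is roughly: keep reusing one persistent 
connection for up to max_connection_requests documents, then close 
it and give other servers a turn. A toy illustration of that idea 
(not the htdig retriever; only the attribute name is borrowed from 
the configuration):

   # Toy illustration: reuse a persistent HTTP/1.1 connection for at
   # most max_connection_requests fetches, then close it so other
   # servers get a turn.
   from http.client import HTTPConnection

   def fetch_batch(host, paths, max_connection_requests=50):
       conn = HTTPConnection(host)
       bodies = []
       for count, path in enumerate(paths, 1):
           conn.request("GET", path)
           bodies.append(conn.getresponse().read())
           if count >= max_connection_requests:
               break                     # leave the rest for a later pass
       conn.close()
       return bodies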

>the list of new URLs to be visited, but I gave up. I think it ought
>to be possible but it was making my head spin.

At present, we'd rather not multithread. We have plenty of other, 
much higher-priority projects that need to be done. (Unicode springs 
to mind.)

>Harvesting Arbitrary Metadata
>
>I was interested in harvesting metadata in documents, e.g. Author or
>Subject from HTML META or PDF pdfmark blocks, or Dublin Core Creator, Date

The 3.2 code includes a variety of new word tagging, including Author 
and Subject. At the moment, the HTML parser doesn't do all of this, 
though the bigger issue is that htsearch doesn't allow restricting 
searches to specific metadata.
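
Until then, metadata like that is easy enough to pull out yourself. 
Just as an illustration (this is not the 3.2 HTML parser), a few 
lines of Python will collect META author/subject/Dublin Core values:

   # Illustration only, not the 3.2 HTML parser: collect interesting
   # <META NAME=... CONTENT=...> values from a page.
   from html.parser import HTMLParser

   class MetaHarvester(HTMLParser):
       WANTED = ("author", "subject", "dc.creator", "dc.date")

       def __init__(self):
           super().__init__()
           self.meta = {}

       def handle_starttag(self, tag, attrs):
           if tag == "meta":
               attrs = dict(attrs)
               name = (attrs.get("name") or "").lower()
               if name in self.WANTED:
                   self.meta[name] = attrs.get("content", "")

   harvester = MetaHarvester()
   harvester.feed('<meta name="Author" content="A. N. Author">')
   print(harvester.meta)                 # {'author': 'A. N. Author'}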

>rev=made mailto, or the last mailto found on the page (for in-house use,
>email harvesting is acceptable and it's often the page author or
>maintainer)

I don't really see this as a feature needed in ht://Dig itself, since 
you can do this as a script running on top of the htdig output using 
the -s flag. See 
<http://www.htdig.org/files/contrib/scripts/showdead.pl>
<http://www.htdig.org/files/contrib/scripts/report_missing_pages.pl>
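
Something in the spirit of those scripts, sketched here in Python, 
might grab the rev=made address and fall back to the last mailto: 
link on the page (the function is made up; adapt it to whatever 
output you're feeding it):

   # Sketch of the post-processing idea: prefer a <link rev=made>
   # address, otherwise fall back to the last mailto: link on the page.
   import re

   MAILTO = re.compile(r"href\s*=\s*['\"]?mailto:([^'\"\s>]+)",
                       re.IGNORECASE)

   def page_contact(html):
       for tag in re.findall(r"<link\b[^>]*>", html, re.IGNORECASE):
           if re.search(r"rev\s*=\s*['\"]?made\b", tag, re.IGNORECASE):
               match = MAILTO.search(tag)
               if match:
                   return match.group(1)
       addresses = MAILTO.findall(html)
       return addresses[-1] if addresses else None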

>Indexing non-parsable objects (images & multimedia, etc.)

Sure. Actually this wouldn't be so hard--you'd just need to make sure 
htpurge won't delete the URLs without excerpts, etc.

