Re: [htdig] Followup: update dig missing new documents in mailinglist archive

Geoff Hutchison Sat, 01 Dec 2001 15:59:08 -0800

At 9:18 PM +0100 12/1/01, Roman Maeder wrote:
>But the index pages are excluded from the database, so it never checks
>those for new links.


Yes, this could be a problem.

>Shouldn't it either traverse the document space starting with the
>start URLs also for update digs (in the same way it does with an initial
>dig), or maybe keep a list of excluded documents and check those as well?

You don't want to traverse the document space if you can help 
it--that would require parsing all the documents again or keeping the 
entire link structure of the site. The latter may be useful for other 
purposes, but that could involve some serious additional storage for 
some sites.

As far as keeping a list of excluded documents, this may be the right 
way to go. Right now htmerge/htpurge completely remove all traces of 
a document if it was marked "noindex." Probably the solution is to 
leave the document (still marked "noindex") but make sure all words 
are removed from the word db. This way it would never come up in a 
search.
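
To make the idea concrete, here is a rough sketch in Python. It is a toy 
model, not ht://Dig code: the dicts standing in for the document db and 
word db, the Doc class, and the purge, update_dig and fetch names are all 
made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical document record; not the real ht://Dig structure."""
    url: str
    noindex: bool = False
    links: list = field(default_factory=list)


def purge(doc_db, word_db):
    """Remove the words of noindex documents but keep their doc records."""
    noindex_urls = {url for url, doc in doc_db.items() if doc.noindex}
    for word in list(word_db):
        word_db[word] -= noindex_urls
        if not word_db[word]:
            # No indexable document contains this word any more.
            del word_db[word]


def update_dig(doc_db, fetch):
    """Recheck every known document, noindex ones included, for new links."""
    queue = list(doc_db)
    while queue:
        url = queue.pop()
        doc = fetch(url)  # fetch is an assumed callable returning a Doc
        doc_db[url] = doc
        for link in doc.links:
            if link not in doc_db:
                doc_db[link] = Doc(link)  # placeholder until it is fetched
                queue.append(link)
```

Because the noindex index page stays in the document db, an update dig 
still revisits it and finds links to new messages, while the purge keeps 
it out of search results.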

This is a very good point you've raised, thanks.
-- 
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
