According to Neal Richter:
> On Fri, 3 Oct 2003, Lachlan Andrew wrote:
> > I'm not sure that I understand this.  If a page 'X' is linked only by
> > a page 'Y' which isn't changed since the previous dig, do we parse
> > the unchanged page 'Y'?  If so, why not run  htdig -i?  If not, how
> > do we know that page 'X' should still be in the database?
> 
> X does not change, but Y does.. it no longer has a link to X.
> 
> If the website is big enough, htdig -i is wasteful of network bandwidth.
> 
> The logical error as I see it is that we revisit the list of documents
> currently in the index, rather than starting from the beginning and
> spidering... then removing all the documents we didn't find links for.

But if we need to re-spider everything, don't we need to re-index all
documents, whether they've changed or not?  If so, then we need to do
htdig -i all the time.  If we don't reparse every document, we need some
other means to re-validate every document to which an unchanged document
has links.

I think you misinterpreted what Lachlan suggested, i.e. the case where Y
does NOT change.  If Y is the only document with a link to X, and Y does
not change, it will still have the link to X, so X is still "valid".
However, if Y didn't change, and htdig (without -i) doesn't reindex Y,
then how will it find the link to X to validate X's presence in the db?

> > I'd be inclined not to fix this until after we've released the next
> > "archive point", whether that be 3.2.0b5 or 3.2.0rc1...

I'd be inclined to agree.  If it comes down to the possibility of
losing valid documents in the db vs. keeping invalid ones, I'd prefer
the latter behaviour.  Until we can find a way to ensure all currently
linked documents remain in the db, without having to reparse them all,
I think the current behaviour is the best compromise.  If you
want to reparse everything to ensure a clean db with accurate linkages,
that's what -i is for.
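
To make the trade-off concrete, here is a rough sketch in Python.  It is
not htdig code: update_dig, prev_links, changed and parse_links are all
made-up names, and it assumes a per-document list of outgoing links saved
from the previous dig, which I'm not claiming the current db provides.  It
only illustrates Lachlan's case, where an unchanged Y holds the sole link
to X.

```python
from collections import deque

def update_dig(start_url, prev_links, changed, parse_links, reuse_cached_links):
    """Spider from start_url and return the set of URLs reached.

    prev_links: dict url -> outgoing links recorded at the previous dig
                (a hypothetical cache, assumed for this sketch)
    changed:    set of URLs modified since the previous dig
    parse_links: callable url -> links, standing in for fetch + reparse
    """
    reached = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in reached:
            continue
        reached.add(url)
        if url in changed or url not in prev_links:
            links = parse_links(url)      # new or modified: reparse it
        elif reuse_cached_links:
            links = prev_links[url]       # unchanged: trust the saved links
        else:
            links = []                    # unchanged and not reparsed: links unseen
        queue.extend(links)
    return reached

# start links to Y, Y links to X; only start has changed.
prev_links = {"start": ["Y"], "Y": ["X"], "X": []}
changed = {"start"}

def parse_links(url):
    return prev_links[url]

without_cache = update_dig("start", prev_links, changed, parse_links, False)
with_cache = update_dig("start", prev_links, changed, parse_links, True)

# Documents a "remove whatever we didn't reach" pass would purge:
print(sorted(set(prev_links) - without_cache))   # ['X']
print(sorted(set(prev_links) - with_cache))      # []
```

Without the saved links, the purge step would throw away X even though it
is still validly linked from Y.  With them, X survives and nothing gets
reparsed, but that only works if the links are actually kept somewhere.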

A somewhat related problem/limitation in update digs is that the backlink
count and link depth from start_url may not get properly updated for
documents that aren't reparsed.  If these matter to you, periodic full
digs may be needed to restore the accuracy of these fields.
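
As an illustration of what those two fields depend on, here is a small
Python sketch that recomputes them from a complete link graph.  The
function name and the links table are hypothetical, and the exact way
htdig counts backlinks may differ; the point is only that both values are
properties of the whole graph, so they go stale when only part of it is
reparsed.

```python
from collections import Counter, deque

def recompute_link_stats(start_url, links):
    """Breadth-first walk from start_url over links (dict url -> outgoing links).

    Returns (depth, backlinks): depth is the fewest hops from start_url,
    backlinks is the number of links pointing at each reachable document.
    """
    depth = {start_url: 0}
    backlinks = Counter()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            backlinks[target] += 1
            if target not in depth:           # first visit gives the shortest depth
                depth[target] = depth[url] + 1
                queue.append(target)
    return depth, backlinks

# Hypothetical site: start links to Y and Z, and both link to X.
links = {"start": ["Y", "Z"], "Y": ["X"], "Z": ["X"], "X": []}
depth, backlinks = recompute_link_stats("start", links)
print(depth["X"], backlinks["X"])   # 2 2
```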

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


_______________________________________________
ht://Dig Developer mailing list:
[EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev
