> But if we need to re-spider everything, don't we need to re-index all
> documents, whether they've changed or not?  If so, then we need to do
> htdig -i all the time.  If we don't reparse every document, we need some
> other means to re-validate every document to which an unchanged document
> has links.

  Nope.  If head_before_get=TRUE we issue a HEAD request, and the HTTP
server is kind enough to give us the document's timestamp in the
Last-Modified header.  If the timestamps match, we don't bother to download
the document.
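  A minimal sketch of that timestamp comparison (the names here are mine,
not htdig's; the real logic lives in the C++ retriever, this is just an
illustration in Python):

```python
from email.utils import parsedate_to_datetime

def needs_refetch(stored_last_modified, head_headers):
    """Return True if the document should be downloaded again.

    stored_last_modified: the Last-Modified string saved at index time
    (or None if the server never sent one).
    head_headers: headers returned by the HEAD request, as a dict.
    """
    server_stamp = head_headers.get("Last-Modified")
    if stored_last_modified is None or server_stamp is None:
        # No timestamp to compare against: fetch to be safe.
        return True
    # Compare as parsed dates rather than raw strings, so equivalent
    # timestamps in different formats still match.
    return (parsedate_to_datetime(server_stamp)
            != parsedate_to_datetime(stored_last_modified))
```

  If either side lacks a timestamp we fall back to a full GET, which is the
conservative choice.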

> I think you misinterpreted what Lachlan suggested, i.e. the case where Y
> does NOT change.  If Y is the only document with a link to X, and Y does
> not change, it will still have the link to X, so X is still "valid".
> However, if Y didn't change, and htdig (without -i) doesn't reindex Y,
> then how will it find the link to X to validate X's presence in the db?

  Changing Y is the point!  I think my original description was unclear.

  Bug #1

  1) Website contains page X.  There is at least one page that contains a
     link to X.
  2) Remove all links to X in the website, but don't delete it. Run htdig
     without the -i option.
  3) Do a search and notice that page X is still returned, even though it
     technically isn't in the 'website' anymore... it is orphaned on the
     webserver.

  Bug #2

  1) Make start_url contain two separate websites and set up filters
     accordingly.
  2) Run htdig -i; all is OK.
  3) Remove one of the websites from start_url.
  4) Rerun htdig without -i.
  5) Do a search and note that the removed website's pages are still
     returned!
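  Both bugs boil down to the same thing, which a toy model makes obvious
(the function and data names are mine, purely for illustration):

```python
def incremental_update(db, start_urls, links):
    """Toy model of the current behaviour without -i: the spider adds
    every page reachable from start_urls, but nothing already in the
    db ever gets dropped.

    db:         set of URLs currently indexed
    start_urls: list of starting URLs for the crawl
    links:      dict mapping each URL to its current outgoing links
    """
    reachable = set()
    queue = list(start_urls)
    while queue:
        url = queue.pop()
        if url in reachable:
            continue
        reachable.add(url)
        queue.extend(links.get(url, ()))
    # Bug: orphans in db (db - reachable) survive the crawl.
    return db | reachable
```

  Bug #1 is the case where /x was indexed but every link to it is gone;
Bug #2 is the case where a whole site's pages were indexed but the site was
dropped from start_url.  Either way, the stale entries never leave the db.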


> > > I'd be inclined not to fix this until after we've released the next
> > > "archive point", whether that be 3.2.0b5 or 3.2.0rc1...
>
> I'd be inclined to agree.  If it comes down to the possibility of
> losing valid documents in the db vs. keeping invalid ones, I'd prefer
> the latter behaviour.  Until we can find a way to ensure all currently
> linked documents remain in the db, without having to reparse them all,
> then I think the current behaviour is the best compromise.  If you
> want to reparse everything to ensure a clean db with accurate linkages,
> that's what -i is for.

  If you change every page that links to a page (without deleting the
target page itself), the HTTP headers of those changed pages differ, so
htdig re-downloads and re-parses them and sees that the link is gone..
thus giving correct behavior.

  The fix accomplishes this.  There is no danger of 'losing valid
documents': with the proper logic, the Last-Modified datestamp in the HTTP
header guarantees correct behavior.  If a page changes, it is re-downloaded
and re-parsed, and its links are examined for changes.  Orphaned pages are
never revisited, and are purged after the spider is done.
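  In the same toy terms as above, the fix amounts to remembering what the
spider actually reached and purging the rest (again, hypothetical names,
not htdig code):

```python
def crawl_with_purge(db, start_urls, links):
    """Sketch of the fixed behaviour: track which documents the spider
    actually reaches, then drop everything else from the db.

    Returns (new_db, purged): the documents kept and the orphans removed.
    """
    reachable = set()
    queue = list(start_urls)
    while queue:
        url = queue.pop()
        if url in reachable:
            continue
        reachable.add(url)
        queue.extend(links.get(url, ()))
    purged = db - reachable  # orphaned pages and removed-site pages
    return reachable, purged
```

  Combined with the HEAD-based timestamp check, a page only costs a full
GET when it actually changed, and anything unreachable is cleaned out at
the end of the run.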

  I've spent hours inside a debugger examining how the spider does
things... I will continue to look for efficiency gains.

  This bug is minor, and a decent workaround exists... so I agree with
waiting to commit the fix.

  I'll sit on it and come up with an actual test case at the appropriate
time to demonstrate the bug.  The way we currently do it is just plain
inefficient: we revisit pages that don't need it and carry deadweight
cruft in the database.

  However, I would strongly recommend we enable head_before_get by
default.  We're basically wasting bandwidth like drunken sailors with it
off!

  Thanks.

  Jessica:  I'm heading to the Pub here in Bozeman, MT.  I'll draw some
stuff on napkins for ya!

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

_______________________________________________
ht://Dig Developer mailing list:
[EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev
