> But if we need to re-spider everything, don't we need to re-index all
> documents, whether they've changed or not?  If so, then we need to do
> htdig -i all the time.  If we don't reparse every document, we need some
> other means to re-validate every document to which an unchanged document
> has links.
Nope, if head_before_get=TRUE we use the HEAD request, and the HTTP server
is kind enough to give us the timestamp on the document in the header.
If the timestamps are the same, we don't bother to download it.

> I think you misinterpreted what Lachlan suggested, i.e. the case where Y
> does NOT change.  If Y is the only document with a link to X, and Y does
> not change, it will still have the link to X, so X is still "valid".
> However, if Y didn't change, and htdig (without -i) doesn't reindex Y,
> then how will it find the link to X to validate X's presence in the db?

Changing Y is the point!  I think my original description was unclear.

Bug #1

1) Website contains page X.  There is at least one page that contains a
   link to X.
2) Remove all links to X in the website, but don't delete X itself.  Run
   htdig without the -i option.
3) Do a search and notice that page X is still returned, even though it
   technically isn't in the 'website' anymore... it is orphaned on the
   webserver.

Bug #2

1) Make start_url contain two separate websites and set up filters
   accordingly.
2) Run htdig -i... all is OK.
3) Remove one of the websites from start_url.
4) Rerun htdig without -i.
5) Do a search and note that the removed website's pages are still
   returned!

> > > I'd be inclined not to fix this until after we've released the next
> > > "archive point", whether that be 3.2.0b5 or 3.2.0rc1...
>
> I'd be inclined to agree.  If it comes down to the possibility of
> losing valid documents in the db vs. keeping invalid ones, I'd prefer
> the latter behaviour.  Until we can find a way to ensure all currently
> linked documents remain in the db, without having to reparse them all,
> then I think the current behaviour is the best compromise.  If you
> want to reparse everything to ensure a clean db with accurate linkages,
> that's what -i is for.

If you change all pages to remove a link to a page that doesn't get
deleted, the HTTP header of each changed page will change and htdig
re-downloads it, thus giving correct behaviour.
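The head_before_get behaviour described above boils down to a timestamp
comparison.  Here is a minimal Python sketch of that decision (htdig itself
is C++; the function and variable names below are illustrative, not actual
htdig identifiers):

```python
# Sketch of the head_before_get decision: issue a HEAD request first,
# then compare the server's Last-Modified header against the time the
# document was last crawled.  Only a changed document earns a full GET.
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def needs_refetch(last_modified_header, last_crawl_time):
    """Return True if the document should be re-downloaded.

    last_modified_header: the Last-Modified value from the HEAD
    response, or None if the server did not send one.
    last_crawl_time: timezone-aware datetime of the previous crawl.
    """
    if last_modified_header is None:
        return True  # no timestamp to compare; fetch to be safe
    server_time = parsedate_to_datetime(last_modified_header)
    return server_time > last_crawl_time  # unchanged => skip the GET
```

If the page is unchanged, the spider skips the GET entirely, which is why
turning head_before_get on saves so much bandwidth.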
The fix accomplishes this.  There is no danger of 'losing valid
documents'.  The datestamp in the HTTP header, with the proper logic,
guarantees correct behaviour: if a page changes, it is re-downloaded and
reparsed, and its links are examined for changes.  Orphaned pages are
never revisited, and are purged after the spider is done.

I've spent hours inside a debugger examining how the spider does
things... I will continue to look for efficiency gains.

This bug is minor, and a decent workaround exists... so I agree with
waiting to commit the fix.  I'll sit on it and come up with an actual
test case at the appropriate time to demonstrate the bug.  It's just
plain inefficient the way we currently do it: we revisit pages that
don't need it and carry cruft in the database that is deadweight.

However, I would strongly recommend we enable head_before_get by
default.  We're basically wasting bandwidth like drunken sailors with it
off!!!

Thanks.

Jessica: I'm heading to the Pub here in Bozeman, MT.  I'll draw some
stuff on napkins for ya!

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

_______________________________________________
ht://Dig Developer mailing list: [EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev