> I think this only became an issue because of persistent connections.
> Correct me if I'm wrong, but I think htdig's behaviour in the past
> (i.e. 3.1.x, and maybe 3.2 without head_before_get=TRUE) was to do a GET,
> and upon seeing the headers if it decided it didn't need to refetch the
> file, it would simply close the connection right away and not read the
> stream of data for the file. No wasted bandwidth, but maybe it caused
> some unnecessary overhead on the server, which probably started serving
> up each file (including running CGI scripts if that's what made the page)
> before realising the connection was closed.

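For illustration only, the "look at the headers, then decide" step described above can be sketched as a small Python function. This is not htdig's code (htdig is written in C++); the name needs_refetch and the dict-of-headers input are assumptions made for the sketch. The same decision applies whether the headers came from a HEAD request or from the start of a GET whose body is then left unread.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def needs_refetch(headers, last_indexed):
    """Return True if the document body should be downloaded again.

    headers: dict of response header names to values (hypothetical input).
    last_indexed: timezone-aware datetime of the previous successful index.
    """
    value = headers.get("Last-Modified")
    if value is None:
        # No modification date: we cannot tell, so fetch to be safe.
        return True
    try:
        modified = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Unparseable date: same conservative answer.
        return True
    if modified.tzinfo is None:
        # A "-0000" zone parses as naive; treat it as UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > last_indexed
```

With head_before_get enabled, a False result means no GET is sent at all; without it, a False result is the point where the connection would be closed early.
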
True, but we can override the current setting if '-i' is given to force
head_before_get=false.

> The critical part of the above, which I was trying to explain before, is
> point 4 (a). If a document hasn't changed, htdig would need somehow to
> keep track of every link that document had to others, so that it could
> keep traversing the hierarchy of links as it crawls its way through
> to every "active" page on the site. That would require additional
> information in the database that htdig doesn't keep track of right now.
> Right now, the only way to do a complete crawl is to reparse every
> document.

Yep, this is true. On the plus side, if we do keep and maintain that list,
I've got a stack of research papers talking about what can be done with
that list to make searching better. It opens up a world of possibilities
for improving relevance ranking, learning relationships between pages, etc.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
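
A rough Python sketch of the idea discussed above: if the outgoing links of each document were stored, an unchanged document could still be traversed without reparsing it. Everything here is hypothetical (the crawl function, the fetch callable, and the link_store dict standing in for the extra database information); it is not how htdig currently works.

```python
from collections import deque


def crawl(start_url, fetch, link_store):
    """Breadth-first crawl that reuses stored links for unchanged pages.

    fetch(url) is a hypothetical callable returning the list of outgoing
    links if the page changed and was reparsed, or None if it is unchanged.
    link_store is a dict mapping url -> links recorded on the last crawl.
    Returns (visited, reparsed): all URLs reached, and those reparsed.
    """
    visited, reparsed = set(), set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        links = fetch(url)
        if links is None:
            # Unchanged: fall back to the links recorded last time.
            links = link_store.get(url, [])
        else:
            # Changed: remember the fresh links for the next crawl.
            reparsed.add(url)
            link_store[url] = list(links)
        queue.extend(links)
    return visited, reparsed
```

The stored url -> links mapping is also the raw material for the link-based ranking ideas mentioned above, since it is effectively the site's link graph.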
