According to Bill Carlson:
> I ran across a logical problem when handling <META name="robots"
> content="noindex"> on a page. The behavior expected is that links on the
> page will be followed and indexed. This works fine on the initial index.
>
> Let's call the page that shouldn't be indexed TOC (Tables Of Contents, a
> typical application) and pages linked to the TOC are the content.
>
> If the only link to a page of the content is on the TOC, later indexing
> will not index that page as the bridging TOC is dropped from the list of
> documents (this assumes any pages linking to the TOC have not been
> modified since the last run and hence are not re-fetched). This causes the
> page to drop from the database, it will only be picked up on the next
> full index and dropped again on the next partial index.
>
> I didn't see that this issue had been discussed before, would this still
> be an issue for 3.2x?

I believe this would be a problem with all 3.1.x and 3.2.0x releases. The
problem is that when a document is marked as "noindex", it gets removed from
db.docdb by htmerge or htpurge, so subsequent update runs of htdig don't
check this file for changes (either in its noindex status, or in the links
it can harvest from it) - it's off htdig's radar entirely at that point.

I'd consider this a bug, but it's not one with an easy fix. Rather than
purging the document completely when tagged as noindex, htmerge/htpurge
should purge all words associated with it, as it does now, as well as
purging any link descriptions for it in db.docdb, but it should keep the
record of this document in db.docdb (possibly flagged somehow) so it doesn't
get forgotten.

I haven't worked out the potential ramifications of this, as far as
possibilities of these records rising from the grave and showing up as false
entries in search results or some such. I think to some extent 3.2 already
deals with these "zombie" records in htsearch, in case htpurge hasn't run
yet, so it shouldn't be too hard to fix this there.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
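
P.S. To make the difference between the two purge strategies concrete, here is a rough toy model in Python. It is only a sketch and not ht://Dig code: DocRecord, purge, urls_to_recheck and search are hypothetical names, and the dictionary merely stands in for db.docdb. It assumes an update run revisits only URLs that still have a record, as described above.

```python
from dataclasses import dataclass, field


@dataclass
class DocRecord:
    """Toy stand-in for one db.docdb entry (not ht://Dig's real record layout)."""
    url: str
    noindex: bool = False
    words: set = field(default_factory=set)


def purge(docdb, keep_stubs):
    """Purge noindex documents from a {url: DocRecord} mapping.

    keep_stubs=False models the behaviour described above: the record is removed.
    keep_stubs=True models the proposed change: words go, a flagged record stays.
    """
    result = {}
    for url, rec in docdb.items():
        if not rec.noindex:
            result[url] = rec
        elif keep_stubs:
            # Keep a flagged stub with no words attached.
            result[url] = DocRecord(url=url, noindex=True)
    return result


def urls_to_recheck(docdb):
    """URLs an update run would revisit: only those still recorded."""
    return sorted(docdb)


def search(docdb, term):
    """Return matching URLs, skipping flagged stubs so they never show up as hits."""
    return sorted(url for url, rec in docdb.items()
                  if not rec.noindex and term in rec.words)


docdb = {
    "toc.html": DocRecord("toc.html", noindex=True, words={"contents"}),
    "chapter1.html": DocRecord("chapter1.html", words={"contents", "intro"}),
}

dropped = purge(docdb, keep_stubs=False)  # TOC forgotten entirely
kept = purge(docdb, keep_stubs=True)      # TOC kept as a flagged stub
```

With keep_stubs=False the TOC vanishes from the set of URLs to recheck, which is how pages linked only from it get lost on a partial index. With keep_stubs=True the TOC stays on the list, and the filter in search is the toy counterpart of the "zombie" check mentioned above.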