According to Bill Carlson:
> I ran across a logical problem when handling <META name="robots" 
> content="noindex"> on a page. The behavior expected is that links on the 
> page will be followed and indexed. This works fine on the initial index.
> 
> Let's call the page that shouldn't be indexed TOC (Tables Of Contents, a
> typical application)  and pages linked to the TOC are the content.
> 
> If the only link to a page of the content is on the TOC, later indexing 
> will not index that page as the bridging TOC is dropped from the list of 
> documents (this assumes any pages linking to the TOC have not been 
> modified since the last run and hence are not re-fetched). This causes the 
> page to drop from the database, it will only be picked up on the next 
> full index and dropped again on the next partial index.
> 
> I didn't see that this issue had been discussed before, would this still 
> be an issue for 3.2x?

I believe this would be a problem with all 3.1.x and 3.2.0x releases.
The problem is that when a document is marked as "noindex", it gets
removed from db.docdb by htmerge or htpurge, so subsequent update runs
of htdig don't check this file for changes (either in its noindex
status, or in the links it can harvest from it) - it's off htdig's radar
entirely at that point.
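
To make the mechanism concrete, here is a toy model in Python. It is only
a sketch of the behaviour described above, not htdig's actual code: the
function names, the dict-based "docdb", and the page names are all made up
for illustration, and it assumes an update run starts only from URLs
already recorded in the database.

```python
# Toy model only: names and data layout are hypothetical, not ht://Dig's code.

def full_dig(site, start):
    """Crawl everything reachable from `start`; return {url: noindex_flag}."""
    docdb, queue = {}, [start]
    while queue:
        url = queue.pop()
        if url in docdb or url not in site:
            continue
        page = site[url]
        docdb[url] = page["noindex"]
        queue.extend(page["links"])   # links are followed even on noindex pages
    return docdb

def purge(docdb):
    """Mimic the purge step: noindex documents are removed outright."""
    return {url: flag for url, flag in docdb.items() if not flag}

def update_dig(site, docdb, modified):
    """Assumed update behaviour: only known, modified URLs are re-parsed."""
    new_db = dict(docdb)
    queue = [url for url in docdb if url in modified]
    seen = set()
    while queue:
        url = queue.pop()
        if url in seen or url not in site:
            continue
        seen.add(url)
        page = site[url]
        new_db[url] = page["noindex"]
        queue.extend(page["links"])
    return new_db

site = {
    "home":  {"links": ["toc"],   "noindex": False},
    "toc":   {"links": ["page1"], "noindex": True},
    "page1": {"links": [],        "noindex": False},
}
db = purge(full_dig(site, "home"))        # "toc" is gone from the database

# The TOC later gains a link to a new page; only the TOC itself changed.
site["toc"]["links"].append("page2")
site["page2"] = {"links": [], "noindex": False}

updated = purge(update_dig(site, db, modified={"toc"}))
rebuilt = purge(full_dig(site, "home"))
```

In this model `updated` never contains "page2", because "toc" is no longer
in the database to be re-checked, while `rebuilt` (a full dig) does find it.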

I'd consider this a bug, but it's not one with an easy fix.  Rather than
purging the document completely when tagged as noindex, htmerge/htpurge
should purge all words associated with it, as it does now, as well as
purging any link descriptions for it in db.docdb, but it should keep
the record of this document in db.docdb (possibly flagged somehow) so it
doesn't get forgotten.  I haven't worked out the potential ramifications
of this, as far as possibilities of these records rising from the
grave and showing up as false entries in search results or some such.
I think to some extent 3.2 already deals with these "zombie" records in
htsearch, in case htpurge hasn't run yet, so it shouldn't be too hard
to fix this there.
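
Roughly what I have in mind, as a Python sketch rather than real htdig
code: the record layout, the "noindex" flag, and the function names below
are assumptions made up for illustration. The purge step keeps a flagged
stub, the search side skips flagged records, and the update run can still
see the stub.

```python
# Hypothetical sketch of the proposal above; not ht://Dig's code or schema.

def purge_keep_stub(docdb, words):
    """Purge words and link descriptions of noindex docs, keep a flagged stub."""
    new_docdb = {}
    for url, rec in docdb.items():
        if rec["noindex"]:
            new_docdb[url] = {"noindex": True, "descriptions": []}
        else:
            new_docdb[url] = rec
    new_words = {}
    for word, urls in words.items():
        kept = {u for u in urls if not new_docdb[u]["noindex"]}
        if kept:
            new_words[word] = kept
    return new_docdb, new_words

def search(word, docdb, words):
    """Skip flagged or missing records so stubs never show up as results."""
    return sorted(u for u in words.get(word, ())
                  if u in docdb and not docdb[u]["noindex"])

def update_candidates(docdb, modified):
    """URLs an update run would re-check; flagged stubs are included."""
    return sorted(u for u in docdb if u in modified)

docdb = {
    "home":  {"noindex": False, "descriptions": ["Home"]},
    "toc":   {"noindex": True,  "descriptions": ["Contents"]},
    "page1": {"noindex": False, "descriptions": ["Page one"]},
}
words = {"contents": {"toc", "home"}, "chapter": {"page1"}}

stale_hits = search("contents", docdb, words)   # before the purge has run
docdb, words = purge_keep_stub(docdb, words)
recheck = update_candidates(docdb, modified={"toc"})
```

The filter in `search` is what keeps a stub from turning up as a false
result, even when the word index still mentions it. Keeping the stub in
`docdb` is what lets `update_candidates` return the TOC for re-checking.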

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
