According to Dan Langille:
> I've found an instance where a document contained in robots.txt is
> included in the final index. Not sure if this is a bug or a feature.
It's a bug, but it's in your script...
...
> $ more rundig.merge
> #!/bin/sh
...
> $BINDIR/htdig -vvv -c ${CONFIGMERGE}
>
> $BINDIR/htmerge -vvv -c ${CONFIG} -m ${CONFIGMERGE}
This is the problem. As I've mentioned many times on this list before,
you can't go straight from htdig to htmerge -m. You need to run htmerge
in the standard way on the database from htdig before running htmerge -m.
What's happening is that when htdig goes to fetch the disallowed document,
it puts a control record in db.wordlist to tell htmerge to purge
that document. If you don't run htmerge in the normal way, that control
record is never processed, so the document isn't purged from
the database before you merge it into the new database. Even worse,
because htmerge -m doesn't expect these control records, it can sometimes
put a junk record into the new wordlist, which may in some cases
cause the wrong document to be purged from the database.
You must insert
$BINDIR/htmerge -vvv -c ${CONFIGMERGE}
in your script after running htdig, and before running htmerge -m, to
properly clean up the CONFIGMERGE database before merging it into your
main one.
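Put together, the sequence would look roughly like the sketch below. This
is only an illustration: the function name run_merge_update is made up
here, and BINDIR, CONFIG and CONFIGMERGE are assumed to be set earlier in
your script, as they are in the part I snipped. The commands and flags are
just the ones from your script, with the extra htmerge pass in the middle.

```shell
#!/bin/sh
# Sketch only: run_merge_update is a hypothetical helper name.
# BINDIR, CONFIG and CONFIGMERGE must already be set by the caller.

run_merge_update() {
    # 1. Dig the new documents into the database described by CONFIGMERGE.
    "$BINDIR/htdig" -vvv -c "$CONFIGMERGE" || return 1

    # 2. Run htmerge the normal way on that same database, so the control
    #    records left by htdig are processed and disallowed documents
    #    are purged.
    "$BINDIR/htmerge" -vvv -c "$CONFIGMERGE" || return 1

    # 3. Only now merge the cleaned-up database into the main one.
    "$BINDIR/htmerge" -vvv -c "$CONFIG" -m "$CONFIGMERGE" || return 1
}
```

Calling run_merge_update after the variables are set runs the three steps
in order and stops at the first one that fails, so a failed dig or cleanup
never gets merged into the main database.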
I think that if I can't easily fix htmerge -m to deal with control
records, I'll have to put a really big warning in the htmerge.html
manual page about this.
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930