George Adams wrote:
> 
> 
> >db.words.db should be generated from scratch from db.wordlist by htmerge.
> >I'm assuming the word is actually in db.wordlist?
> 
> No, actually that's not the case.
> 
> Here is the state after indexing the site while the file containing the
> keyword "dalek" DOES exist:
> 
> % ls -l
> -rw-rw-r--   1 adams    users       83968 Nov 25 10:20 db.docdb
> -rw-rw-r--   1 adams    users        6144 Nov 25 10:20 db.docs.index
> -rw-rw-r--   1 adams    users      109895 Nov 25 10:20 db.wordlist
> -rw-rw-r--   1 adams    users      117760 Nov 25 10:20 db.words.db
> 
> % grep -l "dalek" *
> db.docdb
> db.wordlist
> db.words.db
> 
> Now I remove the file containing the word "dalek" and reindex the site by
> running "rundig".
> 
> % ls -l
> -rw-rw-r--   1 adams    users       83968 Nov 25 10:21 db.docdb
> -rw-rw-r--   1 adams    users        6144 Nov 25 10:21 db.docs.index
> -rw-rw-r--   1 adams    users      109740 Nov 25 10:21 db.wordlist
> -rw-rw-r--   1 adams    users      117760 Nov 25 10:21 db.words.db
> 
> % grep -l "dalek" *
> db.words.db

Yes. In htsearch/words.cc, mergeWords() only opens or creates db.words.db;
an existing file is reused rather than rebuilt, so deleted words aren't
removed, whatever the remove_bad_urls setting is. As a matter of fact,
remove_bad_urls isn't involved here at all. For example:

htdig -i on one URL, index.html, containing foo (one occurrence)
remove foo from index.html
htdig
htmerge

foo isn't removed, and it's not a bad URL!
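
To check for this kind of leftover, something along these lines might
help. It is only a rough sketch mirroring the grep test quoted above:
is_ghost_word is a made-up helper name, not part of ht://Dig, and it does
a plain substring match on the raw files, so "dalek" would also match
"daleks".

```shell
# Hypothetical helper (not part of ht://Dig): succeeds when WORD still
# appears in db.words.db but is no longer present in db.wordlist.
# Usage: is_ghost_word WORD [DB_DIR]   (DB_DIR defaults to the current dir)
is_ghost_word() {
    word="$1"
    db_dir="${2:-.}"
    # -q: no output, exit status only; -F: treat WORD as a fixed string.
    # If the word is not in db.words.db at all, it cannot be a ghost.
    grep -qF -- "$word" "$db_dir/db.words.db" || return 1
    # Ghost only if the freshly written word list no longer contains it.
    ! grep -qF -- "$word" "$db_dir/db.wordlist"
}
```

For instance, after the steps above, `is_ghost_word foo` run in the
database directory should succeed if foo is still stuck in db.words.db.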

This puzzled me badly. I was playing with the locale setting and was
plagued by htsearch finding both réseaux and seaux (the latter was
created when indexing without locale: fr). I got ghost hits without
changing any documents!

I think it's a bug in htdig-3.0.8b2 too.

A dirty hack:
unlink db.words.db first, or use the -a option!
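
The unlink variant could be wrapped roughly like this. It is a sketch
only: remove_stale_words_db is a hypothetical helper name, and the
database directory is whatever your configuration uses.

```shell
# Hypothetical helper (not part of ht://Dig): delete db.words.db from
# DB_DIR so that the next merge has to create it afresh.
# Usage: remove_stale_words_db DB_DIR
remove_stale_words_db() {
    words_db="$1/db.words.db"
    # Nothing to do if the file is already gone; otherwise remove it.
    if [ -f "$words_db" ]; then
        rm -f -- "$words_db"
    fi
}
```

Call it on your database directory just before running htmerge (or add
it to your copy of rundig), so that stale words cannot survive a reindex.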

Didier
