According to Olivier Korn:
> I thought that Gilles's patch was the solution to all my "htmerge -m"
> problems but I was wrong.
> 
> You'll find attached to this mail a set of three conf-files in order to try
> and see the remaining bug.
> 
> * Description of config files :
> site1.conf and site2.conf dig two different web sites, each of them includes
> the third : include.conf.
> 
> * Actions to take in order to "activate" the problem :
> (Note : I'm using the latest patch that Gilles gave to us).
>          (bindir)/htdig -c (confdir)/site1.conf
>          LC_COLLATE=C (bindir)/htmerge -c (confdir)/site1.conf
>          (cgibindir)/htsearch -c (confdir)/site1.conf words=sujet+oral+aca
> 
> At this point, htsearch returns four links, all of which are relevant. The
> first two are the best (especially the second, don't be fooled by the 1 star
> ranking).
> 
>          (bindir)/htdig -c (confdir)/site2.conf
>          LC_COLLATE=C (bindir)/htmerge -c (confdir)/site1.conf -m (confdir)/site2.conf
>          (cgibindir)/htsearch -c (confdir)/site1.conf words=sujet+oral+aca
> 
> Now, htsearch returns only two links (the first and fourth of the previous
> search).
> 
> I don't think htsearch should be losing any links. What am I doing wrong?
> Is it another bug I found?

I don't know if it's so much a bug as a limitation in the merging code.
I've long suspected it, but my inspection of the merging code has now
confirmed it.  You absolutely MUST run htmerge on each separate database
individually before you merge any of them with htmerge -m.  The reason for
this is simple: the merging code expects a cleaned up wordlist, because
it doesn't properly handle the "-", "+" and "!" records in a raw wordlist
from htdig.  When it sees these while merging, it treats them as
malformed word records and passes them through as is, but with a few
"missing" fields added to them.  As a result, any "-" records from
the second database (site2.conf) will delete the corresponding document
IDs from the first database, which is the cause of the lost documents.
(You might be able to get away with not running htmerge on the first
database before merging in the second, but it's absolutely necessary to
run htmerge on the second database before merging it into the first.)
If you run htmerge on each database first, then merge them together,
you'll find that the 4 matching documents are still there.

E.g.:
 (bindir)/htdig -c (confdir)/site1.conf
 LC_COLLATE=C (bindir)/htmerge -c (confdir)/site1.conf
 (cgibindir)/htsearch -c (confdir)/site1.conf words=sujet+oral+aca
 (bindir)/htdig -c (confdir)/site2.conf
 LC_COLLATE=C (bindir)/htmerge -c (confdir)/site2.conf
 LC_COLLATE=C (bindir)/htmerge -c (confdir)/site1.conf -m (confdir)/site2.conf
 (cgibindir)/htsearch -c (confdir)/site1.conf words=sujet+oral+aca
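
If you have more than two databases, the same ordering can be sketched as
a small shell function.  This is only an illustration, not something that
ships with ht://Dig: the function name plan_merge and its arguments are
made up here, and it only prints the commands in the order described above
rather than running them.  It uses nothing beyond the -c and -m options
already shown.

```shell
#!/bin/sh
# plan_merge BINDIR CONFDIR MAIN.conf EXTRA.conf...
# Hypothetical helper (not part of ht://Dig): prints, in order, the
# commands needed to dig and clean every database before merging.
# BINDIR and CONFDIR stand in for (bindir) and (confdir) above.
plan_merge() {
    bindir=$1
    confdir=$2
    main=$3
    shift 3

    # Dig the main database, then clean up its wordlist.
    echo "$bindir/htdig -c $confdir/$main"
    echo "LC_COLLATE=C $bindir/htmerge -c $confdir/$main"

    for extra in "$@"; do
        # Dig each extra database and clean it up on its own FIRST,
        # so no raw "-", "+" or "!" records reach the merging code.
        echo "$bindir/htdig -c $confdir/$extra"
        echo "LC_COLLATE=C $bindir/htmerge -c $confdir/$extra"
        # Only now merge it into the main database.
        echo "LC_COLLATE=C $bindir/htmerge -c $confdir/$main -m $confdir/$extra"
    done
}
```

Calling it as "plan_merge (bindir) (confdir) site1.conf site2.conf", with
your real directories substituted, should list the same five htdig and
htmerge commands as the example above, in the same order.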

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
