Re: [htdig] merging two databases - am I doing this right?

Gilles Detillieux Wed, 20 Nov 2002 09:13:13 -0800

According to Dan Langille:
> On 13 Nov 2002 at 21:48, Gilles Detillieux wrote:
> > According to Dan Langille:
> > > I have indexed a mailing list archive.  My next goal is to nightly
> > > update that index by indexing the entire month's archive and then
> > > merging that into the main database.  At present, there are about 4
> > > years of data.  I'm seeking comments on my approach.
> > 
> > You may also want to have a look at how we do it for the htdig-general
> > and htdig-dev archives:
> > 
> > http://www.htdig.org/files/contrib/scripts/README.geoupdate-ungeoify
> > http://www.htdig.org/files/contrib/scripts/geoupdate.sh
> 
> My results are interesting, and confusing.  The problem is  
> incomplete results.  I will explain as I go along.
> 
> My base working directory is /usr/local/htdig.  The "production" 
> database is:
> 
> [dan@undef:/usr/local/htdig] $ ls -l databases/
> total 214713
> drwxr-xr-x  2 dan  dan       512 Nov 20 09:43 adsl-update
> -rw-r--r--  1 dan  dan  70942720 Nov 10 11:17 db.docdb
> -rw-r--r--  1 dan  dan   1940480 Nov 10 11:17 db.docs.index
> -rw-r--r--  1 dan  dan  80104230 Nov 10 11:17 db.wordlist
> -rw-r--r--  1 dan  dan  66720768 Nov 10 11:17 db.words.db
> 
> I believe that the initial dig and merge are operating correctly.  
> The following files are created:
> 
> [dan@undef:/usr/local/htdig] $ ls -l databases/adsl-update/
> total 817
> -rw-r--r--  1 dan  dan  221184 Nov 20 11:00 db.docdb
> -rw-r--r--  1 dan  dan    9216 Nov 20 11:00 db.docs.index
> -rw-r--r--  1 dan  dan  234216 Nov 20 11:00 db.wordlist
> -rw-r--r--  1 dan  dan  344064 Nov 20 11:00 db.words.db
> 
> It is the next merge which is the cause of the problem.  The command 
> is "/usr/local/bin/htmerge -a -c htdig-unixathome.org-adsl.conf -m 
> adsl-update.conf" issued from /usr/local/htdig.  This results in:
> 
>    htmerge: Unable to open word list file
>    '/usr/local/htdig/databases/db.wordlist.work'
...
> Why is it expecting a .work file on input?


The reason for the -a on the htmerge command is so that the main database
can be updated (in the .work copy), while the actual main database
is still available to and usable by htsearch, in case the merge takes
a while.  Then, once the merge is completed, the resulting files can be
moved and/or copied into place rather quickly.
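
A minimal sketch of that final "move into place" step, assuming the merge
has left db.docdb.work, db.docs.index.work and db.words.db.work in the
databases directory.  The function name promote_work_files is hypothetical
(it is not part of ht://Dig), and the choice of which files to copy versus
move is an assumption based on db.docdb.work having to stay around for the
next run:

```shell
# Hypothetical helper: publish the output of "htmerge -a" to the files
# that htsearch reads.  Takes the databases directory as its argument.
promote_work_files() {
    dbdir=$1

    # Check all three files first, so a failed merge never leaves the
    # live database half-updated.
    for f in db.docdb db.docs.index db.words.db; do
        if [ ! -s "$dbdir/$f.work" ]; then
            echo "missing or empty: $dbdir/$f.work" >&2
            return 1
        fi
    done

    # db.docdb.work is needed again by the next merge, so copy it.
    # Copy to a temporary name and rename, so htsearch never sees a
    # partially written db.docdb.
    cp "$dbdir/db.docdb.work" "$dbdir/db.docdb.new" || return 1
    mv "$dbdir/db.docdb.new" "$dbdir/db.docdb" || return 1

    # The other two files are regenerated by every merge, so move them.
    mv "$dbdir/db.docs.index.work" "$dbdir/db.docs.index" || return 1
    mv "$dbdir/db.words.db.work" "$dbdir/db.words.db" || return 1
}
```

It would be called once htmerge has exited successfully, for example
as "promote_work_files /usr/local/htdig/databases".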

> I note that the geoupdate.sh script leaves behind only db.docdb.work 
> and db.wordlist.work.

Yes, this is because these two files are all that htmerge needs.
The other ones (db.words.db and db.docs.index) are generated by htmerge
from the first two files.  But, because the script leaves these two
files around, it also assumes the two files are there to begin with.
So, the initial htdig and htmerge that created the main database should
have been done with the -a option as well, or the .work files need to
be created manually by copying their non-.work counterparts.

> To supply the file, I do this in the databases subdirectory:
> cp db.wordlist db.wordlist.work
> 
> Running the htmerge then gives incomplete results.  
> Specifically, db.docdb.work and db.docs.index.work are way too 
> small when compared to the original files.
...

That's because you should have copied db.docdb to db.docdb.work before
running htmerge.  Without the existing db.docdb.work, it would seem that
htmerge simply created one as a starting point, so obviously it's going
to be missing a lot of records!

> I also tried this approach
> 
> cd /usr/local/htdig/databases
> cp db.docdb db.docdb.work
> cp db.docs.index db.docs.index.work
> cp db.wordlist db.wordlist.work
> cp db.words.db db.words.db.work
> 
> Then I reran the merge to obtain a better result:
...
> What am I not understanding?

htmerge needs both the db.wordlist and db.docdb.  If you're using -a,
then it needs the .work version of both of these files.  When you use
htmerge -m, it merges the db.wordlist from the 2nd set into the main
one, and merges the db.docdb records from the 2nd set into the first.
After that, htmerge generates db.words.db from the words in db.wordlist
(or the .work files of both of these with the -a option), and generates
the db.docs.index from the db.docdb records (or the .work files).  So,
if you're using -a, you should make sure htmerge has the db.docdb.work
and db.wordlist.work files for the main database.  Also, because htsearch
doesn't need the db.wordlist file, it can remain as a .work file.
So your main databases directory should have the following files before
and after the update script runs:

db.wordlist.work        - needed by htdig -a and/or htmerge -a
db.docdb.work           - needed by htdig -a and/or htmerge -a
db.docdb                - needed by htsearch
db.docs.index           - needed by htsearch
db.words.db             - needed by htsearch
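
As a rough sanity check, an update script could confirm this layout
before and after it runs.  The following is only a sketch: the function
name check_main_db_layout is made up for illustration, and it tests
nothing beyond the presence of the five files listed above.

```shell
# Hypothetical helper: report any of the five expected files that are
# missing from the main databases directory.  Returns non-zero if any
# file is absent.
check_main_db_layout() {
    dbdir=$1
    missing=0
    for f in db.wordlist.work db.docdb.work \
             db.docdb db.docs.index db.words.db; do
        if [ ! -f "$dbdir/$f" ]; then
            echo "missing: $dbdir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, "check_main_db_layout /usr/local/htdig/databases || exit 1"
at the top of the script would stop it before htmerge -a runs against an
incomplete set of files.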


For the first run of the script, if you have just db.docdb and db.wordlist
to begin with, the only commands you'd need are...

cd /usr/local/htdig/databases
cp db.docdb db.docdb.work
mv db.wordlist db.wordlist.work

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


