On 13 Nov 2002 at 21:48, Gilles Detillieux wrote:

> According to Dan Langille:
> > I have indexed a mailing list archive.  My next goal is to nightly
> > update that index by indexing the entire month's archive and then
> > merging that into the main database.  At present, there are about 4
> > years of data.  I'm seeking comments on my approach.
> 
> You may also want to have a look at how we do it for the htdig-general
> and htdig-dev archives:
> 
> http://www.htdig.org/files/contrib/scripts/README.geoupdate-ungeoify
> http://www.htdig.org/files/contrib/scripts/geoupdate.sh

My results are interesting, but confusing: the merged databases come
out incomplete.  I will explain as I go along.

My base working directory is /usr/local/htdig.  The "production" 
databases are:

[dan@undef:/usr/local/htdig] $ ls -l databases/
total 214713
drwxr-xr-x  2 dan  dan       512 Nov 20 09:43 adsl-update
-rw-r--r--  1 dan  dan  70942720 Nov 10 11:17 db.docdb
-rw-r--r--  1 dan  dan   1940480 Nov 10 11:17 db.docs.index
-rw-r--r--  1 dan  dan  80104230 Nov 10 11:17 db.wordlist
-rw-r--r--  1 dan  dan  66720768 Nov 10 11:17 db.words.db

I believe that the initial dig and merge are operating correctly.  
The following files are created:

[dan@undef:/usr/local/htdig] $ ls -l databases/adsl-update/
total 817
-rw-r--r--  1 dan  dan  221184 Nov 20 11:00 db.docdb
-rw-r--r--  1 dan  dan    9216 Nov 20 11:00 db.docs.index
-rw-r--r--  1 dan  dan  234216 Nov 20 11:00 db.wordlist
-rw-r--r--  1 dan  dan  344064 Nov 20 11:00 db.words.db

It is the next merge which is the cause of the problem.  The command 
is "/usr/local/bin/htmerge -a -c htdig-unixathome.org-adsl.conf -m 
adsl-update.conf" issued from /usr/local/htdig.  This results in:

   htmerge: Unable to open word list file
   '/usr/local/htdig/databases/db.wordlist.work'

I suspect that is the result of the -a option on htmerge.  The 
documentation says:

Use alternate work files. Tells htdig to append .work to database 
files, causing a second copy of the database to be built. This allows 
the original files to be used by htsearch during the indexing run.

Why is it expecting a .work file on input?

I note that the geoupdate.sh script leaves behind only db.docdb.work 
and db.wordlist.work.

To supply the file, I do this in the databases subdirectory:
cp db.wordlist db.wordlist.work

Running htmerge then gives incomplete results.  Specifically, 
db.docdb.work and db.docs.index.work are far too small when compared 
to the original files:

[dan@undef:/usr/local/htdig] $ ls -l databases/
total 358226
drwxr-xr-x  3 dan  dan       512 Nov 20 11:02 adsl-update
-rw-r--r--  1 dan  dan  70942720 Nov 10 11:17 db.docdb
-rw-r--r--  1 dan  dan    209920 Nov 20 11:09 db.docdb.work
-rw-r--r--  1 dan  dan   1940480 Nov 10 11:17 db.docs.index
-rw-r--r--  1 dan  dan      9216 Nov 20 11:09 db.docs.index.work
-rw-r--r--  1 dan  dan  80104230 Nov 10 11:17 db.wordlist
-rw-r--r--  1 dan  dan  79987264 Nov 20 11:09 db.wordlist.work
-rw-r--r--  1 dan  dan  66720768 Nov 10 11:17 db.words.db
-rw-r--r--  1 dan  dan  66639872 Nov 20 11:09 db.words.db.work
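
To make the shortfall in the listing above easy to spot after each run, 
a check along these lines could be used.  This is only a sketch: the 
function name and the 50% threshold are my own assumptions, not 
anything htmerge provides.

```shell
# Hypothetical check (not part of ht://Dig): report any .work file that is
# less than half the size of its original. The 50% threshold is arbitrary.
report_small_work_files() {
    dbdir=$1
    for f in db.docdb db.docs.index db.wordlist db.words.db; do
        # Skip pairs where either file is missing.
        [ -f "$dbdir/$f" ] && [ -f "$dbdir/$f.work" ] || continue
        # $(( ... )) strips the padding some wc implementations emit.
        orig=$(( $(wc -c < "$dbdir/$f") ))
        work=$(( $(wc -c < "$dbdir/$f.work") ))
        if [ $((work * 2)) -lt "$orig" ]; then
            echo "$f.work: $work bytes vs $orig bytes"
        fi
    done
}
```

Against the sizes listed above it would flag db.docdb.work and 
db.docs.index.work, but not the two word files.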

I also tried this approach:

cd /usr/local/htdig/databases
cp db.docdb db.docdb.work
cp db.docs.index db.docs.index.work
cp db.wordlist db.wordlist.work
cp db.words.db db.words.db.work

Then I reran the merge to obtain a better result:

[dan@undef:/usr/local/htdig/databases] $ ls -l
total 429961
drwxr-xr-x  3 dan  dan       512 Nov 20 11:02 adsl-update
-rw-r--r--  1 dan  dan  70942720 Nov 10 11:17 db.docdb
-rw-r--r--  1 dan  dan  71108608 Nov 20 11:15 db.docdb.work
-rw-r--r--  1 dan  dan   1940480 Nov 10 11:17 db.docs.index
-rw-r--r--  1 dan  dan   1945600 Nov 20 11:15 db.docs.index.work
-rw-r--r--  1 dan  dan  80104230 Nov 10 11:17 db.wordlist
-rw-r--r--  1 dan  dan  80314085 Nov 20 11:15 db.wordlist.work
-rw-r--r--  1 dan  dan  66720768 Nov 10 11:17 db.words.db
-rw-r--r--  1 dan  dan  66883584 Nov 20 11:15 db.words.db.work
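
For reference, the copy step I did by hand above could be wrapped up 
as follows.  This is a minimal sketch: the function name is made up, 
and it assumes that seeding all four .work files before "htmerge -a" 
is the right thing to do, which is exactly what I am unsure about.

```shell
# Hypothetical helper (not part of ht://Dig): copy each production database
# file to its .work counterpart before running "htmerge -a".
# The four file names are the ones shown in the listings above.
seed_work_files() {
    dbdir=$1
    for f in db.docdb db.docs.index db.wordlist db.words.db; do
        # -p preserves the timestamps and modes of the originals.
        cp -p "$dbdir/$f" "$dbdir/$f.work" || return 1
    done
}
```

It would be called as "seed_work_files /usr/local/htdig/databases" 
just before the htmerge command quoted earlier.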

What am I not understanding?

Thank you.
-- 
Dan Langille : http://www.langille.org/



_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
