According to Dan Langille:
> On 13 Nov 2002 at 21:48, Gilles Detillieux wrote:
> > According to Dan Langille:
> > > I have indexed a mailing list archive.  My next goal is to nightly
> > > update that index by indexing the entire month's archive and then
> > > merging that into the main database.  At present, there are about 4
> > > years of data.  I'm seeking comments on my approach.
> >
> > You may also want to have a look at how we do it for the htdig-general
> > and htdig-dev archives:
> >
> > http://www.htdig.org/files/contrib/scripts/README.geoupdate-ungeoify
> > http://www.htdig.org/files/contrib/scripts/geoupdate.sh
>
> My results are interesting, and confusing.  The problem is
> incomplete results.  I will explain as I go along.
>
> My base working directory is /usr/local/htdig.  The "production"
> databases is:
>
> [dan@undef:/usr/local/htdig] $ ls -l databases/
> total 214713
> drwxr-xr-x  2 dan  dan       512 Nov 20 09:43 adsl-update
> -rw-r--r--  1 dan  dan  70942720 Nov 10 11:17 db.docdb
> -rw-r--r--  1 dan  dan   1940480 Nov 10 11:17 db.docs.index
> -rw-r--r--  1 dan  dan  80104230 Nov 10 11:17 db.wordlist
> -rw-r--r--  1 dan  dan  66720768 Nov 10 11:17 db.words.db
>
> I believe that the initial dig and merge are operating correctly.
> The following files are created:
>
> [dan@undef:/usr/local/htdig] $ ls -l databases/adsl-update/
> total 817
> -rw-r--r--  1 dan  dan  221184 Nov 20 11:00 db.docdb
> -rw-r--r--  1 dan  dan    9216 Nov 20 11:00 db.docs.index
> -rw-r--r--  1 dan  dan  234216 Nov 20 11:00 db.wordlist
> -rw-r--r--  1 dan  dan  344064 Nov 20 11:00 db.words.db
>
> It is the next merge which is the cause of the problem.  The command
> is "/usr/local/bin/htmerge -a -c htdig-unixathome.org-adsl.conf -m
> adsl-update.conf" issued from /usr/local/htdig.  This results in:
>
> htmerge: Unable to open word list file
> '/usr/local/htdig/databases/db.wordlist.work'
...
> Why is it expecting a .work file on input?

The reason for the -a on the htmerge command is so that the main database
can be updated (in the .work copy), while the actual main database is
still available to and usable by htsearch, in case the merge takes a while.
Then, once the merge is completed, the resulting files can be moved and/or
copied into place rather quickly.

> I note that the geoupdate.sh script leaves behind only db.docdb.work
> and db.wordlist.work.

Yes, this is because these two files are all that htmerge needs.  The
other ones (db.words.db and db.docs.index) are generated by htmerge from
the first two files.  But, because the script leaves these two files
around, it also assumes the two files are there to begin with.  So, the
initial htdig and htmerge that created the main database should have been
done with the -a option as well, or the .work files need to be created
manually by copying their non-.work counterparts.

> To supply the file, I do this in the databases subdirectory:
>
> cp db.wordlist db.wordlist.work
>
> Running the htmerge then results in incomplete results.
> Specifically, the db.docdb.work and db.docs.index.work is way too
> small when compared to the original files
...

That's because you should have copied db.docdb to db.docdb.work before
running htmerge.  Without the existing db.docdb.work, it would seem that
htmerge simply created one as a starting point, so obviously it's going
to be missing a lot of records!

> I also tried this approach
>
> cd /usr/local/htdig/databases
> cp db.docdb db.docdb.work
> cp db.docs.index db.docs.index.work
> cp db.wordlist db.wordlist.work
> cp db.words.db db.words.db.work
>
> Then I reran the merge to obtain a better result:
...
> What am I not understanding?

htmerge needs both the db.wordlist and db.docdb.  If you're using -a,
then it needs the .work version of both of these files.  When you use
htmerge -m, it merges the db.wordlist from the 2nd set into the main one,
and merges the db.docdb records from the 2nd set into the first.
After that, htmerge generates db.words.db from the words in db.wordlist,
and generates db.docs.index from the db.docdb records (or from the .work
versions of db.wordlist and db.docdb with the -a option).  So, if you're
using -a, you should make sure htmerge has the db.docdb.work and
db.wordlist.work files for the main database.  Also, because htsearch
doesn't need the db.wordlist file, it can remain as a .work file.  So your
main databases directory should have the following files before and after
the update script runs:

	db.wordlist.work  - needed by htdig -a and/or htmerge -a
	db.docdb.work     - needed by htdig -a and/or htmerge -a
	db.docdb          - needed by htsearch
	db.docs.index     - needed by htsearch
	db.words.db       - needed by htsearch

For the first run of the script, if you have just db.docdb and db.wordlist
to begin with, the only commands you'd need are...

	cd /usr/local/htdig/databases
	cp db.docdb db.docdb.work
	mv db.wordlist db.wordlist.work

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
FAQ: http://htdig.sourceforge.net/FAQ.html
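
As a footnote to the message above: the first-run setup it describes
(making sure db.docdb.work and db.wordlist.work exist before htmerge -a
runs) could be wrapped in a small guard at the top of an update script.
This is only a sketch under stated assumptions.  seed_work_files is a
hypothetical helper name, not part of ht://Dig or geoupdate.sh, and it
copies both files rather than moving db.wordlist, which costs disk space
but leaves the originals untouched.

```shell
#!/bin/sh
# Sketch only: seed_work_files is a hypothetical helper, not an ht://Dig tool.
# It makes sure the two .work files that htmerge -a reads are present in the
# main database directory, creating them from their non-.work counterparts.

seed_work_files() {
    dbdir=$1
    for base in db.docdb db.wordlist; do
        if [ -f "$dbdir/$base.work" ]; then
            continue    # already present, leave it alone
        fi
        if [ ! -f "$dbdir/$base" ]; then
            echo "missing both $dbdir/$base and $dbdir/$base.work" >&2
            return 1
        fi
        # Copy (not move) so the original file stays where it was.
        cp "$dbdir/$base" "$dbdir/$base.work" || return 1
    done
}
```

It would be called before the merge command quoted in the message, for
example as seed_work_files /usr/local/htdig/databases, with the htmerge
-a ... -m ... invocation run only if the helper succeeds.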

