Hi Gilles,
I too am having issues performing merges with htmerge. My scenario is a
little different in that I am creating a master database and then merging 2
other databases into it.
I have followed the threads between you both and have made some progress
but am still having difficulties - part of the issue is I am using a newer
snapshot I think because I don't have the db.wordlist file at all - I am
using version 3.2.0b4 - the files that are created are as follows:
db.docdb
db.docs.index
db.excerpts
db.words.db
db.words.db_weakcmpr
I build the master database with the -a option so the .work files are
created - I have reworked my script to cp the .work files instead of mv them
- however I am not sure which files are required by htmerge - do I need all
the .work files for a successful merge?
Furthermore - I am not clear on what actual commands I need to run - here
is what I'm doing now - am I missing something??
BUILD MASTER DATABASE
rundig -vvv -a -c configfile > htdig.out
cp -p $dbdir/db.docdb.work $dbdir/db.docdb
cp -p $dbdir/db.docs.index.work $dbdir/db.docs.index
cp -p $dbdir/db.excerpts.work $dbdir/db.excerpts
cp -p $dbdir/db.words.db.work $dbdir/db.words.db
cp -p $dbdir/db.words.db.work_weakcmpr $dbdir/db.words.db_weakcmpr
BUILD 1ST MERGE DATABASE
rundig -vvv -a -c configfile2 > htdig2.out
mv $dbdir2/db.docdb.work $dbdir2/db.docdb
mv $dbdir2/db.docs.index.work $dbdir2/db.docs.index
mv $dbdir2/db.excerpts.work $dbdir2/db.excerpts
mv $dbdir2/db.words.db.work $dbdir2/db.words.db
mv $dbdir2/db.words.db.work_weakcmpr $dbdir2/db.words.db_weakcmpr
DO THE MERGE INTO THE MASTER DATABASE
htmerge -a -v -c configfile -m configfile2
COPY THE .WORK FILES BACK TO THE MASTER DB FILES
cp -p $dbdir/db.docdb.work $dbdir/db.docdb
cp -p $dbdir/db.docs.index.work $dbdir/db.docs.index
cp -p $dbdir/db.excerpts.work $dbdir/db.excerpts
cp -p $dbdir/db.words.db.work $dbdir/db.words.db
cp -p $dbdir/db.words.db.work_weakcmpr $dbdir/db.words.db_weakcmpr
BUILD THE 2ND MERGE DATABASE
rundig -vvv -a -c configfile3 > htdig3.out
mv $dbdir3/db.docdb.work $dbdir3/db.docdb
mv $dbdir3/db.docs.index.work $dbdir3/db.docs.index
mv $dbdir3/db.excerpts.work $dbdir3/db.excerpts
mv $dbdir3/db.words.db.work $dbdir3/db.words.db
mv $dbdir3/db.words.db.work_weakcmpr $dbdir3/db.words.db_weakcmpr
DO THE MERGE INTO THE MASTER DATABASE
htmerge -a -v -c configfile -m configfile3
COPY THE .WORK FILES BACK TO THE MASTER DB FILES
cp -p $dbdir/db.docdb.work $dbdir/db.docdb
cp -p $dbdir/db.docs.index.work $dbdir/db.docs.index
cp -p $dbdir/db.excerpts.work $dbdir/db.excerpts
cp -p $dbdir/db.words.db.work $dbdir/db.words.db
cp -p $dbdir/db.words.db.work_weakcmpr $dbdir/db.words.db_weakcmpr
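
The copy-back step that repeats above could be condensed into one helper. This is only a sketch: the function name promote_work_files is made up for illustration, the file names are the ones listed earlier in this message for 3.2.0b4, and skipping a missing weakcmpr file is an assumption rather than documented htmerge behaviour.

```shell
#!/bin/sh
# Hypothetical helper: copy each .work file in a database directory onto
# its live counterpart, keeping the .work copy for the next htmerge -a run.
promote_work_files() {
    dir=$1
    for name in db.docdb db.docs.index db.excerpts db.words.db; do
        if [ -f "$dir/$name.work" ]; then
            cp -p "$dir/$name.work" "$dir/$name" || return 1
        else
            # Stop early rather than leave a half-updated live database.
            echo "missing: $dir/$name.work" >&2
            return 1
        fi
    done
    # The weakcmpr file carries its suffix after ".work"; it is skipped
    # when absent (an assumption, not something the thread confirms).
    if [ -f "$dir/db.words.db.work_weakcmpr" ]; then
        cp -p "$dir/db.words.db.work_weakcmpr" "$dir/db.words.db_weakcmpr" || return 1
    fi
}
```

It would be called as promote_work_files "$dbdir" after each htmerge run, in place of the five cp lines.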
I figured that this would result in a master database containing ALL the
contents of the 2 smaller databases merged in, however I seem to get only the
database for configfile3 successfully merged because when I do a search that
should yield results for configfile2 I get 0 results - yet I get the
appropriate results for configfile3 when I run a search.
I thought using the "rundig" script may be causing issues because it does
some other stuff to the database files, but it seems to merge at least some
of the other databases into the master one.
I'm quite confused - could you help clear things up for me? Thanks!
Cheers,
Jonathan Schlackl
On Wednesday 20 November 2002 09:04, you wrote:
> According to Dan Langille:
> > On 13 Nov 2002 at 21:48, Gilles Detillieux wrote:
> > > According to Dan Langille:
> > > > I have indexed a mailing list archive. My next goal is to nightly
> > > > update that index by indexing the entire month's archive and then
> > > > merging that into the main database. At present, there are about 4
> > > > years of data. I'm seeking comments on my approach.
> > >
> > > You may also want to have a look at how we do it for the htdig-general
> > > and htdig-dev archives:
> > >
> > > http://www.htdig.org/files/contrib/scripts/README.geoupdate-ungeoify
> > > http://www.htdig.org/files/contrib/scripts/geoupdate.sh
> >
> > My results are interesting, and confusing. The problem is
> > incomplete results. I will explain as I go along.
> >
> > My base working directory is /usr/local/htdig. The "production"
> > database is:
> >
> > [dan@undef:/usr/local/htdig] $ ls -l databases/
> > total 214713
> > drwxr-xr-x 2 dan dan 512 Nov 20 09:43 adsl-update
> > -rw-r--r-- 1 dan dan 70942720 Nov 10 11:17 db.docdb
> > -rw-r--r-- 1 dan dan 1940480 Nov 10 11:17 db.docs.index
> > -rw-r--r-- 1 dan dan 80104230 Nov 10 11:17 db.wordlist
> > -rw-r--r-- 1 dan dan 66720768 Nov 10 11:17 db.words.db
> >
> > I believe that the initial dig and merge are operating correctly.
> > The following files are created:
> >
> > [dan@undef:/usr/local/htdig] $ ls -l databases/adsl-update/
> > total 817
> > -rw-r--r-- 1 dan dan 221184 Nov 20 11:00 db.docdb
> > -rw-r--r-- 1 dan dan 9216 Nov 20 11:00 db.docs.index
> > -rw-r--r-- 1 dan dan 234216 Nov 20 11:00 db.wordlist
> > -rw-r--r-- 1 dan dan 344064 Nov 20 11:00 db.words.db
> >
> > It is the next merge which is the cause of the problem. The command
> > is "/usr/local/bin/htmerge -a -c htdig-unixathome.org-adsl.conf -m
> > adsl-update.conf" issued from /usr/local/htdig. This results in:
> >
> > htmerge: Unable to open word list file
> > '/usr/local/htdig/databases/db.wordlist.work'
>
> ...
>
> > Why is it expecting a .work file on input?
>
> The reason for the -a on the htmerge command is so that the main database
> can be updated (in the .work copy), while the actual main database
> is still available to and usable by htsearch, in case the merge takes
> a while. Then, once the merge is completed, the resulting files can be
> moved and/or copied into place rather quickly.
>
> > I note that the geoupdate.sh script leaves behind only db.docdb.work
> > and db.wordlist.work.
>
> Yes, this is because these two files are all that htmerge needs.
> The other ones (db.words.db and db.docs.index) are generated by htmerge
> from the first two files. But, because the script leaves these two
> files around, it also assumes the two files are there to begin with.
> So, the initial htdig and htmerge that created the main database should
> have been done with the -a option as well, or the .work files need to
> be created manually by copying their non-.work counterparts.
>
> > To supply the file, I do this in the databases subdirectory:
> > cp db.wordlist db.wordlist.work
> >
> > Running the htmerge then results in incomplete results.
> > Specifically, the db.docdb.work and db.docs.index.work are way too
> > small when compared to the original files
>
> ...
>
> That's because you should have copied db.docdb to db.docdb.work before
> running htmerge. Without the existing db.docdb.work, it would seem that
> htmerge simply created one as a starting point, so obviously it's going
> to be missing a lot of records!
>
> > I also tried this approach
> >
> > cd /usr/local/htdig/databases
> > cp db.docdb db.docdb.work
> > cp db.docs.index db.docs.index.work
> > cp db.wordlist db.wordlist.work
> > cp db.words.db db.words.db.work
> >
> > Then I reran the merge to obtain a better result:
>
> ...
>
> > What am I not understanding?
>
> htmerge needs both the db.wordlist and db.docdb. If you're using -a,
> then it needs the .work version of both of these files. When you use
> htmerge -m, it merges the db.wordlist from the 2nd set into the main
> one, and merges the db.docdb records from the 2nd set into the first.
> After that, htmerge generates db.words.db from the words in db.wordlist
> (or the .work files of both of these with the -a option), and generates
> the db.docs.index from the db.docdb records (or the .work files). So,
> if you're using -a, you should make sure htmerge has the db.docdb.work
> and db.wordlist.work files for the main database. Also, because htsearch
> doesn't need the db.wordlist file, it can remain as a .work file.
> So your main databases directory should have the following files before
> and after the update script runs:
>
> db.wordlist.work - needed by htdig -a and/or htmerge -a
> db.docdb.work - needed by htdig -a and/or htmerge -a
> db.docdb - needed by htsearch
> db.docs.index - needed by htsearch
> db.words.db - needed by htsearch
>
>
> For the first run of the script, if you have just db.docdb and db.wordlist
> to begin with, the only commands you'd need are...
>
> cd /usr/local/htdig/databases
> cp db.docdb db.docdb.work
> mv db.wordlist db.wordlist.work
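
A minimal sketch of the first-run seeding quoted above, using only the two commands that are shown there. The function name seed_work_files and the skip-if-present guards are assumptions added for illustration; they are not part of the quoted advice.

```shell
#!/bin/sh
# Hypothetical helper: create the .work files that htmerge -a expects in
# the main database directory. The guards are an added assumption so that
# re-running the helper does not overwrite an existing .work file.
seed_work_files() {
    dbdir=$1
    # htsearch still needs db.docdb, so it is copied rather than moved.
    if [ ! -f "$dbdir/db.docdb.work" ]; then
        cp "$dbdir/db.docdb" "$dbdir/db.docdb.work" || return 1
    fi
    # db.wordlist is moved, as in the quoted commands.
    if [ ! -f "$dbdir/db.wordlist.work" ]; then
        mv "$dbdir/db.wordlist" "$dbdir/db.wordlist.work" || return 1
    fi
}
```

For the layout described in the quoted message it would be called as seed_work_files /usr/local/htdig/databases before the first htmerge -a run.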