Re: [htdig] 2 different search..

Gilles Detillieux Thu, 28 Jun 2001 12:01:49 -0700
According to Pietro Palladino:
> At 23:08 on Monday, 25 June 2001, you wrote:
> > According to Pietro Palladino:
> > > Well, I don't know if I really need 2 databases...in the meanwhile I've
> > > indexed  almost 760 documents and more than 450 pages....You can imagine
> > > how slow is htdig now when I try to search something.
> > > This is the reason why I'd like to use 2 different databases...in this
> > > way the searches will be more fast (I suppose).
> >
> > No, I can't imagine.  A total of 760+450 documents is still very small
> > by ht://Dig standards.  Many users index over 10 times as many documents
> > without any problems with searches being too slow.  Either you have a very
> > slow machine, or there's a configuration problem of some sort.  How slow
> > is slow?  How long does htsearch take to do a typical search on your site?
> 
> Too much time, almost 3 minutes!!! I don't know why. However I usually don't 
> start httpd on my RH. I start it only when I need to test htdig.....I noticed 
> that I need to relaunch rundig to have the best performances

OK, let's just make sure we're not confusing indexing time with searching
time.  If it takes rundig or htdig 3 minutes to _index_ all of your
documents, that is nothing unusual.  If it takes htsearch 3 minutes to
_search_ for a word in your database, produced earlier by indexing with
rundig or htdig, then that is unusual.

If you are indeed talking about the second case, then maybe you could
provide more information about what you are searching for.  Is it a
single word, multiple words, or a phrase?  Are the words used frequently
or infrequently in the documents?  Do you get similar results with all
searches, or does search time vary a lot depending on the query and/or
other search options?

> > > 1. Indexing
> > >
> > > Ok, I installed htDig from an RPM file, so I couldn't set variables as
> > > $BINDIR, $DBDIR, etc....anyway, now it works with the default options....
> >
> > These variables determine where things are installed, but you can override
> > all sorts of things in your htdig.conf, including the database_dir,
> > common_dir, and many other attributes.
> >
> 
> 
> Fine!! These are good news....In this way I could put the files I need 
> wherever I want, right?

Yes, pretty much.  The only restriction is that your config files for
htsearch must go in the directory specified by CONFIG_DIR at compile
time.  This is usually /etc/htdig for the RPMs I put together, and /etc
for the RPMs in Red Hat's PowerTools.  The latter is a poor
choice of directory in my opinion, because if you want multiple config
files, they all have to go directly in /etc.  There is a roundabout way
of overriding CONFIG_DIR at run time, using an environment variable, as
described in http://www.htdig.org/FAQ.html#q4.20

> > > so I edit the htdig.conf file and change only some of them:
> > > -external_parsers,
> > > -database_dir,
> > > -start_url,
> > > -search_algorithm--> I kept only "exact:1" 'cause I have neither an
> > > italian dictionary to use with htDig, neither a dictionary with "stop
> > > words"...I've nothing but the default stuff that comes with the htdig
> > > archive :-(
> >
> > See http://www.htdig.org/FAQ.html#q4.10 .  There are many ispell
> > dictionaries available in all sort of languages.  Surely you can find an
> > Italian one. The endings algorithm is a really nice one to have, if you can
> > take a bit of time to configure it for your language.
> >
> 
> 
> :-))) You are great!!! I found an Italian ispell dictionary, but I don't know 
> how to install it...I think that the instructions I found inside the package 
> aren't for htdig, are they?

No, the instructions would be for ispell.  The FAQ 4.10 entry describes
what you need to do.  The .aff file can be copied directly to your
common_dir or lang_dir (if you opt to define the latter).  The italian.0
file must be produced by concatenating, sorting and uniq'ing the individual
word lists.  You pick which ones to include or leave out.
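
As a rough illustration of the concatenate, sort and uniq step, here is a
minimal Python sketch.  The function name build_word_list, the input file
names in the usage note, and the latin-1 encoding are all assumptions made
for the example; check what your dictionary package actually ships.

```python
from pathlib import Path
from typing import Iterable


def build_word_list(sources: Iterable[str], dest: str,
                    encoding: str = "latin-1") -> int:
    """Concatenate word lists, drop duplicates, sort, and write the result.

    Returns the number of unique words written. The latin-1 default is an
    assumption about the dictionary files, not something htdig mandates.
    """
    words = set()
    for source in sources:
        with open(source, encoding=encoding) as handle:
            for line in handle:
                word = line.strip()
                if word:  # skip blank lines
                    words.add(word)
    ordered = sorted(words)  # plain code-point order
    Path(dest).write_text("".join(w + "\n" for w in ordered),
                          encoding=encoding)
    return len(ordered)
```

You would call it with whichever word lists you decided to keep, e.g.
build_word_list(["list1.txt", "list2.txt"], "italian.0"), where the two
input names are placeholders.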

> it for my needs :-).....I'm also trying to understand which databases are 
> created by each program....so, I could understand if there's something that 
> doesn't work fine just looking the databases size.

This depends on the version.  In 3.1.x, htdig creates db.wordlist and
db.docdb, and htmerge does some merging and cleaning up of db.wordlist,
as well as some purging of unused db.docdb records, and then it
creates db.words.db and db.docs.index.  In 3.2, htdig creates db.docdb,
db.docs.excerpts, db.docs.index and db.words.db (which may or may not
also have a _weakcmpr file), and htpurge will prune these down a bit.
There are no real guidelines as to what sizes they should be, but if
they're much, much smaller than your collection of documents, chances
are htdig missed some of them.
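
If you want a quick look at those sizes, something along these lines
would do.  It is only a sketch: the function name database_sizes is made
up, and it assumes the files in your database_dir all share the "db."
prefix used in the names above.

```python
from pathlib import Path


def database_sizes(database_dir: str, prefix: str = "db.") -> dict:
    """Map each database file name in database_dir to its size in bytes.

    Only regular files whose names start with the given prefix are listed;
    the "db." default is an assumption based on the file names above.
    """
    return {
        path.name: path.stat().st_size
        for path in sorted(Path(database_dir).iterdir())
        if path.is_file() and path.name.startswith(prefix)
    }
```

Printing the result after each run of htdig, htmerge or htpurge lets you
compare the numbers against the size of your document collection.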

> > If you want to reindex from scratch by using htdig -i, that's exactly what
> > rundig does.  You can give -s, -v and -a options to rundig as well, and
> > with -a, rundig will do all the file management needed after htmerge or
> > htpurge. Generally, the only need to move away from rundig to a specialised
> > script is when you want to update an existing database rather than indexing
> > from scratch each time.  For a small collection such as yours, indexing
> > from scratch nightly should be feasible.
> 
> 
> Hmmmm, I'm sorry, but I can't understand what "indexing from scratch" 
> means...:-( could you explain better what you were saying? 

When you use the -i option to htdig, it deletes the existing database,
and starts over indexing everything.  From scratch means from nothing.

If you have an existing database, and you don't use -i, htdig will
keep what's there.  It will check all the URLs in the database to see if
they've changed since they were last indexed, and only reindex them if
it has to, or if it encounters links to new documents.

> I've a problem too, though I'm using the -a flag, I can't succeed in 
> searching anything when I launch rundig....where's the mistake? If I use the 
> -a flag, does htsearch automatically manage the search using the .work 
> database?

No, htsearch doesn't have a -a option and never uses the .work files.
If you use -a on htdig, htpurge or htmerge, you must then move or copy
the .work files to the same names without the ".work" suffix, before you
can run htnotify, htfuzzy, or htsearch on them.  The rundig script does
this, as does the http://www.htdig.org/files/contrib/scripts/rundig.sh
script.  The latter one does updates on a copy of the database, rather
than reindexing from scratch each time, as the standard rundig does.
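
For illustration only, the copy step could be sketched in Python as
below.  This is not what rundig itself does internally (rundig is a shell
script); the function name promote_work_files is hypothetical, and the
sketch handles only files whose names end in ".work", so any companion
file named differently would need separate handling.

```python
import shutil
from pathlib import Path


def promote_work_files(database_dir: str, suffix: str = ".work") -> list:
    """Copy every '<name>.work' file in database_dir to '<name>'.

    Returns the sorted list of destination file names. Existing
    destination files are overwritten.
    """
    promoted = []
    for work_file in sorted(Path(database_dir).glob("*" + suffix)):
        if not work_file.is_file():
            continue
        # Strip the trailing ".work" to get the name htsearch expects.
        target = work_file.with_name(work_file.name[: -len(suffix)])
        shutil.copy2(work_file, target)
        promoted.append(target.name)
    return promoted
```

Copying rather than moving keeps the .work files around for the next
update run; use a move instead if you always reindex from scratch.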

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html