Re: [htdig] 2 different search..

Gilles Detillieux Mon, 25 Jun 2001 13:56:17 -0700
According to Pietro Palladino:
> At 00:05 on Thursday, 21 June 2001, you wrote:
> > According to Pietro Palladino:
> > > Problem 1: my site has both html pages and documents (.doc, .rtf, etc.).
> > > Is there a way to index html pages in a database and documents in a
> > > different one?
> > >
> > > Problem 2: I'd like to search a word only in one of the two databases,
> > > how could I implement this kind of search?
> >
> > First of all, are you sure you need 2 databases?  You could do all
> > this using the restrict and exclude input parameters to htsearch,
> > to select which file extensions will be allowed in search results.
> > See http://www.htdig.org/FAQ.html#q4.20
> 
> 
> Well, I don't know if I really need 2 databases... in the meantime I've
> indexed almost 760 documents and more than 450 pages. You can imagine how
> slow htdig is now when I try to search for something.
> This is why I'd like to use 2 different databases: that way
> the searches will be faster (I suppose).

No, I can't imagine.  A total of 760+450 documents is still very small
by ht://Dig standards.  Many users index over 10 times as many documents
without any problems with searches being too slow.  Either you have a very
slow machine, or there's a configuration problem of some sort.  How slow
is slow?  How long does htsearch take to do a typical search on your site?
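
One rough way to put a number on "slow" is to time a request to the
htsearch CGI. The Python sketch below assumes htsearch is reachable at a
hypothetical URL (HTSEARCH_URL) and that the config is named "htdig"; only
the words and config input parameters are used. The figure it prints
includes web server and network overhead, not just htsearch itself.

```python
import time
import urllib.parse
import urllib.request

# Hypothetical CGI location; adjust to wherever htsearch is installed.
HTSEARCH_URL = "http://www.example.com/cgi-bin/htsearch"


def build_search_url(words, config="htdig", base=HTSEARCH_URL):
    """Build an htsearch query URL from the 'config' and 'words' inputs."""
    query = urllib.parse.urlencode({"config": config, "words": words})
    return f"{base}?{query}"


def time_search(url, fetch=None):
    """Return the wall-clock seconds taken by one request to url."""
    if fetch is None:
        # Default: really fetch the page over HTTP.
        fetch = lambda u: urllib.request.urlopen(u).read()
    start = time.perf_counter()
    fetch(url)
    return time.perf_counter() - start


if __name__ == "__main__":
    url = build_search_url("concorsi")
    print(f"{url} took {time_search(url):.2f} s")
```

Running it a few times with different search words gives a more
representative figure than a single request.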

> > If you still want to keep 2 separate databases, it can be a bit tricky.
> > Excluding the .doc, .rtf, etc. from the HTML page DB is easy - just add
> > these to bad_extensions.  Making a DB that excludes the HTML pages is
> > a bit harder, because normally you count on the HTML links to find your
> > way to the other documents.  You'd need to build a list of URLs for the
> > documents you want for this 2nd DB, as shown towards the end of FAQ 5.25.
> > Then, you can select one of 2 config files from the search form, using
> > the config input parameter (see http://www.htdig.org/hts_form.html), where
> > each config file defines the database_dir or database_base for its own DB.
> 
> 
> OK, I'll try... but first I need to understand a few things, because I'm
> a little confused right now. Please follow me through these steps. Let's
> start from the beginning :-) :
> 
> 1. Indexing
> 
> OK, I installed htDig from an RPM file, so I couldn't set variables such as
> $BINDIR, $DBDIR, etc. Anyway, it now works with the default options.

These variables determine where things are installed, but you can override
all sorts of things in your htdig.conf, including the database_dir,
common_dir, and many other attributes.
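
For the two-database setup discussed earlier, each config file would carry
its own database_dir. The Python sketch below writes two minimal config
files to illustrate the idea. Every path, file name and extension list in
it is a hypothetical example, and it assumes the backquote syntax for
reading an attribute value from a file. Real config files would also need
the rest of your settings (external_parsers and so on).

```python
from pathlib import Path

# All paths, file names and extension lists here are hypothetical examples.
HTML_DB = {
    "database_dir": "/var/lib/htdig/html",
    "start_url": "http://www.unina.it/",
    # Keep whatever bad_extensions you already use and append the document types.
    "bad_extensions": ".gif .jpg .zip .doc .rtf",
}
DOCS_DB = {
    "database_dir": "/var/lib/htdig/docs",
    # Backquotes: the value is read from a file holding the list of document URLs.
    "start_url": "`/etc/htdig/doc_urls.txt`",
}


def render_config(attrs):
    """Render attribute/value pairs as 'name: value' config lines."""
    return "".join(f"{name}:\t{value}\n" for name, value in attrs.items())


def write_configs(target_dir):
    """Write html.conf and docs.conf into target_dir and return their paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, attrs in (("html", HTML_DB), ("docs", DOCS_DB)):
        path = target / f"{name}.conf"
        path.write_text(render_config(attrs))
        paths.append(path)
    return paths


if __name__ == "__main__":
    for path in write_configs("conf-sketch"):
        print(path)
```

With files named html.conf and docs.conf in the config directory, a search
form could then pass config=html or config=docs to htsearch.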

> So I edited the htdig.conf file and changed only some of them:
> -external_parsers,
> -database_dir,
> -start_url,
> -search_algorithm --> I kept only "exact:1" because I have neither an Italian
> dictionary to use with htDig nor a dictionary of "stop words". I have
> nothing but the default stuff that comes with the htdig archive :-(

See http://www.htdig.org/FAQ.html#q4.10 .  There are many ispell dictionaries
available in all sorts of languages.  Surely you can find an Italian one.
The endings algorithm is a really nice one to have, if you can take a bit of
time to configure it for your language.

> Well, let's suppose that I don't want to use the rundig script. What are the
> steps I have to follow to index my site?
> OK, I think: "htdig -i -a -s -v"......
> "-i" because I want to erase any previous indexing and rebuild the
> databases;
> "-a" because I want to use the search engine while it is reindexing my site,
> so I reindex the site on a second copy of the databases.
> 
> Is everything right up to this point?

Sure, but so far you're not doing anything you can't do with rundig.  If you
want to reindex from scratch by using htdig -i, that's exactly what rundig
does.  You can give -s, -v and -a options to rundig as well, and with -a,
rundig will do all the file management needed after htmerge or htpurge.
Generally, the only need to move away from rundig to a specialised script
is when you want to update an existing database rather than indexing from
scratch each time.  For a small collection such as yours, indexing from
scratch nightly should be feasible.

> OK, now I need htpurge: "htpurge -a -v"
> Now I have a question: what does it purge?

It purges URLs that were added to the database because of links that htdig
encountered, but that htdig never actually indexed in the end.  There are
numerous reasons for this: bad links, servers down, indexing turned off by
meta tags in documents, etc.

> When I use this program, I have messages like these:
> 
> htpurge: 1040
> htpurge: 1050
> htpurge: 1060
> htpurge: 1070
> Deleted, not found: ID: 813 URL: 
> http://www.unina.it/universit/concorsi/borse_ric/bandi/OLD/OLD/scalim1.doc
> Deleted, not found: ID: 973 URL: 
> http://www.unina.it/universit/concorsi/personaleTA/ortob.doc
> Deleted, not found: ID: 1040 URL: http://www.unina.it/rete/citta/repertori.php
> 
> OK, I think, it didn't find those files so it deleted them from the
> database... but if I run htpurge again, I get the same messages
> again. So what does it purge? Hmm...

That's odd.  It may be a bug.  I don't understand the new 3.2 code enough
right now to give a meaningful response, but it seems to me that if it's
truly deleting records from the database, as it claims, then the error
shouldn't come up again when you rerun htpurge.  It would be different
if you reran htdig -i before rerunning htpurge, in which case it would
reencounter the bad links and put them in the database again for htpurge
to delete again.
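
To check whether the same records really are reported twice, the output of
two consecutive htpurge runs can be compared. The Python sketch below
assumes each run's output was saved to a file and that the "Deleted, not
found: ID: ..." line format matches the excerpt quoted above; other
versions may print something different.

```python
import re
import sys

# Line format assumed from the htpurge output quoted above.
DELETED_RE = re.compile(r"Deleted, not found: ID: (\d+)")


def deleted_ids(log_text):
    """Return the set of document IDs reported as deleted in log_text."""
    return {int(match.group(1)) for match in DELETED_RE.finditer(log_text)}


def repeated_deletions(first_log, second_log):
    """Return the sorted IDs reported as deleted in both runs."""
    return sorted(deleted_ids(first_log) & deleted_ids(second_log))


if __name__ == "__main__":
    # Usage: python compare_purge.py run1.log run2.log
    with open(sys.argv[1]) as first, open(sys.argv[2]) as second:
        repeats = repeated_deletions(first.read(), second.read())
    print("IDs deleted in both runs:", repeats)
```

If htdig was not rerun between the two htpurge runs and the list is not
empty, that points at the records not actually being removed.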

> I also get different messages that don't appear when I run htpurge the
> second time:
> 
> htpurge: Discarding affari
> htpurge: Discarding agenzie
> htpurge: Discarding allegato
> htpurge: Discarding allegato
> htpurge: Discarding allegato
> htpurge: Discarding amministrazioni
> htpurge: Discarding apporre
> htpurge: Discarding area
> htpurge: Discarding arte
> 
> What do they mean?

Most likely these are words that appeared in link description text for
the links to the documents that were purged.  If it purges the documents
from the database, it must also purge any stored words that are tied to
those documents.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
