According to Michael Olds:
> Hello, before I get onto the list I would like to ask if the members think
> that this program is overkill for a small site 10449KB possible to grow to
> twice that.

I use it on a site that's not much bigger than that: less than 500
documents totalling maybe 34 MB.  It can handle much more than that,
but it's fine for small sites too.

> I have one other consideration: about a third of the site is in a dead
> language with diacritical marks (the font was made by myself and follows no
> convention). I would need to be able to do something like Alt+0169 = e for a
> number of characters for an index.

This could pose a problem.  Support for other languages currently
depends on UNIX/Linux "locale" support, so you'd need a locale
definition for this language - at the very least for the character
set you use, so that htdig can know what's a letter and what isn't.
See http://www.htdig.org/FAQ.html#q4.10 for more information.  I don't
know what's involved in making your own locale definition.
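
For what it's worth, here is a rough sketch of the kind of character
folding you describe, done as a preprocessing step outside htdig.  This
is not an htdig feature, and everything in it is an assumption except
the 169 = e pair from your message: the FOLD table, the fold() helper
and the sample bytes are made up for illustration.

```python
# Hypothetical folding table: character code in the custom font -> plain
# letter to index under.  Only 169 -> "e" comes from the message above.
FOLD = {169: "e"}

def fold(text: str) -> str:
    """Replace each mapped character code with its plain-letter equivalent."""
    return text.translate(FOLD)

# Decoding as latin-1 keeps every byte value as the same code point,
# so byte 169 becomes chr(169) and can be looked up in FOLD.
raw = b"d\xa9va"  # made-up sample bytes
print(fold(raw.decode("latin-1")))
```

Whether the folded text would then index cleanly still depends on the
locale question above.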

> I am wondering also, if it might be worth waiting for the next generation,
> which I understand will create the indexes on the fly so as to save disk
> usage. What is the ratio of original material to indexed space used?

I don't think you will save any disk space using the 3.2 betas.
If anything, they may use a bit more.  For a small site like yours, that
shouldn't be a big concern, though, should it?  (My databases are under
13 MB.)  The actual ratio will vary depending on how much data actually
gets kept.  I've got max_head_length set to 50000, so it stores up to the
first 50k of each document for excerpts, on top of the word databases.
I also have a large max_doc_size setting to avoid document truncation.
The savings I get in space stem from the fact that I have some rather
large PDFs which I index, and the amount of indexable text from them is
small compared to the document size (lots of embedded figures).
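
As a back-of-the-envelope check on the ratio, using only the rough
figures from this thread (about 34 MB of documents, databases under
13 MB), something like the following gives an upper bound.  Applying my
ratio to your 10449 KB site is purely an assumption; your content and
settings will give a different number.

```python
# Approximate figures quoted in this message; both are rough upper bounds.
source_mb = 34.0    # "maybe 34 MB" of documents
database_mb = 13.0  # databases are "under 13 MB"

ratio = database_mb / source_mb
print(f"database/source ratio: at most about {ratio:.2f}")

# Assumption: the same ratio holds for the 10449 KB site in the question.
site_kb = 10449
estimate_kb = site_kb * ratio
print(f"rough estimate: {estimate_kb:.0f} KB now, "
      f"{2 * estimate_kb:.0f} KB if the site doubles")
```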

The advantage to indexing on the fly in 3.2 is that your database isn't
out of commission for searching while the update is going on.  This too
may not be a big deal for a small site like yours.  On my site, I use
local_urls to index via the local filesystem rather than by HTTP access,
and htdig & htmerge run in under 5 minutes for a complete reindexing
from scratch.  If I were to schedule update runs rather than building
from scratch, it would be a whole lot quicker still.

The 3.2 betas also have phrase searching, so if that's important to you,
that may be the most compelling reason to use them.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
Information: http://lists.sourceforge.net/lists/listinfo/htdig-general
FAQ: http://htdig.sourceforge.net/FAQ.html
