According to Malka Cymbalista:
> I have been running htdig 3.1.3 for quite some time and decided it
> was time to upgrade so I installed 3.2.0b4.  The compilation went fine
> (I am running on a Solaris 2.7 machine and I had to use GNU make which
> I knew to do from this list - Thanks).
> I am now trying to index.  When running 3.1.3 it took approximately 6
> hours to index the entire site.  Under 3.2 the indexing takes forever.
> I let it run for 14 hours and then stopped it.  I believe it indexed
> well under half of the pages.

Well, 3.1.3 to 3.2.0b4 is a big jump!  It's hard to say why exactly
it's taking so much longer to index, but it does seem surprising that
it would take 5 times as long.  I think 2-3 times as long was more what
others had reported, except in very unusual circumstances (like regex
conflicts on BSDi).

There are a number of changes that have introduced delays in indexing.
Probably the most substantial are the new database formats and direct
word database updates while indexing.  Other changes are more config
attribute lookups and reparsing due to URL blocks, more string copying,
and HTML parser changes (I've found that the parser changes added about
15% to indexing time going from 3.1.5 to 3.1.6, so probably it would
add at least that much from 3.1.3 to 3.2.0b4).  The only way to know for
sure where htdig is spending most of its time on your system would be to
do some profiling.  Profiling on BSDi revealed the aforementioned regex
conflicts, which can be solved with a few simple changes when compiling.
Profiling on your system may reveal other bottlenecks.

> I saw in the FAQ that this is a known problem.  The FAQ suggests
> increasing the wordlist_cache_size attribute.  Where do I set the
> wordlist_cache_size attribute (it is not documented in the list
> of attributes for the configuration file)?  And what is considered
> a large size?  I read in the ChangeLog that the default is 10Meg.
> To what should I increase it?

I think the optimal setting depends a lot on how much RAM you have on
your system, and you pretty much need to figure out the best setting by
trial and error.  As to where to set this attribute, it's the same as any
other config attribute - the only place to set it is in your config file
(htdig.conf by default, or whatever file you specify via -c).
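As a rough sketch, the entry in the config file would look something like
the fragment below.  The figure of 40000000 is purely an illustrative
starting point for trial and error, not a recommendation, and I'm assuming
the value is given in bytes (which matches the 10Meg default mentioned in
the ChangeLog).

```
# htdig.conf (3.2 betas only)
# Hypothetical value for illustration; tune it to the RAM you have.
# Assumed to be in bytes; the default is about 10000000.
wordlist_cache_size: 40000000
```

Change one value at a time and compare the run times of successive digs,
so you can tell whether the larger cache actually helps on your system.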

It is documented, as are other 3.2-specific attributes, but you may
not be looking in the right place.  http://www.htdig.org/attrs.html
documents the latest stable release, i.e. 3.1.5 at this time.
You need to look at the htdoc subdirectory in your source, or
http://www.htdig.org/dev/htdig-3.2/attrs.html.

> Is there anything else I can do to get it to index faster or should
> I simply install 3.1.5?  I index every night and more than 8-9 hours
> is unacceptable.

If you don't need the extra features of the 3.2 betas, I'd suggest trying
the 3.1.6 snapshot, which fixes a number of bugs in 3.1.5.  It will be
slower than 3.1.3, but much more solid, and should still be well within
your time requirements.

You may also want to try some simple tricks that can really speed things
up.  If you reindex from scratch every day (i.e. rundig or htdig -i), maybe
try update digs and only reindex from scratch once a week or once a month.
If you can use local_urls but aren't doing so, you should try it.
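For local_urls, the idea is to map a URL prefix to a directory on the
indexing machine, so htdig can read matching documents straight from the
filesystem instead of fetching them over HTTP.  A minimal sketch follows;
the host name and directory are made-up placeholders, so substitute your
own server's URL and document root.

```
# htdig.conf
# Hypothetical mapping: URLs starting with the left-hand side are read
# from the directory on the right-hand side.
local_urls: http://www.example.com/=/usr/local/apache/htdocs/
```

This only helps if the web server's document tree is readable from the
machine running htdig, and it doesn't apply to dynamically generated pages,
which still have to be fetched through the server.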

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

