I'm running into a problem with htdig's memory usage.  What I'm
trying is somewhat odd in my book, so let me explain first.  I've
modified Document.cc to handle files with a type of "text/plain", but
the problem also occurs without my patches applied (to 3.1.5).

I have ~250 MB of mail, stored one message per file (36000 files),
archived in my homedir.  When I run htdig with the following
configuration, things are fine for a few minutes, and then the htdig
process undergoes an ``algae bloom'' in memory usage: it grows past
600M and keeps going until it has consumed all it can and croaks (I
really only let this happen once; it didn't take long before I brought
out the ulimit).  This occurs even if my patches aren't applied and
the files "aren't found locally".

Is this typical memory usage for this many documents?  I have used
htdig on 15000+ HTML documents spread across 850,000 megs of HTML
with no problems.

My best guess, from smaller test cases, is that it's choking on
encoded attachments and grabbing an enormous number of words from the
stuff that looks like pure garbage [1].  Am I pushing too hard by
trying to get this index processing to fit in under 400 MB of memory?
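
To illustrate the guess above, here is a rough Python sketch (not
htdig code) that measures what fraction of the "words" in a chunk of
text are distinct.  The function names are made up for this example,
and the tokenizer is a plain alphanumeric split, which is only an
assumption about how an indexer might break up base64 data; htdig's
own word rules differ.

```python
import base64
import random
import re


def distinct_word_ratio(text):
    """Fraction of alphanumeric tokens in text that are distinct."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    return len(set(words)) / len(words) if words else 0.0


def fake_attachment(n_bytes, seed=0):
    """Base64 text for n_bytes of pseudo-random data.

    This is only a stand-in for a real encoded attachment.
    """
    rng = random.Random(seed)
    raw = bytes(rng.getrandbits(8) for _ in range(n_bytes))
    return base64.encodebytes(raw).decode("ascii")
```

Comparing distinct_word_ratio() on ordinary message text and on
fake_attachment(30000) should show the difference: prose keeps reusing
the same words, while nearly every token in the base64 data is new, so
a word list built from it keeps growing with the size of the
attachment.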

How should I handle this stuff?  Should I have a patch that sets the
type to mail and only indexes the text/plain stuff from each email?
My goal is to be able to use this with Gnus in emacs with nnir.el [2].
I really do not wish to retreat to freeWAIS-sf.
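
One way to try the "only the text/plain stuff" idea outside of htdig
is to preprocess each message before indexing.  The following is a
minimal sketch using Python's standard email package; the function
name plain_text_parts is hypothetical, and it assumes each file holds
one RFC 822 message with sane charset headers.

```python
import email
from email import policy


def plain_text_parts(raw_bytes):
    """Return the decoded text/plain bodies of one message.

    Multipart containers, non-text parts, and anything explicitly
    marked as an attachment are skipped.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # container only; its children are visited by walk()
        if part.get_content_type() != "text/plain":
            continue  # e.g. base64-encoded binaries, HTML alternatives
        if part.get_content_disposition() == "attachment":
            continue  # text files sent as attachments
        parts.append(part.get_content())
    return parts
```

The result could be written to a parallel directory tree and that tree
indexed instead of the raw mail, at the cost of a second copy of the
message text on disk.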

my configuration:

database_dir:           /home/sprout/tmp/htdig/db
start_url:              `/home/sprout/tmp/htdig/files`
bad_word_list:          /home/sprout/tmp/htdig/bad_words
local_urls:             http://localhost/=/home/sprout/Mail/
local_urls_only:        true
maximum_pages:          1
compression_level:      6
max_hop_count:          0


I build the list of files (e.g. http://localhost/outbox/3424) for the
start_url using find and perl.
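
For anyone wanting to reproduce this, the list can be built roughly as
follows.  This is a Python sketch rather than the find / perl
combination actually used; the function name build_url_list is
hypothetical, and the URL prefix is assumed to match the left-hand
side of local_urls above.

```python
import os
from urllib.parse import quote


def build_url_list(mail_root, url_prefix="http://localhost/"):
    """Map every file under mail_root to a URL under url_prefix."""
    urls = []
    for dirpath, dirnames, filenames in os.walk(mail_root):
        dirnames.sort()  # walk subdirectories in a stable order
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, mail_root)
            # Use URL-style separators and escape unsafe characters.
            urls.append(url_prefix + quote(rel.replace(os.sep, "/")))
    return urls
```

Writing the result of build_url_list("/home/sprout/Mail/") one URL per
line into /home/sprout/tmp/htdig/files gives the file that start_url
pulls in through the backquotes.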

Footnotes: 
[1]  ///////////////////////////////////spcEANyAJBAAA+BK

[2]  ftp://ls6-ftp.cs.uni-dortmund.de/pub/src/emacs/nnir.el

-- 
Chris Green <[EMAIL PROTECTED]>
I've had a perfectly wonderful evening. But this wasn't it.
     -- Groucho Marx

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
