According to Chris Green:
> I'm currently running into a bit of a problem with memory usage of
> htdig. I'm trying something somewhat odd in my book so let me explain
> first. I've modified Document.cc to handle the files with a type of
> "text/plain" but this problem occurs without my patches applied ( to
> 3.1.5 ).
What sort of modifications did you need? htdig 3.1.5 already handles
text/plain files, using the htdig/Plaintext.cc parser.
> I have ~250mb of mail, stored 1 msg per file ( with 36000 files )
> archived in my homedir. When I run htdig with the following
> configuration, things are fine for a few minutes and then the htdig
> process undergoes a ``algae bloom'' in memory usage and consumes 600M+
> of memory until it consumes all it can and croaks ( really only did
> this once, didn't take long before I brought out the ulimit ) . This
> occurs even if my patches aren't applied and the files "aren't found
> locally".
>
> Is this typical memory usage with this large # of documents? I have
> used htdig with html of 15000+ documents spreading across 850,000 megs
> of HTML with no problems.
htdig can handle large numbers of files (36000 isn't too many), but
it does seem to run into memory problems when they're all specified
at once in start_url. You might want to try listing the URLs for these
pages as hrefs in an HTML file, and giving that HTML file as the
start_url. If it still has problems, try breaking the list up into
several smaller files (e.g. 60 files of 600 URLs each). It should
be easy enough to write a script to automate this process.
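
A rough sketch of such a script, in Python. The function name
write_link_pages, the links001.html naming scheme, and the output
directory are all made up for illustration; you would still need to
make the generated pages reachable at URLs you can list in start_url.

```python
import html
from pathlib import Path


def write_link_pages(urls, out_dir, chunk_size=600):
    """Write HTML pages holding at most chunk_size links each.

    Returns the list of page paths that were written. The file naming
    (links001.html, links002.html, ...) is an arbitrary choice.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pages = []
    for start in range(0, len(urls), chunk_size):
        chunk = urls[start:start + chunk_size]
        # One anchor per URL; escape the URL so odd characters in
        # file names cannot break the markup.
        links = "\n".join(
            '<a href="{0}">{0}</a><br>'.format(html.escape(u, quote=True))
            for u in chunk
        )
        page = out / "links{:03d}.html".format(len(pages) + 1)
        page.write_text(
            "<html><body>\n" + links + "\n</body></html>\n",
            encoding="utf-8",
        )
        pages.append(page)
    return pages
```

With 36000 URLs and the default chunk size this would write 60 pages
of 600 links each; a smaller chunk_size just gives more, shorter pages.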
> My best guess with smaller test cases is that it's choking on encoded
> attachments and grabbing an exponential number of words from the stuff
> that looks like pure garbage [1]. Am I pushing too much to try and
> get this index processing to fit into under 400mb of memory?
Yes, attachments could very well pose problems. If these were in HTML
files, you could use noindex_start and noindex_end to remove some sections,
but with text/plain you may be out of luck unless you can patch the plain
text parser to somehow exclude these.
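
Short of patching the parser, one possible workaround is to filter the
messages before htdig ever sees them. The sketch below is only a
heuristic, and it is a preprocessing step rather than a change to
Plaintext.cc: it drops runs of lines that look like base64 output. The
name strip_encoded_blocks, the 60-76 character line length, and the
three-line minimum run are assumptions you would want to tune against
your own mail.

```python
import re

# A line made up only of base64 alphabet characters, 60 to 76 long,
# with optional '=' padding. Encoders typically wrap at 76 columns.
_B64_LINE = re.compile(r"^[A-Za-z0-9+/]{60,76}={0,2}$")


def strip_encoded_blocks(text, min_run=3):
    """Remove runs of at least min_run consecutive base64-looking lines.

    Shorter runs are kept, since a single long token may be real text.
    A short final line of an encoded block may survive this filter.
    """
    kept = []
    run = []
    for line in text.splitlines():
        if _B64_LINE.match(line.strip()):
            run.append(line)
            continue
        # A normal line ends the current run; keep the run only if it
        # was too short to count as an encoded block.
        if len(run) < min_run:
            kept.extend(run)
        run = []
        kept.append(line)
    if len(run) < min_run:
        kept.extend(run)
    return "\n".join(kept)
```

You would run this over a copy of each message and index the filtered
copies, leaving the original archive untouched.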
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930