I am building a simple classifier system, using Lucene
essentially to calculate term frequencies efficiently
and incrementally.
(Due to input variations, I am currently creating a
separate index for each attribute, although I guess I
could (should?) just use a different field for each
attribute.)
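
For concreteness, per-document indexing would look
roughly like this with one field per attribute (just a
sketch; I am assuming the Field.Store/Field.Index style
of constructor, and the attribute names "title" and
"body" are made up):

    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Add one document to a shared index, one field per
    // attribute; terms are indexed (for frequency
    // counting) but not stored.
    void addSample(IndexWriter writer, String title, String body)
            throws IOException {
        Document doc = new Document();
        doc.add(new Field("title", title,
                Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("body", body,
                Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
    }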

Now, one potential problem I have is that although
memory usage is probably sub-linear (I just index
terms, don't store them; the vocabulary grows
sub-linearly), and thus actual memory used should not
grow too fast, the way Lucene builds and merges
segments makes usage fluctuate: I assume memory usage
mostly changes when segments are merged. I have simple
diagnostics for memory usage that force a gc every
1000 documents processed [yes, I know that System.gc()
does not strictly guarantee a collection, but in
practice it is good enough], and I see usage
fluctuating a bit, with an overall increase but
roughly a 10% drop every 12000 documents or so, with
default settings.
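
(For reference, the diagnostic itself is nothing
Lucene-specific -- just Runtime numbers after a forced
gc, along these lines:)

    // Crude memory check, run every 1000 documents;
    // System.gc() is only a hint to the VM.
    long usedMemory() {
        Runtime rt = Runtime.getRuntime();
        rt.gc();
        return rt.totalMemory() - rt.freeMemory();
    }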

So... I am essentially wondering whether there are
good techniques for adaptively tuning memory usage
(minimizing index structure size), to avoid running
out of memory in cases where compacting the index
would prevent it.
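
The kind of thing I have in mind is something like the
following -- though I am not sure these are the right
knobs (setMaxBufferedDocs()/setMergeFactor() are what I
see on IndexWriter, and the thresholds here are
arbitrary):

    // Guess: buffer fewer documents in memory and merge
    // segments sooner when free memory gets low; the
    // 16 MB threshold and the values are arbitrary.
    void tuneWriter(IndexWriter writer) {
        long free = Runtime.getRuntime().freeMemory();
        if (free < 16 * 1024 * 1024) {
            writer.setMaxBufferedDocs(100);
            writer.setMergeFactor(4);
        } else {
            writer.setMaxBufferedDocs(1000);
            writer.setMergeFactor(10); // Lucene default
        }
    }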

Further, is it possible to trade slightly slower
indexing for reduced memory usage? (Or even better,
slower searching -- in my case, I only traverse the
term indexes to get counts.) IndexWriter.optimize()
probably does not really help here, does it?
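
(In case it matters, reading the counts is just a walk
over the term enumeration, collecting document
frequencies -- roughly like this, with the field name
as a parameter:)

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    // Walk the term index for one field, collecting the
    // document frequency of each term.
    Map<String, Integer> termCounts(IndexReader reader, String field)
            throws IOException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        TermEnum terms = reader.terms(new Term(field, ""));
        try {
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(field)) {
                    break;
                }
                counts.put(t.text(), terms.docFreq());
            } while (terms.next());
        } finally {
            terms.close();
        }
        return counts;
    }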

-+ Tatu +-

