I am building a simple classifier system, using Lucene essentially to calculate term frequencies efficiently and incrementally. (Due to input variations, I am currently creating a separate index for each attribute, although I guess I could (should?) just use a different field for each attribute.)
Now, one potential problem is memory. Memory usage is probably sub-linear (I only index terms, without storing them, and the vocabulary grows sub-linearly), so the memory actually used should not grow too fast. However, the way Lucene builds and merges indexes makes usage fluctuate; I assume it mostly changes when segments are merged. I have simple diagnostics for memory usage that force a GC every 1000 documents processed [yes, I know that System.gc() does not strictly guarantee a collection, but in practice it is good enough], and I see usage fluctuating a bit, with an overall increase but a drop of about 10% every 12000 documents or so (with default settings).

So... I am essentially wondering whether there are good techniques for tuning memory usage adaptively (minimizing index structure size), to avoid running out of memory in cases where compacting the index would prevent it. Further, is it possible to trade reduced memory usage for slightly slower indexing (or, even better, slower searching; in my case, I only traverse the term indexes to get counts)? IndexWriter.optimize() probably does not really help here, does it?

-+ Tatu +-
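
P.S. For illustration, the kind of memory diagnostic described above (force a GC every N documents, then read the used heap) might look roughly like the sketch below. The class name MemoryProbe and its methods are hypothetical and are not part of Lucene; only Runtime and System.gc() are real JDK calls, and System.gc() is just a hint to the JVM, so the numbers are approximate.

```java
// Hypothetical helper, not part of Lucene: samples used heap every `interval` documents.
class MemoryProbe {
    private final int interval;      // sample once per this many documents
    private long processed = 0;      // documents seen so far
    private long lastUsedBytes = -1; // most recent sample, -1 if none taken yet

    MemoryProbe(int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.interval = interval;
    }

    /**
     * Call once per indexed document.
     * Returns the used heap in bytes when a sample was taken, otherwise -1.
     */
    long documentProcessed() {
        processed++;
        if (processed % interval != 0) {
            return -1;
        }
        lastUsedBytes = usedHeapAfterGc();
        return lastUsedBytes;
    }

    /** Requests a GC (a hint only, not a guarantee) and returns used heap in bytes. */
    static long usedHeapAfterGc() {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    long getProcessed() {
        return processed;
    }

    long getLastUsedBytes() {
        return lastUsedBytes;
    }
}
```

With an interval of 1000, calling documentProcessed() after each document added to the index and logging every return value other than -1 gives the kind of usage curve discussed above.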