--- Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:

> Hi Tatu,
>
> I take it that simply maintaining the frequencies in a hashmap similar to
> org.apache.lucene.index.memory.AnalyzerUtil.getMostFrequentTerms()
> isn't sufficient for your usecases?
Initially it might, but probably eventually not. I was thinking Lucene formats might also be a bit more compact than vanilla hash maps, but I guess that depends on many factors. And I will probably want to play with actual queries later on, based on frequencies.

> In the latter case, are you using org.apache.lucene.store.RAMDirectory
> or org.apache.lucene.index.memory.MemoryIndex?

I'm using RAMDirectory. Should I perhaps be using MemoryIndex instead? (I'll check it out.)

Thanks!

-+ Tatu +-

> Wolfgang.
>
> On Feb 10, 2006, at 12:29 PM, Tatu Saloranta wrote:
>
> > I am building a simple classifier system, using Lucene essentially to
> > efficiently and incrementally calculate term frequencies. (Due to input
> > variations, I am currently creating a separate index for each attribute,
> > although I guess I could (should?) just use a different field for each
> > attribute.)
> >
> > Now, one potential problem I have is that although memory usage is
> > probably sub-linear (I just index terms, don't store; the vocabulary
> > grows sub-linearly), and thus actual memory used should not grow too
> > fast, the way Lucene builds and merges indexes fluctuates: I assume
> > memory usage mostly changes when merging segments. I have simple
> > diagnostics for memory usage that force gc every 1000 documents
> > processed [yes, I know that System.gc() does not strictly guarantee it,
> > but in practice it is good enough], and notice usage fluctuating a bit,
> > with an overall increase, but a roughly 10% drop every 12000 documents
> > or so, with default settings.
> >
> > So... I am essentially wondering if there are good techniques for
> > tuning memory usage (minimizing index structure size) adaptively, to
> > avoid running out of memory, in cases where compacting the index would
> > avoid the out-of-memory case.
> >
> > Further, are there possibilities to perhaps trade reduced memory usage
> > for slightly slower indexing?
> > (Or, even better, searching -- in my case, I only traverse term indexes
> > to get counts.) IndexWriter.optimize() probably does not really help
> > here, does it?
> >
> > -+ Tatu +-

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
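For reference, the HashMap approach Wolfgang mentions can be sketched roughly as below. This is a hypothetical standalone class (not part of Lucene); a real version would reuse the same Analyzer used at index time rather than naive whitespace splitting.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: maintaining term frequencies in a plain HashMap,
// as an alternative to keeping them in a Lucene index.
public class TermFrequencyCounter {

    private final Map<String, Integer> frequencies = new HashMap<String, Integer>();

    // Split on whitespace and bump a counter per term. Real tokenization
    // should go through the same Analyzer used for indexing.
    public void addDocument(String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (term.length() == 0) {
                continue;
            }
            Integer count = frequencies.get(term);
            frequencies.put(term, count == null ? 1 : count + 1);
        }
    }

    // Frequency of a term across all documents added so far (0 if unseen).
    public int frequencyOf(String term) {
        Integer count = frequencies.get(term);
        return count == null ? 0 : count;
    }
}
```

A map like this is compact and has predictable memory growth, but it gives up the query capabilities Tatu wants to experiment with later.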