Hi Vassil,

See comments below...
On Wed, Dec 9, 2009 at 8:46 AM, Vassil Dichev <[email protected]> wrote:
> Markus,
>
> First of all, a note about the KB/message statistics: this is only
> valid as long as you get messages from the cache! Currently the cache
> size is set to 10,000, so you will see a drop in memory usage for
> message numbers which exceed this size.

Yes, I expected this. But I think you agree that for high performance with
lots of users we need to take care that we can cache as much as possible.

> Processing messages would also necessarily become slower.
>
> The simplest strategies for the stemmer would be:
> 1. Move the stemmer to the companion object
> 2. Create a new stemmer every time it's needed
>
> By doing a naive test with 100,000 invocations of stem for the same
> stemmer and creating 100,000 stemmer objects, it seems that
> instantiation takes almost double the time. So I'm not sure contention
> would be much of an issue; besides, the only time a stemmer is needed
> is for search and the word frequency cloud. These are not specific to
> a particular message, so they can be (and should be) moved to the
> companion object, too.

Yes, that makes a lot of sense. Is the stemming currently done within the
thread that updates the UI? Stemming could then be batched (update the word
frequency cloud only every n messages). I would rather avoid creating a new
stemmer each time (see the sketch in the P.S. below).

> Furthermore, search is done in a Compass transaction anyway.

I've also seen that Lucene has some potential issues with finalizers, e.g.
it uses large finalizable objects (IndexWriter, IIRC). Is the index updated
for each message? I think it would also make sense to batch those updates
if possible (second sketch below).

> We could also have some type of pooling, but I'm not sure how
> efficient it would be. This definitely needs some benchmarks before we
> try to optimize too much.
>
> What do you think?

Yes. It's impossible to decide which tradeoffs to make as long as we don't
have an ESME instance running with enough active users (and with detailed
enough performance monitoring enabled). For now I would therefore go with
the easiest possible implementation: KISS!

Regards,
Markus
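P.S. A rough sketch of what sharing a single stemmer via the companion
object could look like. The names (MsgParser, PorterStemmer, stem) are
placeholders for whatever we actually have in ESME; the synchronized block
is only needed if the stemmer keeps mutable state between calls:

    // One shared stemmer instance instead of one per message/request.
    object MsgParser {
      private val stemmer = new PorterStemmer

      // Guard the call in case the stemmer is stateful and not thread-safe.
      def stem(word: String): String = stemmer.synchronized {
        stemmer.stem(word)
      }
    }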

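P.P.S. And a similarly rough sketch of batching the index updates. It
assumes we talk to a plain Lucene IndexWriter; Compass may want to manage
commits inside its own transactions, so take it only as an illustration of
the idea (commit once per batch instead of once per message):

    import org.apache.lucene.document.{Document, Field}
    import org.apache.lucene.index.IndexWriter

    // Buffers documents and commits only every batchSize additions.
    class BatchingIndexer(writer: IndexWriter, batchSize: Int) {
      private var pending = 0

      def index(msgText: String): Unit = synchronized {
        val doc = new Document
        doc.add(new Field("text", msgText, Field.Store.YES, Field.Index.ANALYZED))
        writer.addDocument(doc)
        pending += 1
        if (pending >= batchSize) {
          writer.commit()  // one commit per batch, not per message
          pending = 0
        }
      }
    }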