Hello,

Very quick comments.

----- Original Message ----
> From: Justus Pendleton <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Sunday, November 2, 2008 10:42:52 PM
> Subject: Performance of never optimizing
>
> Howdy,
>
> I have a couple of questions regarding some Lucene benchmarking and what the
> results mean[3]. (Skip to the numbered list at the end if you don't want to
> read the lengthy exegesis :)
>
> I'm a developer for JIRA[1]. We are currently trying to get a better
> understanding of Lucene, and our use of it, to cope with the needs of our
> larger customers. These "large" indexes are only a couple hundred thousand
> documents but our problem is compounded by the fact that they have a
> relatively high rate of modification (=delete+insert of new document) and our
> users expect these modifications to show up in query results pretty much
> instantly.

This will be a tough call with large indices - there is no real-time search in Lucene yet.

> Our current default behaviour is a merge factor of 4. We perform an
> optimization on the index every 4000 additions. We also perform an optimize
> at midnight.

I wouldn't optimize every 4000 additions - you are killing IO, rewriting the whole index, while trying to provide fast searches, plus you are locking the index for other modifications.

> Our fundamental problem is that these optimizations are locking the index for
> unacceptably long periods of time, something that we want to resolve for our
> next major release, hopefully without undermining search performance too
> badly.

Why are you optimizing? Trying to make the search faster? I would try to avoid optimizing during high usage periods.

> In the Lucene javadoc there is a comment, and a link to a mailing list
> discussion[2], that suggests applications such as JIRA should never perform
> optimize but should instead set their merge factor very low.

Right, you can let Lucene merge segments.

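To give a feel for what the merge factor trades off, here is a toy model, not Lucene's actual merge policy. It assumes every added document starts as its own one-document segment and that whenever mergeFactor segments of the same size exist they are merged into one. Real Lucene buffers documents in RAM before flushing and merges by size ranges, so treat the numbers as illustrative only. The class and method names (MergeFactorSketch, simulate) are made up for this sketch.

```java
final class MergeFactorSketch {

    /** Outcome of one simulated indexing run. */
    static final class Result {
        final long docsRewritten; // total documents copied by all merges
        final int peakSegments;   // most segments alive once merges settle

        Result(long docsRewritten, int peakSegments) {
            this.docsRewritten = docsRewritten;
            this.peakSegments = peakSegments;
        }
    }

    /**
     * Adds `docs` documents one at a time. A segment at level k holds
     * mergeFactor^k documents; mergeFactor segments at one level are
     * merged into a single segment at the next level.
     */
    static Result simulate(int mergeFactor, int docs) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        int[] segmentsAtLevel = new int[40]; // ample for any int doc count
        long rewritten = 0;
        int live = 0;
        int peak = 0;

        for (int i = 0; i < docs; i++) {
            segmentsAtLevel[0]++;            // new one-document segment
            live++;
            long segmentSize = 1;
            int level = 0;
            // Cascade merges upward while a level is full.
            while (segmentsAtLevel[level] == mergeFactor) {
                segmentsAtLevel[level] = 0;
                segmentSize *= mergeFactor;  // size of the merged segment
                rewritten += segmentSize;    // every doc in it is copied
                segmentsAtLevel[level + 1]++;
                live -= mergeFactor - 1;     // mergeFactor segments become 1
                level++;
            }
            peak = Math.max(peak, live);
        }
        return new Result(rewritten, peak);
    }

    public static void main(String[] args) {
        for (int mergeFactor : new int[] {2, 4, 10}) {
            Result r = simulate(mergeFactor, 100_000);
            System.out.println("mergeFactor=" + mergeFactor
                    + " docsRewritten=" + r.docsRewritten
                    + " peakSegments=" + r.peakSegments);
        }
    }
}
```

In this model a lower merge factor keeps fewer segments around for searches to visit, but it copies documents more often while indexing. That is the trade-off to keep in mind when comparing 2 against 4.
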
> In an attempt to understand the impact of a) lowering the merge factor from 4
> to 2 and b) never, ever optimizing on an index (over the course of years and
> millions of additions/updates) I wanted to try to benchmark Lucene.

One thing that you might not have tried is the constant re-opening of the IndexReader, which you'll need to do if you want to see index changes instantly.

> I used the contrib/benchmark framework and wrote a small algorithm that adds
> documents to an index (using the Reuters doc generator), does a search, does
> an optimize, then does another search. All the pretty pictures can be seen at:
> http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs

So you indexed once and then measured search performance? Or did you measure indexing performance? I can't quite tell from your email. And in one case you optimized before searching and in the other you did not optimize?

> I have several questions, hopefully they aren't overwhelming in their
> quantity :-/
>
> 1. Why does the merge factor of 4 appear to be faster than the merge factor
> of 2?

Faster for indexing or searching? If indexing, then it's because 4 means fewer segment merges than 2. If searching, then I don't know, unless you had indexing and searching happening in parallel, which then means less IO for 4. Did your index fit in RAM, by the way?

> 2. Why does non-optimized searching appear to be faster than optimized
> searching once the index hits ~500,000 documents?

Not sure without seeing the index/machine. It sounds like you were measuring search performance while at the same time increasing the index size by incrementally adding more docs?

> 3. There appears to be a fairly sizable performance drop across the board
> around 450,000 documents. Why is that?

Something to do with Lucene merging index segments around that point? At this point I'm assuming you were measuring search speed while indexing.

> 4. Searching performance appears to decrease towards a fairly pessimistic 20
> searches per second (for a relatively simple search). Is this really what we
> should expect long-term from Lucene?

20 reqs/sec sounds very low. How large is your index, how much RAM do you have, and what is your heap size? What were your queries like? Random? Taken from a log?

> 5. Does my benchmark even make sense? I am far from an expert on benchmarking
> so it is possible I'm not measuring what I think I am measuring.

I'm confused by what exactly you did and measured, but it could just be that I'm tired.

> Thanks in advance for any insight you can provide. This is an area that we
> very much want to understand better as Lucene is a key part of JIRA's success,
>
> [1]: http://www.atlassian.com
> [2]: http://www.gossamer-threads.com/lists/lucene/java-dev/47895
> [3]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch