Hello,

 
Very quick comments.


----- Original Message ----
> From: Justus Pendleton <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Sunday, November 2, 2008 10:42:52 PM
> Subject: Performance of never optimizing
> 
> Howdy,
> 
> I have a couple of questions regarding some Lucene benchmarking and what the
> results mean[3]. (Skip to the numbered list at the end if you don't want to
> read the lengthy exegesis :)
> 
> I'm a developer for JIRA[1]. We are currently trying to get a better
> understanding of Lucene, and our use of it, to cope with the needs of our
> larger customers. These "large" indexes are only a couple hundred thousand
> documents but our problem is compounded by the fact that they have a
> relatively high rate of modification (=delete+insert of new document) and our
> users expect these modifications to show up in query results pretty much
> instantly.


That will be tough with large indices: Lucene has no real-time search yet, so
changes only become visible once a new IndexReader is opened.

> Our current default behaviour is a merge factor of 4. We perform an
> optimization on the index every 4000 additions. We also perform an optimize
> at midnight. Our


I wouldn't optimize every 4000 additions. Each optimize rewrites the whole
index, so you are saturating IO while trying to serve fast searches, and you
are locking the index against other modifications.
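
To put a rough number on that, here is a back-of-the-envelope sketch. It
assumes an index that stays at about 200,000 documents (my reading of "a couple
hundred thousand") and that every optimize rewrites every document once. The
class and method names are made up for illustration; nothing here is Lucene API.

```java
/** Rough cost model for periodic optimize on an index of steady size. */
final class OptimizeCost {
    /** Documents rewritten by optimize for each document added. */
    static double rewritesPerAddition(long indexDocs, long additionsPerOptimize) {
        return (double) indexDocs / additionsPerOptimize;
    }

    /** Total documents rewritten by optimize over a number of additions. */
    static long totalRewritten(long indexDocs, long additionsPerOptimize, long totalAdditions) {
        long optimizes = totalAdditions / additionsPerOptimize; // completed optimize runs
        return optimizes * indexDocs;                           // each run rewrites the whole index
    }
}
```

Under those assumptions, rewritesPerAddition(200000, 4000) comes to 50: every
modified document costs about 50 document rewrites in optimize alone, before
counting the normal segment merges.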

> fundamental problem is that these optimizations are locking the index for 
> unacceptably long periods of time, something that we want to resolve for our 
> next major release, hopefully without undermining search performance too 
> badly.


Why are you optimizing?  Trying to make the search faster?  I would try to 
avoid optimizing during high usage periods.

> In the Lucene javadoc there is a comment, and a link to a mailing list 
> discussion[2], that suggests applications such as JIRA should never perform 
> optimize but should instead set their merge factor very low.


Right, you can skip optimize and let Lucene's normal segment merging, driven by
the merge factor, keep the number of segments down.

> In an attempt to understand the impact of a) lowering the merge factor from 4
> to 2 and b) never, ever optimizing on an index (over the course of years and
> millions of additions/updates) I wanted to try to benchmark Lucene.


One thing that you might not have tried is the constant re-opening of the 
IndexReader, which you'll need to do if you want to see index changes instantly.
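
A minimal sketch of that reopen-on-change pattern, in case it helps to add it
to the benchmark. The two callbacks (indexVersion, openReader) are hypothetical
stand-ins, not Lucene API; in Lucene itself IndexReader.reopen() plays the role
of the opener. Closing the old reader is deliberately left out.

```java
import java.util.function.LongFunction;
import java.util.function.LongSupplier;

/** Hands out a cached reader and replaces it only when the index version changes. */
final class ReaderRefresher<R> {
    private final LongSupplier indexVersion;  // hypothetical: current version of the index
    private final LongFunction<R> openReader; // hypothetical: opens a reader for that version
    private long openedVersion;
    private R reader;

    ReaderRefresher(LongSupplier indexVersion, LongFunction<R> openReader) {
        this.indexVersion = indexVersion;
        this.openReader = openReader;
        this.openedVersion = indexVersion.getAsLong();
        this.reader = openReader.apply(openedVersion);
    }

    /** Call before each search; reopens only if the index changed since the last open. */
    synchronized R get() {
        long latest = indexVersion.getAsLong();
        if (latest != openedVersion) {
            // A real implementation would also close the old reader once
            // in-flight searches on it have finished.
            reader = openReader.apply(latest);
            openedVersion = latest;
        }
        return reader;
    }
}
```

With a high modification rate nearly every search will trigger a reopen, so
the cost of the reopen itself belongs in the measurement.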

> I used the contrib/benchmark framework and wrote a small algorithm that adds
> documents to an index (using the Reuters doc generator), does a search, does
> an optimize, then does another search. All the pretty pictures can be seen at:


So you indexed once and then measured search performance?  Or did you measure 
indexing performance?  I can't quite tell from your email.
And in one case you optimized before searching and in the other you did not 
optimize?

>   http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs
> 
> I have several questions, hopefully they aren't overwhelming in their
> quantity :-/
> 
> 1. Why does the merge factor of 4 appear to be faster than the merge factor
> of 2?


Faster for indexing or searching?  If indexing, then it's because 4 means fewer
segment merges than 2.  If searching, then I don't know, unless you had
indexing and searching happening in parallel, in which case a merge factor of 4
does less merge IO to compete with the searches.
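
The "fewer merges" point can be checked with a toy model. This is a simplified
sketch, not Lucene's actual merge policy: it assumes every flush produces one
equal-sized segment and that as soon as mergeFactor segments sit on the same
level they are merged into one segment on the next level. All names are
invented for the example.

```java
/** Toy model of level-based segment merging; not Lucene's real merge policy. */
final class MergeFactorSim {
    /** Returns {merges performed, flush-units rewritten by merges, segments left}. */
    static long[] simulate(int mergeFactor, int flushes) {
        int[] segmentsAtLevel = new int[64];
        long merges = 0;
        long rewritten = 0;
        for (int f = 0; f < flushes; f++) {
            int level = 0;
            long size = 1;              // a freshly flushed segment counts as one unit
            segmentsAtLevel[0]++;
            // Cascade: a full level is merged into one segment on the next level.
            while (segmentsAtLevel[level] == mergeFactor) {
                segmentsAtLevel[level] = 0;
                size *= mergeFactor;    // size of the merged segment
                merges++;
                rewritten += size;      // a merge rewrites everything it merges
                level++;
                segmentsAtLevel[level]++;
            }
        }
        int segments = 0;
        for (int c : segmentsAtLevel) {
            segments += c;
        }
        return new long[] { merges, rewritten, segments };
    }
}
```

In this model 16 flushes cost 15 merges and 64 units of rewriting with a merge
factor of 2, against 5 merges and 32 units with a merge factor of 4. The price
is more segments to search between merges: after 15 flushes the model leaves 4
segments with factor 2 and 6 with factor 4.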

Did your index fit in RAM, by the way?

> 2. Why does non-optimized searching appear to be faster than optimized
> searching once the index hits ~500,000 documents?


Not sure without seeing the index/machine.
It sounds like you were measuring search performance while at the same time 
increasing the index size by incrementally adding more docs?

> 3. There appears to be a fairly sizable performance drop across the board
> around 450,000 documents. Why is that?

Possibly Lucene was merging index segments around that size?  I'm assuming
here that you were measuring search speed while indexing.


> 4. Searching performance appears to decrease towards a fairly pessimistic 20 
> searches per second (for a relatively simple search). Is this really what we 
> should expect long-term from Lucene?


20 reqs/sec sounds very low.  How large is your index, how much RAM does the
machine have, and what is the JVM heap size?
What were your queries like: random, or taken from a query log?

> 5. Does my benchmark even make sense? I am far from an expert on benchmarking
> so it is possible I'm not measuring what I think I am measuring.


I'm confused by what exactly you did and measured, but it could just be that 
I'm tired.

> Thanks in advance for any insight you can provide. This is an area that we
> very much want to understand better as Lucene is a key part of JIRA's success,

>
> [1]: http://www.atlassian.com
> [2]: http://www.gossamer-threads.com/lists/lucene/java-dev/47895
> [3]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

