[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848475#action_12848475
 ] 

Michael Busch commented on LUCENE-2329:
---------------------------------------

I did some performance experiments:

I indexed 1M wikipedia documents using the cheap WhiteSpaceAnalyzer, no cfs 
files, disabled any merging,  RAM buffer size = 200MB, single writer thread, 
TermVectors enabled.

h4. Results with -Xmx2000m:

|| || Write performance || Gain || Number of segments ||
| trunk | 833 docs/sec |  |  41 |
| trunk + parallel arrays | 869 docs/sec | {color:green} + 4.3% {color} | 32|


h4. Results with -Xmx256m:

|| || Write performance || Gain || Number of segments ||
| trunk | 467 docs/sec |  | 41 |  
| trunk + parallel arrays | 871 docs/sec | {color:green} +86.5% {color} | 32|

So I think these results are interesting and roughly as expected.  4.3% is a 
nice small performance gain.
But running the tests with a low heap shows how much cheaper the garbage 
collection becomes.  Setting IW's RAM buffer to 200MB and the overall heap to 
256MB forces the gc to run frequently.  The mark times are much more costly if 
we have all long-living PostingList objects in memory compared to parallel 
arrays.

So this is probably not a huge deal for "normal" indexing.  But once we can 
search on the RAM buffer it becomes much more attractive to fill up the RAM as 
much as you can.  And exactly in that case we safe a lot with this improvement. 

Also note that the number of segments decreased by 22% (from 41 to 32).  This 
shows that the parallel-array approach needs less RAM, thus flushes less often 
and will cause less segment merges in the long run.  So a longer test with 
actual segment merges would show even bigger gains (with both big and small 
heaps).

So overall, I'm very happy with these results!


> Use parallel arrays instead of PostingList objects
> --------------------------------------------------
>
>                 Key: LUCENE-2329
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2329
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be a int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then we placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling, will remove the overhead of object 
> initialization and garbage collection.  Especially garbage collection should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a big number of 
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need of having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  Though this improvement we can make with a separate jira issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Reply via email to