[jira] Commented: (LUCENE-2205) Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[] and create a more memory efficient data structure.

Aaron McCurry (JIRA) Wed, 13 Jan 2010 12:44:23 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799923#action_12799923
 ]


Aaron McCurry commented on LUCENE-2205:
---------------------------------------

Well, to be honest, I spent a lot of time making the Unicode UTF-8/16/32 
compare work, and work faster than the default implementation of 
TermInfosReader.  I thought the same thing, but it didn't seem to work faster.  
I think that now that I have a working version as a baseline, I will go back 
and try some different things in the term.text compare.

As for "Another benefit doing this with flex is you can also change the 
index file format, ie write the vints to disk (so "build" is done at index 
time, not reader startup time), so the init time would be even faster."

I actually have that implemented in our production system to give us an 
"instant on" capability when our huge indexes have to be reloaded.  But I 
thought I would start simple for my contribution.  :)

> Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and 
> the index pointer long[] and create a more memory efficient data structure.
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2205
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2205
>             Project: Lucene - Java
>          Issue Type: Improvement
>         Environment: Java5
>            Reporter: Aaron McCurry
>         Attachments: patch-final.txt, RandomAccessTest.java, rawoutput.txt
>
>
> Basically, the patch packs those three arrays into a single byte array, with 
> an int array serving as an index of offsets into it.  
> The performance benefits are staggering.  On my test index (6.2 GB, with 
> ~1,000,000 documents and ~175,000,000 terms), the memory needed to load the 
> terminfos was reduced to 17% of its original size, from 291.5 MB to 49.7 MB.  
> Random access speed improved by 1-2%, segment load time is ~40% faster, and 
> full GCs on my JVM became 7 times faster.
> I have already done the work and am offering this code as a patch.  
> Currently all tests in the trunk pass with this new code enabled.  I also 
> wrote a system property switch so that the original implementation can still 
> be used:
> -Dorg.apache.lucene.index.TermInfosReader=default or small
> I have also written a blog post about this patch; here is the link:
> http://www.nearinfinity.com/blogs/aaron_mccurry/my_first_lucene_patch.html
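
To illustrate the idea in the description above, here is a minimal sketch of 
packing per-term data into one byte[] with an int[] of entry offsets.  It is 
not the code from the attached patch: the class name PackedTermIndex, its 
methods, and the entry layout (vint term length, UTF-8 term bytes, vint 
docFreq, vint index pointer) are assumptions made for illustration, and a real 
TermInfo carries more fields than docFreq.  It also uses StandardCharsets, so 
it needs Java 7 or later rather than the Java5 environment listed on the issue.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a packed term index; not the LUCENE-2205 patch. */
final class PackedTermIndex {
    private final byte[] data;   // all entries stored back to back
    private final int[] offsets; // offsets[i] = start of entry i in data

    private PackedTermIndex(byte[] data, int[] offsets) {
        this.data = data;
        this.offsets = offsets;
    }

    /** Packs three parallel arrays into one byte[] plus an int[] of offsets. */
    static PackedTermIndex build(String[] terms, int[] docFreqs, long[] indexPointers) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int[] offsets = new int[terms.length];
        for (int i = 0; i < terms.length; i++) {
            offsets[i] = out.size();
            byte[] text = terms[i].getBytes(StandardCharsets.UTF_8);
            writeVLong(out, text.length);
            out.write(text, 0, text.length);
            writeVLong(out, docFreqs[i]);
            writeVLong(out, indexPointers[i]);
        }
        return new PackedTermIndex(out.toByteArray(), offsets);
    }

    int size() { return offsets.length; }

    String term(int i) {
        int[] pos = { offsets[i] };
        int len = (int) readVLong(pos);
        return new String(data, pos[0], len, StandardCharsets.UTF_8);
    }

    int docFreq(int i) {
        int[] pos = { offsets[i] };
        pos[0] += (int) readVLong(pos); // skip the term bytes
        return (int) readVLong(pos);
    }

    long indexPointer(int i) {
        int[] pos = { offsets[i] };
        pos[0] += (int) readVLong(pos); // skip the term bytes
        readVLong(pos);                 // skip docFreq
        return readVLong(pos);
    }

    /** Binary search; assumes terms were supplied in String.compareTo order. */
    int indexOf(String term) {
        int lo = 0, hi = offsets.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = term(mid).compareTo(term);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return mid;
        }
        return -1;
    }

    // Variable-length encoding of a non-negative long: 7 bits per byte,
    // high bit set on every byte except the last.
    private static void writeVLong(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    // Reads one variable-length value at pos[0] and advances pos[0] past it.
    private long readVLong(int[] pos) {
        long value = 0;
        int shift = 0;
        byte b;
        do {
            b = data[pos[0]++];
            value |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
```

With this layout the reader holds two arrays instead of one object per term, 
which is the kind of change that would cut both heap usage and GC work.  This 
sketch decodes a String for every comparison in indexOf; comparing the packed 
bytes directly would avoid that allocation.
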

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
