[ 
https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847024#action_12847024
 ] 

Michael Busch commented on LUCENE-2329:
---------------------------------------

bq. This issue is just about how IndexWriter's RAM buffer stores its terms... 

Actually, when I talked about the TermVectors I meant we should explore 
storing the termIDs on *disk*, rather than the strings.  It would help things 
like similarity search and facet counting.
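
To illustrate why termIDs would help here, a minimal sketch follows.  It is 
not Lucene code: the class TermIdVectorSketch and its layout (sorted termIDs 
plus a parallel freq array per document) are assumptions made up for this 
example.  With segment-wide termIDs, two documents can be compared with a 
linear merge over ints, and counts can be accumulated directly into an array 
indexed by termID, with no term string lookups.

```java
/** Hypothetical per-document term vector that stores segment-wide termIDs. */
final class TermIdVectorSketch {
  final int[] termIDs; // assumed sorted ascending and unique within the document
  final int[] freqs;   // freqs[i] is the frequency of termIDs[i] in the document

  TermIdVectorSketch(int[] termIDs, int[] freqs) {
    if (termIDs.length != freqs.length) {
      throw new IllegalArgumentException("termIDs and freqs must have equal length");
    }
    this.termIDs = termIDs;
    this.freqs = freqs;
  }

  /** Dot product of two term vectors via a linear merge; no string comparisons. */
  static long dotProduct(TermIdVectorSketch a, TermIdVectorSketch b) {
    long sum = 0;
    int i = 0, j = 0;
    while (i < a.termIDs.length && j < b.termIDs.length) {
      int x = a.termIDs[i], y = b.termIDs[j];
      if (x == y) {
        sum += (long) a.freqs[i] * b.freqs[j];
        i++;
        j++;
      } else if (x < y) {
        i++;
      } else {
        j++;
      }
    }
    return sum;
  }

  /** Facet-style counting: add this document's freqs into counts[termID]. */
  static void accumulate(TermIdVectorSketch v, int[] counts) {
    for (int i = 0; i < v.termIDs.length; i++) {
      counts[v.termIDs[i]] += v.freqs[i];
    }
  }

  public static void main(String[] args) {
    TermIdVectorSketch d1 = new TermIdVectorSketch(new int[] {1, 4, 7}, new int[] {2, 1, 3});
    TermIdVectorSketch d2 = new TermIdVectorSketch(new int[] {4, 7, 9}, new int[] {5, 2, 1});
    System.out.println("dot product: " + dotProduct(d1, d2));
  }
}
```

The counts array must be sized to the number of unique terms in the segment; 
that is an assumption of the sketch, not something the issue specifies.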

{quote}
But, note that term vectors today do not store the term char[] again - they 
piggyback on the term char[] already stored for the postings.
{quote}

Yeah I think I'm familiar with that part (secondary entry point in 
TermsHashPerField, hashes based on termStart).  I haven't looked much into how 
the rest of the TermVector in-memory data structures work.  

{quote}
Though, I believe they store "int textStart" (increments by term length per 
unique term), which is less compact than the termID would be (increments +1 per 
unique term)
{quote}

Actually we wouldn't need a second hashtable for the secondary TermsHash 
anymore, right?  Like the primary TermsHash, it would just have parallel arrays 
holding the things that the TermVectorsTermsWriter.PostingList class currently 
contains (freq, lastOffset, lastPosition)?  And the index into those arrays 
would be the termID of course.

This would be a nice simplification, because non-primary TermsHashes would 
have no hash collisions to handle and no hash table to resize based on load 
factor, etc.?
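
A minimal sketch of that idea, under stated assumptions: the class name 
TermVectorParallelArraysSketch, the method addOccurrence and the growth policy 
are hypothetical, and only the three fields (freq, lastOffset, lastPosition) 
come from the discussion above.  The termID is assumed to be handed over by 
the primary TermsHash, so the secondary structure does no hashing at all.

```java
import java.util.Arrays;

/** Hypothetical per-field term vector state, indexed directly by termID. */
final class TermVectorParallelArraysSketch {
  int[] freqs = new int[8];
  int[] lastOffsets = new int[8];
  int[] lastPositions = new int[8];

  /** Grows all three arrays together so they stay parallel. */
  private void ensureCapacity(int termID) {
    if (termID >= freqs.length) {
      int newLength = Math.max(termID + 1, freqs.length * 2);
      freqs = Arrays.copyOf(freqs, newLength);
      lastOffsets = Arrays.copyOf(lastOffsets, newLength);
      lastPositions = Arrays.copyOf(lastPositions, newLength);
    }
  }

  /**
   * Records one occurrence of a term. The termID is assumed to come from the
   * primary TermsHash; there is no lookup, collision handling or rehashing here.
   */
  void addOccurrence(int termID, int position, int endOffset) {
    ensureCapacity(termID);
    freqs[termID]++;
    lastPositions[termID] = position;
    lastOffsets[termID] = endOffset;
  }

  public static void main(String[] args) {
    TermVectorParallelArraysSketch tv = new TermVectorParallelArraysSketch();
    tv.addOccurrence(0, 0, 5);
    tv.addOccurrence(3, 1, 10);
    tv.addOccurrence(0, 2, 15);
    System.out.println("freq of termID 0: " + tv.freqs[0]);
  }
}
```

Growing by doubling is just one possible policy; the only property the sketch 
relies on is that the arrays are resized together.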

bq.  so if eg we someday use packed ints we'd be more RAM efficient by storing 
termIDs...

How does the read performance of packed ints compare to "normal" int[] arrays?  
I think nowadays RAM is less of an issue?  And with a searchable RAM buffer we 
might want to sacrifice a bit more RAM for higher search performance?  Oh man, 
will we need flexible indexing for the in-memory index too? :) 
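
For reference, a minimal sketch of what a packed-int read involves compared to 
a plain int[] load.  This is not Lucene's implementation: PackedIntsSketch and 
its long[] layout are assumptions for illustration, and it only handles 
non-negative values that fit in bitsPerValue bits.  Each get() needs a 
multiply, shifts and a mask, and sometimes a second array read when a value 
straddles two longs; whether that cost matters in practice would have to be 
measured.

```java
/** Hypothetical fixed-width packed int array backed by a long[]. */
final class PackedIntsSketch {
  private final long[] blocks;
  private final int bitsPerValue;
  private final long mask;

  PackedIntsSketch(int valueCount, int bitsPerValue) {
    if (bitsPerValue < 1 || bitsPerValue > 32) {
      throw new IllegalArgumentException("bitsPerValue must be in [1, 32]");
    }
    this.bitsPerValue = bitsPerValue;
    this.mask = (1L << bitsPerValue) - 1;
    long totalBits = (long) valueCount * bitsPerValue;
    this.blocks = new long[(int) ((totalBits + 63) >>> 6)];
  }

  /** Stores the low bitsPerValue bits of value at the given index. */
  void set(int index, int value) {
    long v = value & mask;
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    blocks[block] = (blocks[block] & ~(mask << shift)) | (v << shift);
    int spill = shift + bitsPerValue - 64; // bits that overflow into the next long
    if (spill > 0) {
      int lowBits = bitsPerValue - spill;  // bits that fit in the current long
      blocks[block + 1] = (blocks[block + 1] & ~(mask >>> lowBits)) | (v >>> lowBits);
    }
  }

  /** Reads a value: shift and mask, plus a second read if it spans two longs. */
  int get(int index) {
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    long v = blocks[block] >>> shift;
    int spill = shift + bitsPerValue - 64;
    if (spill > 0) {
      v |= blocks[block + 1] << (bitsPerValue - spill);
    }
    return (int) (v & mask);
  }

  public static void main(String[] args) {
    PackedIntsSketch packed = new PackedIntsSketch(100, 13);
    packed.set(42, 4711);
    System.out.println("value at 42: " + packed.get(42));
  }
}
```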

> Use parallel arrays instead of PostingList objects
> --------------------------------------------------
>
>                 Key: LUCENE-2329
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2329
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling and will remove the overhead of object 
> initialization and garbage collection.  Especially garbage collection should 
> benefit significantly when the JVM runs out of memory, because in such a 
> situation the gc mark times can get very long if there is a big number of 
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need of having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  We can make this improvement in a separate JIRA issue, though.


