[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845778#action_12845778 ]

Michael McCandless commented on LUCENE-2312:
--------------------------------------------

{quote}
The prototype I'm experimenting with has a fixed length postings format for the 
in-memory representation (in TermsHash). Basically every posting has 4 bytes, 
so I can use int[] arrays (instead of the byte[] pools). The first 3 bytes are 
used for an absolute docID (not delta-encoded). This limits the max in-memory 
segment size to 2^24 docs. The 1 remaining byte is used for the position. With 
a max doc length of 140 characters you can fit every possible position in a 
byte - what a luxury!  If a term occurs multiple times in the same doc, then 
the TermDocs just skips multiple occurrences with the same docID and increments 
the freq. Again, the same term doesn't occur often in super short docs.

The int[] slices also don't have forward pointers, like in Lucene's TermsHash, 
but backwards pointers. In real-time search you often want a strongly 
time-biased ranking. A PostingList object has a pointer that points to the last 
posting (this statement is not 100% correct for visibility reasons across 
threads, but we can imagine it this way for now). A TermDocs can now traverse 
the postinglists in opposite order. Skipping can be done by following pointers 
to previous slices directly, or by binary search within a slice.
{quote}
This sounds nice!
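
If I'm reading the layout right, each in-memory posting packs into a single int, something like this sketch (class and method names are made up, not code from the prototype):

{code:java}
// Sketch of the fixed-width in-memory posting described above:
// upper 3 bytes = absolute docID, lowest byte = position.
final class PackedPosting {
  static final int MAX_DOC = (1 << 24) - 1;  // 2^24 docs per in-memory segment
  static final int MAX_POS = (1 << 8) - 1;   // positions 0..255

  static int pack(int docID, int position) {
    assert docID <= MAX_DOC && position <= MAX_POS;
    // The packed int may be negative for large docIDs; the unsigned
    // shift below recovers the docID bits regardless.
    return (docID << 8) | position;
  }

  static int docID(int posting) {
    return posting >>> 8;      // upper 24 bits
  }

  static int position(int posting) {
    return posting & 0xFF;     // lowest byte
  }
}
{code}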

This would be a custom indexing chain for docs guaranteed not to be over 255 
positions in length, right?
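
And to check my understanding of the reverse traversal described above: a TermDocs over such postings would walk newest-to-oldest and fold adjacent postings that share a docID into the freq, roughly like this (a flat int[] instead of linked slices to keep the sketch short; the method name is hypothetical):

{code:java}
// Sketch only: walk packed postings from newest (end) to oldest (start),
// collapsing consecutive postings with the same docID into a freq.
static void visitNewestFirst(int[] postings, int count) {
  int i = count - 1;
  while (i >= 0) {
    final int doc = postings[i] >>> 8;   // upper 24 bits = absolute docID
    int freq = 0;
    // Occurrences of the term within one doc are adjacent, so step back
    // while the docID stays the same and count them as the freq.
    while (i >= 0 && (postings[i] >>> 8) == doc) {
      freq++;
      i--;
    }
    System.out.println("doc=" + doc + " freq=" + freq);
  }
}
{code}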

> Search on IndexWriter's RAM Buffer
> ----------------------------------
>
>                 Key: LUCENE-2312
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2312
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>    Affects Versions: 3.0.1
>            Reporter: Jason Rutherglen
>            Assignee: Michael Busch
>             Fix For: 3.1
>
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
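
For the terms dictionary point above, the simplest (and costly) baseline would be to snapshot and sort the RAM buffer's term hash before each query; a rough sketch of that baseline, with hypothetical names, just to show where the sort cost lands:

{code:java}
import java.util.Arrays;
import java.util.Map;

// Sketch: snapshot the unsorted in-memory term hash and sort it so a query
// can seek terms in order. All names here are hypothetical, not IW's API.
final class SortedTermsSnapshot {
  private final String[] sortedTerms;

  SortedTermsSnapshot(Map<String, Integer> termHash) {
    sortedTerms = termHash.keySet().toArray(new String[0]);
    Arrays.sort(sortedTerms);   // the per-query sort cost being discussed
  }

  /** Seeks the first term >= the given term, TermEnum-style. */
  int seek(String term) {
    int idx = Arrays.binarySearch(sortedTerms, term);
    return idx >= 0 ? idx : -(idx + 1);
  }
}
{code}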

