[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845778#action_12845778 ]
Michael McCandless commented on LUCENE-2312:
--------------------------------------------

{quote}
The prototype I'm experimenting with has a fixed length postings format for the in-memory representation (in TermsHash). Basically every posting has 4 bytes, so I can use int[] arrays (instead of the byte[] pools). The first 3 bytes are used for an absolute docID (not delta-encoded). This limits the max in-memory segment size to 2^24 docs. The 1 remaining byte is used for the position. With a max doc length of 140 characters you can fit every possible position in a byte - what a luxury! If a term occurs multiple times in the same doc, then the TermDocs just skips multiple occurrences with the same docID and increments the freq. Again, the same term doesn't occur often in super short docs.

The int[] slices also don't have forward pointers, like in Lucene's TermsHash, but backwards pointers. In real-time search you often want a strongly time-biased ranking. A PostingList object has a pointer that points to the last posting (this statement is not 100% correct for visibility reasons across threads, but we can imagine it this way for now). A TermDocs can now traverse the postinglists in opposite order. Skipping can be done by following pointers to previous slices directly, or by binary search within a slice.
{quote}

This sounds nice!

This would be a custom indexing chain for docs guaranteed not to be over 255 positions in length, right?

> Search on IndexWriter's RAM Buffer
> ----------------------------------
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Affects Versions: 3.0.1
> Reporter: Jason Rutherglen
> Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable.
>
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing.
>
> Michael Busch has good suggestions regarding how to handle deletes using max
> doc ids.
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
>
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here:
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
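
For readers following the quoted comment, here is a rough sketch of what a fixed-length, 4-byte posting could look like: a 24-bit absolute docID and an 8-bit position packed into one int, read back newest-first with repeated docIDs folded into a freq. It is not the prototype's code and not a Lucene API. The class name PackedPostingSketch, its methods, and the choice to put the docID in the high three bytes are all assumptions made for illustration, and a single flat int[] stands in for the slices with backward pointers.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; none of these names exist in Lucene. */
final class PackedPostingSketch {
    static final int MAX_DOC_ID = (1 << 24) - 1; // 3 bytes for the absolute docID
    static final int MAX_POSITION = 0xFF;        // 1 byte for the position

    /** Packs docID into the high 3 bytes and position into the low byte (assumed layout). */
    static int pack(int docId, int position) {
        if (docId < 0 || docId > MAX_DOC_ID) {
            throw new IllegalArgumentException("docID does not fit in 24 bits: " + docId);
        }
        if (position < 0 || position > MAX_POSITION) {
            throw new IllegalArgumentException("position does not fit in 8 bits: " + position);
        }
        return (docId << 8) | position;
    }

    /** Unsigned shift so docIDs at or above 2^23 (sign bit set) decode correctly. */
    static int docId(int posting) {
        return posting >>> 8;
    }

    static int position(int posting) {
        return posting & 0xFF;
    }

    /**
     * Walks the first {@code count} postings from last to first and returns
     * {docID, freq} pairs, counting consecutive postings with the same docID
     * as one document with a higher freq.
     */
    static List<int[]> docsNewestFirst(int[] postings, int count) {
        List<int[]> docs = new ArrayList<>();
        int i = count - 1;
        while (i >= 0) {
            int doc = docId(postings[i]);
            int freq = 0;
            while (i >= 0 && docId(postings[i]) == doc) {
                freq++;
                i--;
            }
            docs.add(new int[] {doc, freq});
        }
        return docs;
    }

    public static void main(String[] args) {
        // One term's postings in indexing order: doc 1 twice, then doc 3 once.
        int[] postings = {pack(1, 0), pack(1, 5), pack(3, 2)};
        for (int[] d : docsNewestFirst(postings, postings.length)) {
            System.out.println("doc=" + d[0] + " freq=" + d[1]);
        }
    }
}
```

With the docID in the high bits, comparing two postings as unsigned ints orders them by docID first, which is one way a binary search within a slice could work. Whether the prototype does it that way is not stated in the comment.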