[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845059#action_12845059 ]
Michael McCandless commented on LUCENE-2312:
--------------------------------------------

bq. IW flush could become thread dependent

Right, we want this -- different RAM segments should be flushed at different times. This gives us better concurrency since IO/CPU resource consumption will now be more interleaved. While one RAM segment is flushing, the others are still indexing.

{quote}
A new term will first check the hash table for existence (as currently), if it's not in the term hash table only then will it be added to the btree (btw, binary search is O(log N) on average?) This way we're avoiding the somewhat costlier btree existence check per token.
{quote}

Yes, we could have btree on-the-side but still use hash for mapping (vs using btree alone). Hash will be faster lookups... btree could be created/updated on demand first time something needs to .next() through the TermsEnum.

{quote}
The algorithm for flushing doc writers based on RAM consumption can simply be, on exceed, flush the doc writer consuming the most RAM
{quote}

Sounds good :) The challenge will be balancing things... eg if during the time 1 RAM segment is flushed, the others are able to consume more RAM that was freed up by flushing this one RAM segment, you've got a problem... or maybe at that point you go and flush the next one now using the most RAM, so it'd self balance with time.

This will mean the RAM usage is able to flare up above the high water mark...

{quote}
I gutted the PerThread classes, then realized, it's all too intertwined. I'd rather get something working, than spend an excessive amount of time rearranging code that already works.
{quote}

For starters I would keep the *PerThread, but create multiple DWs? Ie, removing the PerThread layer doesn't have to happen at first.

Or we could do the nuclear option -- make a new indexing chain.
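
A minimal sketch of the "on exceed, flush the doc writer consuming the most RAM" policy discussed above may make the self-balancing behaviour easier to picture. The names `RamSegment` and `FlushLargestPolicy` are hypothetical and are not Lucene classes. The flush here is synchronous and single-threaded, so the sketch does not model other writers consuming RAM while a flush is in progress, which is exactly the case where usage can flare above the high water mark.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one doc writer and the RAM segment it is filling.
final class RamSegment {
    final String name;
    long bytesUsed;

    RamSegment(String name) {
        this.name = name;
    }
}

// Hypothetical policy: whenever total RAM exceeds the budget, flush the largest segment.
final class FlushLargestPolicy {
    private final long maxBytes; // assumed to be non-negative
    private final List<RamSegment> segments = new ArrayList<>();

    FlushLargestPolicy(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    RamSegment addSegment(String name) {
        RamSegment s = new RamSegment(name);
        segments.add(s);
        return s;
    }

    long totalBytes() {
        long total = 0;
        for (RamSegment s : segments) {
            total += s.bytesUsed;
        }
        return total;
    }

    // Record newly buffered bytes, then flush largest-first until back under budget.
    // Returns the names of the flushed segments, in flush order.
    List<String> addBytes(RamSegment segment, long bytes) {
        segment.bytesUsed += bytes;
        List<String> flushed = new ArrayList<>();
        while (totalBytes() > maxBytes) {
            RamSegment largest = segments.get(0);
            for (RamSegment s : segments) {
                if (s.bytesUsed > largest.bytesUsed) {
                    largest = s;
                }
            }
            largest.bytesUsed = 0; // a real flush would write this segment to the Directory
            flushed.add(largest.name);
        }
        return flushed;
    }
}
```

With a budget of 100 bytes, buffering 60 bytes into segment "a" and then 50 into segment "b" pushes the total to 110, so "a" (the largest) is flushed and "b" keeps indexing. The loop also covers the "flush the next one now using the most RAM" case when a single flush does not free enough.
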
> Search on IndexWriter's RAM Buffer
> ----------------------------------
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Affects Versions: 3.0.1
> Reporter: Jason Rutherglen
> Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable.
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing.
> Michael Busch has good suggestions regarding how to handle deletes using max
> doc ids.
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here:
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
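
To illustrate the terms dictionary idea (keep the hash table for lookups during indexing, and build a sorted structure only when something needs to enumerate terms in order), here is a rough sketch. `LazySortedTerms` is a hypothetical class, not part of Lucene. A `java.util.TreeSet` (a red-black tree) stands in for the btree, and the sorted view is rebuilt from scratch after new terms arrive rather than updated incrementally, which is a simplification.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical in-RAM terms dictionary: hash for lookups, sorted view built on demand.
final class LazySortedTerms {
    private final Map<String, Integer> termToId = new HashMap<>();
    private TreeSet<String> sorted; // null until first ordered enumeration, or after a new term

    // Indexing path: a hash lookup only. A new term just invalidates the sorted view.
    int addTerm(String term) {
        Integer id = termToId.get(term);
        if (id == null) {
            id = termToId.size();
            termToId.put(term, id);
            sorted = null;
        }
        return id;
    }

    // Search path: pay the sorting cost only when terms must be walked in order.
    Iterator<String> termsInOrder() {
        if (sorted == null) {
            sorted = new TreeSet<>(termToId.keySet());
        }
        return sorted.iterator();
    }
}
```

Repeated tokens never touch the tree, so the per-token cost during indexing stays a single hash probe. The trade-off in this simplified form is that the first ordered enumeration after a batch of new terms pays for a full rebuild.
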