[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537000#comment-13537000 ]
Michael McCandless commented on LUCENE-4258:
--------------------------------------------

{quote}
After rethinking the point-of-inversion issue, seems like the right time to do it is ASAP - not to hold the added fields and invert them later, but rather invert them immediately and save their inverted version. 3 reasons for that:
1. Take out the constraint I inserted into the API, so update fields can be reused and contain Reader/TokenStream,
2. NRT support: we cannot search until we invert, and if we invert earlier NRT support will be less complicated, probably some variation on multi-reader to view uncommitted updates,
3. You are correct that we currently do not account for the RAM usage of the FieldsUpdate, since I thought using RAMUsageEstimator would be too costly. It will probably be more efficient to calculate RAM usage of the inverted fields, maybe even during inversion?
{quote}

+1

I would also add "4. Inversion of updates is single-threaded", ie once we move inversion into .updateFields it will be multi-threaded again.

bq. So my question in that regard is how can I invert a document and hold its inverted form to be used by NRT and later inserted into stacked segment? Should I create a temporary Directory and invert into it? Is there another way to do this?

I think we should somehow re-use the existing code that inverts (eg FreqProxTermsWriter)? Ie, invert into an in-RAM segment, with "temporary" docIDs, and then when it's time to apply the updates, you need to rewrite the postings to disk with the re-mapped docIDs.

I wouldn't do anything special for NRT for starters, meaning, from NRT's standpoint, it opens these stacked segments from disk as it would if a new non-NRT reader was being opened. So I would leave that TODO in SegmentReader as a TODO for now :) Later, we can optimize this and have updates carry in RAM like we do for deletes, but I wouldn't start with that ...

{quote}
bq. Merging is very important.
Hmm, are we able to just merge all updates down to a single update? Ie, without merging the base segment? We can't express that today from MergePolicy right? In an NRT setting this seems very important (ie it'd be best bang (= improved search performance) for the buck (= merge cost)).

Shai is helping in creation of a benchmark to test performance in various scenarios. I will start adding updates aspects to the merge policy. I am not sure if merging just updates of a segment is feasible. In what cases would it be better than collapsing all updates into the base segment?
{quote}

Imagine a huge segment that's accumulating updates ... say it has 20 stacked segments. First off, those stacked segments are each tying up N file descriptors on open, right? (Well, only one if it's CFS). But second off, I would expect search perf with 1 base + 20 stacked is worse than 1 base + 1 stacked? We need to test if that's true ... it's likely that the most perf loss is going from no stacked segments to 1 stacked segment ... and then going from 1 to 20 stacked segments doesn't hurt "that much". We have to test and see.

Simply merging that big base segment with its 20 stacked segments is going to be too costly to do very often.

{quote}
bq. I think we need a test that indexes a known (randomly generated) set of documents, randomly sometimes using add and sometimes using update/replace field, mixing in deletes (just like TestField.addDocuments()), for the first index, and for the second index only using addDocument on the "surviving" documents, and then we assertIndexEquals(...) in the end? Maybe we can factor out code from TestDuelingCodecs or TestStressIndexing2.

TestFieldReplacements already had a test which randomly adds documents, replaces documents, adds fields and replaces fields. I refactored it to enable using a seed, and created a "clean" version with only addDocument(...) calls.
However, the FieldInfos of the "clean" version do not include things that the "full" version includes, because in the full version fields possessing certain field traits were added and then deleted. I will look at the other suggestions.
{quote}

It should be fine if the FieldInfos don't match? Ie, when comparing the two indices we should not compare field numbers? We should be comparing by only external things like fieldName, which id we had indexed, etc.

> Incremental Field Updates through Stacked Segments
> --------------------------------------------------
>
>                 Key: LUCENE-4258
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4258
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Sivan Yogev
>         Attachments: IncrementalFieldUpdates.odp,
> LUCENE-4258-API-changes.patch, LUCENE-4258.r1410593.patch,
> LUCENE-4258.r1412262.patch, LUCENE-4258.r1416438.patch,
> LUCENE-4258.r1416617.patch, LUCENE-4258.r1422495.patch,
> LUCENE-4258.r1423010.patch
>
>   Original Estimate: 2,520h
>  Remaining Estimate: 2,520h
>
> Shai and I would like to start working on the proposal to Incremental Field
> Updates outlined here (http://markmail.org/message/zhrdxxpfk6qvdaex).
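
To make the comparison suggested in the comment concrete (match documents by an externally indexed id and compare fields by name, ignoring internal field numbers), here is a minimal sketch. It does not use any Lucene API: `ToyIndex`, `externalView`, `doc`, and `sameContent` are hypothetical names invented for illustration, and the sketch assumes every document carries a unique "id" field and single-valued string fields.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ExternalCompare {

    /** Hypothetical toy index: its own field numbering plus documents stored by field number. */
    static final class ToyIndex {
        final Map<Integer, String> fieldNames = new HashMap<>();   // field number -> field name
        final List<Map<Integer, String>> docs = new ArrayList<>(); // list position = internal docID

        /** Returns the number of a field, assigning the next free number on first use. */
        int fieldNumber(String name) {
            for (Map.Entry<Integer, String> e : fieldNames.entrySet()) {
                if (e.getValue().equals(name)) {
                    return e.getKey();
                }
            }
            int number = fieldNames.size();
            fieldNames.put(number, name);
            return number;
        }

        void addDocument(Map<String, String> fields) {
            Map<Integer, String> doc = new HashMap<>();
            for (Map.Entry<String, String> e : fields.entrySet()) {
                doc.put(fieldNumber(e.getKey()), e.getValue());
            }
            docs.add(doc);
        }

        /** External view: value of the "id" field -> (field name -> value). */
        Map<String, Map<String, String>> externalView() {
            Map<String, Map<String, String>> view = new TreeMap<>();
            for (Map<Integer, String> doc : docs) {
                Map<String, String> byName = new TreeMap<>();
                for (Map.Entry<Integer, String> e : doc.entrySet()) {
                    byName.put(fieldNames.get(e.getKey()), e.getValue());
                }
                view.put(byName.get("id"), byName); // assumes a unique "id" per document
            }
            return view;
        }
    }

    /** Builds a field map from alternating name/value arguments. */
    static Map<String, String> doc(String... keyValuePairs) {
        Map<String, String> fields = new HashMap<>();
        for (int i = 0; i + 1 < keyValuePairs.length; i += 2) {
            fields.put(keyValuePairs[i], keyValuePairs[i + 1]);
        }
        return fields;
    }

    /** Compares only external things: ids, field names, values. Field numbers and docIDs are ignored. */
    static boolean sameContent(ToyIndex a, ToyIndex b) {
        return a.externalView().equals(b.externalView());
    }

    public static void main(String[] args) {
        ToyIndex full = new ToyIndex();
        full.fieldNumber("scratch"); // a field that was added and later removed still shifts the numbering
        full.addDocument(doc("id", "2", "title", "b"));
        full.addDocument(doc("id", "1", "title", "a"));

        ToyIndex clean = new ToyIndex();
        clean.addDocument(doc("id", "1", "title", "a"));
        clean.addDocument(doc("id", "2", "title", "b"));

        System.out.println("field numbering equal: " + full.fieldNames.equals(clean.fieldNames));
        System.out.println("external content equal: " + sameContent(full, clean));
    }
}
```

In this toy model the "full" index has a leftover field number and a different document order, yet the two indices still compare equal once they are projected onto ids and field names. A real test would build the same projection from the two indices' readers instead of from in-memory maps.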