[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13534859#comment-13534859 ]
Michael McCandless commented on LUCENE-4258:
--------------------------------------------

{quote}
bq. Are stored fields now sparse? Meaning if I have a segment w/ many docs, and I update stored fields on one doc, in that tiny stacked segment will the stored fields files also be tiny?

In such case you will get the equivalent of a segment with multiple docs with only one of them containing stored fields. I assume the impls of stored fields handle these cases well and you will indeed get tiny stored fields.
{quote}

You're right, this is up to the codec ... hmm, but the API isn't sparse (you have to call .addDocument 1M times to "skip over" 1M docs, right?), and I'm not sure how well our current default (Lucene41StoredFieldsFormat) handles that. Have you tested it?

bq. Regarding the API - I made some cleanup, and also removed Operation.ADD_DOCUMENT. Now there is only one way to perform each operation, and updateFields only allows you to add or replace fields given a term.

OK, thanks!

{quote}
bq. This means you cannot reuse fields, you have to be careful with pre-tokenized fields (can't reuse the TokenStream), etc.

This is referred to in the Javadoc of updateFields, let me know if there's a better way to address it.
{quote}

Maybe also state that one cannot reuse Field instances, since the Field may not actually be "consumed" until some later time (we should be vague since this really is an implementation detail). A rough sketch of this caveat is below.

bq. As for the heavier questions. NRT support should be considered separately, but the guideline I followed was to keep things as close as possible to the way deletions are handled. Therefore, we need to add to SegmentReader a field named liveUpdates - an equivalent to liveDocs. I already put a TODO for this (SegmentReader line 131), implementing it won't be simple...

OK ... yeah, it's not simple!

bq. The performance tradeoff you are rightfully concerned about should be handled through merging. Once you merge an updated segment all updates are "cleaned", and the new segment has no performance issues. Apps that perform updates should make sure (through MergePolicy) to avoid reader-side updates as much as possible.

Merging is very important. Hmm, are we able to just merge all updates down to a single update, i.e. without merging the base segment? We can't express that today from MergePolicy, right? In an NRT setting this seems very important (i.e. it'd be the best bang (= improved search performance) for the buck (= merge cost)). I suspect we need to do something with merging before committing here.

Hmm, I see that StackedTerms.size()/getSumTotalTermFreq()/getSumDocFreq() pulls a TermsEnum and goes and counts/aggregates all terms ... which in general is horribly costly? E.g. these methods are called per-query to set up the Sim for scoring ... I think we need another solution here (not sure what; one possible direction is sketched below). Also getDocCount() just returns -1 now ... maybe we should only allow updates against DOCS_ONLY/omitsNorms fields for now?

Have you done any performance tests on biggish indices?

I think we need a test that indexes a known (randomly generated) set of documents, randomly sometimes using add and sometimes using update/replace field, mixing in deletes (just like TestField.addDocuments()), for the first index, and for the second index only using addDocument on the "surviving" documents, and then we assertIndexEquals(...) at the end. Maybe we can factor out code from TestDuelingCodecs or TestStressIndexing2; a rough skeleton is sketched below.

Where do we account for the RAM used by these buffered updates? I see BufferedUpdates.addTerm has some accounting the first time it sees a given term, but where do we actually add in the RAM used by the FieldsUpdate itself? (One way to wire that in is sketched below.)
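
To make the Field-reuse caveat concrete, here is a rough sketch. The updateFields(Term, Field) signature is only a guess at the patch's API (it may take a whole document or several fields), so read those calls as pseudocode for whatever the real method looks like:

{code:java}
import java.io.IOException;

import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class FieldReuseCaveat {

  static void risky(IndexWriter writer) throws IOException {
    Field title = new StringField("title", "v1", Field.Store.YES);
    writer.updateFields(new Term("id", "1"), title);  // assumed patch API
    // The first update may not have been consumed yet, so mutating the same
    // Field instance (or re-tokenizing its TokenStream) here can change what
    // the first update eventually indexes:
    title.setStringValue("v2");
    writer.updateFields(new Term("id", "2"), title);  // assumed patch API
  }

  static void safer(IndexWriter writer) throws IOException {
    // A fresh Field instance per update call:
    writer.updateFields(new Term("id", "1"), new StringField("title", "v1", Field.Store.YES));
    writer.updateFields(new Term("id", "2"), new StringField("title", "v2", Field.Store.YES));
  }
}
{code}

The point is only that an update may buffer the Field and consume it later, so the Javadoc should warn against reuse without promising exactly when consumption happens.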
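
On the StackedTerms stats question, one possible direction (only an idea, not something the patch does) is to compute the aggregates once per instance and cache them, since a Terms instance is fixed for the lifetime of its point-in-time reader. CachedTermsStats below is a hypothetical standalone helper, just to show the shape:

{code:java}
import java.io.IOException;

import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

class CachedTermsStats {
  private final Terms in;
  private long size = -1, sumTotalTermFreq, sumDocFreq;

  CachedTermsStats(Terms in) {
    this.in = in;
  }

  synchronized long size() throws IOException { compute(); return size; }
  synchronized long sumTotalTermFreq() throws IOException { compute(); return sumTotalTermFreq; }
  synchronized long sumDocFreq() throws IOException { compute(); return sumDocFreq; }

  private void compute() throws IOException {
    if (size != -1) {
      return;  // already aggregated for this point-in-time Terms
    }
    long count = 0, df = 0, ttf = 0;
    boolean ttfAvailable = true;
    TermsEnum te = in.iterator(null);
    for (BytesRef term = te.next(); term != null; term = te.next()) {
      count++;
      df += te.docFreq();
      long tf = te.totalTermFreq();
      if (tf == -1) {
        ttfAvailable = false;  // freqs omitted (e.g. DOCS_ONLY field)
      } else {
        ttf += tf;
      }
    }
    size = count;
    sumDocFreq = df;
    sumTotalTermFreq = ttfAvailable ? ttf : -1;
  }
}
{code}

That only amortizes the cost across queries, though: the first query against a big updated segment still pays the full TermsEnum walk, so precomputing the stats at update/flush time may be the better answer.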
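
A rough skeleton of such a dueling test might look like the following. The class name is a placeholder, the updateFields call is a guess at the patch's API, and assertIndexEquals stands for the comparison code to factor out of TestDuelingCodecs/TestStressIndexing2:

{code:java}
import org.apache.lucene.analysis.MockAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.LuceneTestCase;

public class TestStackedUpdatesVsAddOnly extends LuceneTestCase {

  public void testRandomUpdatesEqualAddOnly() throws Exception {
    Directory dir1 = newDirectory(), dir2 = newDirectory();
    IndexWriter writer1 = new IndexWriter(dir1,
        new IndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())));
    IndexWriter writer2 = new IndexWriter(dir2,
        new IndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())));

    int numDocs = atLeast(100);
    Document[] expected = new Document[numDocs];

    for (int i = 0; i < numDocs; i++) {
      String id = Integer.toString(i);
      Document doc = new Document();
      doc.add(new StringField("id", id, Field.Store.YES));
      doc.add(new TextField("body", "body " + random().nextInt(10), Field.Store.YES));
      writer1.addDocument(doc);
      expected[i] = doc;

      if (random().nextInt(4) == 0) {
        // Replace the "body" field of a random, previously added document.
        int target = random().nextInt(i + 1);
        if (expected[target] != null) {
          String newBody = "updated " + random().nextInt(10);
          // Assumed signature; adjust to the patch's actual updateFields API:
          writer1.updateFields(new Term("id", Integer.toString(target)),
                               new TextField("body", newBody, Field.Store.YES));
          // Mirror the replacement in the expected document:
          expected[target].removeFields("body");
          expected[target].add(new TextField("body", newBody, Field.Store.YES));
        }
      }

      if (random().nextInt(10) == 0) {
        writer1.deleteDocuments(new Term("id", id));
        expected[i] = null;  // not a "surviving" document
      }
    }

    // Second index: plain addDocument of only the surviving, fully-updated docs.
    for (Document doc : expected) {
      if (doc != null) {
        writer2.addDocument(doc);
      }
    }

    writer1.close();
    writer2.close();

    DirectoryReader r1 = DirectoryReader.open(dir1);
    DirectoryReader r2 = DirectoryReader.open(dir2);
    assertIndexEquals(r1, r2);  // the helper to factor out of TestDuelingCodecs/TestStressIndexing2
    r1.close();
    r2.close();
    dir1.close();
    dir2.close();
  }
}
{code}

The invariant being checked is that, after updates and deletes, the stacked index and the add-only index of surviving documents are equivalent for every query.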
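
And on the RAM accounting, a sketch of the kind of wiring meant. FieldsUpdate's internals aren't visible here, so RamUsageEstimator.sizeOf over the whole update object stands in for a real estimate, and bufferUpdate stands in for whatever BufferedUpdates ends up doing:

{code:java}
import java.util.concurrent.atomic.AtomicLong;

import org.apache.lucene.util.RamUsageEstimator;

class FieldsUpdateRamAccounting {

  /** Rough per-update estimate; a real impl would size the buffered fields directly. */
  static long ramBytesUsedEstimate(Object fieldsUpdate) {
    // Deep sizeOf walks the object graph; for large TokenStreams or byte[]
    // payloads a cheaper hand-rolled estimate may be preferable.
    return RamUsageEstimator.sizeOf(fieldsUpdate);
  }

  static void bufferUpdate(AtomicLong bytesUsed, Object fieldsUpdate) {
    // ... existing per-term accounting, as in BufferedUpdates.addTerm ...
    bytesUsed.addAndGet(ramBytesUsedEstimate(fieldsUpdate));
    // The flush/stall policy (driven by IndexWriterConfig's RAM buffer size)
    // would then see the buffered updates, not just the deleted terms.
  }
}
{code}

However FieldsUpdate ends up measuring itself, the result needs to feed the same bytesUsed counter the flush policy consults; otherwise a long run of buffered field updates can blow well past the configured RAM buffer.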
> Incremental Field Updates through Stacked Segments
> --------------------------------------------------
>
>                 Key: LUCENE-4258
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4258
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Sivan Yogev
>         Attachments: IncrementalFieldUpdates.odp, LUCENE-4258-API-changes.patch, LUCENE-4258.r1410593.patch, LUCENE-4258.r1412262.patch, LUCENE-4258.r1416438.patch, LUCENE-4258.r1416617.patch, LUCENE-4258.r1422495.patch, LUCENE-4258.r1423010.patch
>
>   Original Estimate: 2,520h
>  Remaining Estimate: 2,520h
>
> Shai and I would like to start working on the proposal to Incremental Field Updates outlined here (http://markmail.org/message/zhrdxxpfk6qvdaex).