[ 
https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13534859#comment-13534859
 ] 

Michael McCandless commented on LUCENE-4258:
--------------------------------------------

{quote}
bq. Are stored fields now sparse? Meaning if I have a segment w/ many docs, and 
I update stored fields on one doc, in that tiny stacked segments will the 
stored fields files also be tiny?

In such a case you will get the equivalent of a segment with multiple docs 
where only one of them contains stored fields. I assume the stored fields 
implementations handle these cases well, and you will indeed get tiny stored 
fields.
{quote}

You're right, this is up to the codec ... hmm, but the API isn't sparse
(you have to call .addDocument 1M times to "skip over" 1M docs, right?),
and I'm not sure how well our current default
(Lucene41StoredFieldsFormat) handles it.  Have you tested it?
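The worry can be made concrete with a toy model: a per-document stored-fields API forces one entry per doc, so a stacked segment that stores fields for one doc out of 1M stays tiny only if the format can cheaply collapse long runs of field-less docs. All names below are invented for illustration; this is not the actual Lucene41StoredFieldsFormat:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a per-document stored-fields writer: every doc gets an entry,
// but runs of empty docs are run-length encoded, so a stacked segment with a
// single updated doc stays tiny regardless of segment size.
public class SparseStoredFieldsModel {
  // Each entry is either a run of empty docs ("EMPTYx<count>") or a payload.
  static List<String> write(int numDocs, int updatedDoc, String payload) {
    List<String> entries = new ArrayList<>();
    if (updatedDoc > 0) entries.add("EMPTYx" + updatedDoc);   // leading empties
    entries.add(payload);                                     // the one real doc
    int trailing = numDocs - updatedDoc - 1;
    if (trailing > 0) entries.add("EMPTYx" + trailing);       // trailing empties
    return entries;
  }

  public static void main(String[] args) {
    // 1M docs collapse to 3 entries when empties are run-length encoded.
    List<String> entries = write(1_000_000, 42, "title=hello");
    System.out.println(entries.size() + " entries: " + entries);
  }
}
```

If the real format instead writes even a few bytes per empty doc, the "tiny" stacked segment is 1M entries, which is exactly why this is worth measuring.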

bq. Regarding the API - I made some cleanup, and removed also 
Operation.ADD_DOCUMENT. Now there is only one way to perform each operation, 
and updateFields only allows you to add or replace fields given a term.

OK thanks!

{quote}
bq. This means you cannot reuse fields, you have to be careful with 
pre-tokenized fields (can't reuse the TokenStream), etc.

This is covered in the Javadoc of updateFields; let me know if there's a 
better way to address it.
{quote}

Maybe also state that one cannot reuse Field instances, since the
Field may not actually be "consumed" until some later time (we should
be vague since this really is an implementation detail).

bq. As for the heavier questions: NRT support should be considered separately, 
but the guideline I followed was to keep things as close as possible to the way 
deletions are handled. Therefore, we need to add to SegmentReader a field named 
liveUpdates - an equivalent of liveDocs. I already put a TODO for this 
(SegmentReader line 131); implementing it won't be simple...

OK ... yeah it's not simple!

bq. The performance tradeoff you are rightfully concerned about should be 
handled through merging. Once you merge an updated segment all updates are 
"cleaned", and the new segment has no performance issues. Apps that perform 
updates should make sure (through MergePolicy) to avoid reader-side updates as 
much as possible.

Merging is very important.  Hmm, are we able to just merge all updates
down to a single update?  Ie, without merging the base segment?  We
can't express that today from MergePolicy, right?  In an NRT setting
this seems very important (ie it'd be the best bang (= improved search
performance) for the buck (= merge cost)).

I suspect we need to do something with merging before committing
here.
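To make the idea concrete, "merging all updates down to a single update" could look like a fold over the update layers, oldest first, with newer field values winning, while the base segment is never rewritten. The representation and names below are invented for illustration, not the patch's actual data structures:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of collapsing a stack of update layers into one layer.
// Each layer maps docId -> (field -> new value); the base segment is
// untouched, so merge cost scales with the updates, not the base.
public class CollapseUpdateStack {
  static Map<Integer, Map<String, String>> collapse(
      List<Map<Integer, Map<String, String>>> layers) {
    Map<Integer, Map<String, String>> single = new HashMap<>();
    for (Map<Integer, Map<String, String>> layer : layers) {     // oldest first
      for (Map.Entry<Integer, Map<String, String>> e : layer.entrySet()) {
        single.computeIfAbsent(e.getKey(), k -> new HashMap<>())
              .putAll(e.getValue());                             // newer wins
      }
    }
    return single;
  }

  public static void main(String[] args) {
    List<Map<Integer, Map<String, String>>> layers = new ArrayList<>();
    layers.add(Map.of(3, Map.of("price", "10")));
    layers.add(Map.of(3, Map.of("price", "12"), 7, Map.of("stock", "0")));
    // Two layers collapse to one: doc 3 keeps the newer price=12.
    System.out.println(collapse(layers));
  }
}
```

The NRT appeal is that the cost of this collapse is proportional to the number of buffered updates, not to the size of the base segment, which is why expressing it from MergePolicy seems worth having.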

Hmm, I see that
StackedTerms.size()/getSumTotalTermFreq()/getSumDocFreq() pulls a
TermsEnum and goes and counts/aggregates all terms ... which in
general is horribly costly?  EG these methods are called per-query to
set up the Sim for scoring ... I think we need another solution here
(not sure what).  Also getDocCount() just returns -1 now ... maybe we
should only allow updates against DOCS_ONLY/omitNorms fields for now?
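One candidate for that other solution (a sketch with invented names, not the patch's API): maintain the aggregates incrementally while updates are buffered, so the per-query stats calls are O(1) instead of a full TermsEnum walk:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the cost concern: recomputing sumTotalTermFreq by enumerating
// every term is O(#terms) per call, while a running total maintained at
// write time answers the same question in O(1).
public class TermStatsAggregation {
  final Map<String, Long> termFreqs = new TreeMap<>();  // term -> totalTermFreq
  long cachedSum = 0;

  void addTerm(String term, long freq) {
    termFreqs.merge(term, freq, Long::sum);
    cachedSum += freq;                          // O(1) upkeep per buffered update
  }

  long sumByEnumeration() {                     // models what StackedTerms does now
    long sum = 0;
    for (long f : termFreqs.values()) sum += f; // walks every term, on every call
    return sum;
  }

  long sumCached() { return cachedSum; }        // O(1) at query time

  public static void main(String[] args) {
    TermStatsAggregation stats = new TermStatsAggregation();
    stats.addTerm("apple", 3);
    stats.addTerm("banana", 5);
    stats.addTerm("apple", 2);
    System.out.println(stats.sumByEnumeration() + " == " + stats.sumCached());
  }
}
```

Whether the running totals can be kept consistent when a replaced field's old postings are logically masked is the hard part; the sketch only shows the shape of the cheaper bookkeeping.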

Have you done any performance tests on biggish indices?

I think we need a test that indexes a known (randomly generated) set
of documents for the first index, randomly sometimes using add and
sometimes using update/replace field, mixing in deletes (just like
TestField.addDocuments()); for the second index, only using addDocument
on the "surviving" documents; and then we assertIndexEquals(...) at the
end.  Maybe we can factor out code from TestDuelingCodecs or
TestStressIndexing2.
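As a model of what that duel would check (invented names; the real test would drive IndexWriter and assertIndexEquals), a seed-driven sketch: the same op stream is applied eagerly to a flat index, and also resolved lazily through the update "stack", newest layer first, and the two views must agree:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Toy duel harness: ops are {isDelete, docId, fieldId, value}.
public class DuelModel {
  static List<int[]> makeOps(long seed, int n) {
    Random r = new Random(seed);
    List<int[]> ops = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      ops.add(new int[] {r.nextInt(5) == 0 ? 1 : 0,
                         r.nextInt(8), r.nextInt(3), r.nextInt(9)});
    }
    return ops;
  }

  // Eager path: mutate a flat docId -> (fieldId -> value) map in op order.
  static Map<Integer, Map<Integer, Integer>> eager(List<int[]> ops) {
    Map<Integer, Map<Integer, Integer>> index = new HashMap<>();
    for (int[] op : ops) {
      if (op[0] == 1) index.remove(op[1]);
      else index.computeIfAbsent(op[1], k -> new HashMap<>()).put(op[2], op[3]);
    }
    return index;
  }

  // Lazy path: treat each op as a stacked layer and resolve per doc,
  // newest layer first; a delete tombstones everything older.
  static Map<Integer, Map<Integer, Integer>> lazy(List<int[]> ops) {
    Set<Integer> docs = new HashSet<>();
    for (int[] op : ops) docs.add(op[1]);
    Map<Integer, Map<Integer, Integer>> view = new HashMap<>();
    for (int doc : docs) {
      Map<Integer, Integer> fields = new HashMap<>();
      for (int i = ops.size() - 1; i >= 0; i--) {
        int[] op = ops.get(i);
        if (op[1] != doc) continue;
        if (op[0] == 1) break;               // newest delete: stop stacking
        fields.putIfAbsent(op[2], op[3]);    // newest value wins
      }
      if (!fields.isEmpty()) view.put(doc, fields);
    }
    return view;
  }

  public static void main(String[] args) {
    List<int[]> ops = makeOps(42L, 500);
    if (!eager(ops).equals(lazy(ops))) throw new AssertionError("views differ");
    System.out.println("duel passed");
  }
}
```

The real test would of course compare actual indexes field by field, but the structure is the same: one seeded op stream, two construction paths, equality at the end.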

Where do we account for the RAM used by these buffered updates?  I see
BufferedUpdates.addTerm has some accounting the first time it sees a
given term, but where do we actually add in the RAM used by the
FieldsUpdate itself?
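One hypothetical shape for that missing accounting (constants and names are illustrative, not Lucene's actual values): have each buffered update report its own footprint, and fold that into the same counter the per-term bookkeeping already bumps, so the flush trigger sees the real total:

```java
// Sketch of per-update RAM accounting with invented, rough JVM constants.
public class UpdateRamAccounting {
  static final int OBJ_HEADER = 16, REF = 8, CHAR = 2;

  // Rough bytes for one buffered update carrying a term and a field payload.
  static long ramBytesUsed(String termText, String fieldPayload) {
    long bytes = OBJ_HEADER + 2L * REF;                        // the update object
    bytes += OBJ_HEADER + (long) termText.length() * CHAR;     // the term text
    bytes += OBJ_HEADER + (long) fieldPayload.length() * CHAR; // the new field data
    return bytes;
  }

  public static void main(String[] args) {
    long total = 0;
    total += ramBytesUsed("id:17", "title=hello world");
    total += ramBytesUsed("id:99", "body=longer replacement text");
    System.out.println("buffered updates ~" + total + " bytes");
  }
}
```

However it is computed, the key point is that the FieldsUpdate payload itself gets counted, not just the first sighting of each term.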

                
> Incremental Field Updates through Stacked Segments
> --------------------------------------------------
>
>                 Key: LUCENE-4258
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4258
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Sivan Yogev
>         Attachments: IncrementalFieldUpdates.odp, 
> LUCENE-4258-API-changes.patch, LUCENE-4258.r1410593.patch, 
> LUCENE-4258.r1412262.patch, LUCENE-4258.r1416438.patch, 
> LUCENE-4258.r1416617.patch, LUCENE-4258.r1422495.patch, 
> LUCENE-4258.r1423010.patch
>
>   Original Estimate: 2,520h
>  Remaining Estimate: 2,520h
>
> Shai and I would like to start working on the proposal to Incremental Field 
> Updates outlined here (http://markmail.org/message/zhrdxxpfk6qvdaex).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
