[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments

Michael McCandless (JIRA) Fri, 07 Dec 2012 10:31:22 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526609#comment-13526609
 ]


Michael McCandless commented on LUCENE-4258:
--------------------------------------------

{quote}
bq. Why do we have FieldsUpdate.Operation.ADD_DOCUMENT? It seems weird to pass 
that to IW.updateFields? Shouldn't apps just use IW.addDocument?

We have ADD_ and REPLACE_ for FIELDS, and also REPLACE_DOCUMENTS, so having 
ADD_DOCUMENT would allow applications to work only with updateFields. There 
certainly are actions that can be performed in more than one way in this API; 
do you find this too confusing?
{quote}

Well I just generally prefer that there is one [obvious] way to do
something ... it can cause confusion otherwise, ie users will wonder
what's the difference between addDocument and
updateFields(Operation.ADD_DOCUMENT, ...)

{quote}
bq. Why do we need SegmentInfoReader.readFilesList? ...

I considered the alternative you propose of having a segmentInfo for each 
stacked segment, and it seemed more complex to manage than what is done with 
.del files, so I chose the .del files approach. You are right about its 
privacy; I removed it from SegmentInfoReader and the actual readers have it 
privately.
{quote}

OK.

{quote}
bq. It looks like merge policies don't yet know about / target stacked segments 
...

I was planning to have it in another issue. Should I create it already?
{quote}

Another issue is a good idea!  No need to create it yet ... but it
seems like it will be important for real usage.

Do we have any sense of how performance degrades as the stack gets
bigger?  It's more on-the-fly merging at search-time...

I'm worried about that search-time merge cost ... I think it's usually
better to pay a higher indexing cost in exchange for faster search
time, which makes LUCENE-4272 a compelling alternate approach...
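
To make the concern concrete, here is a rough sketch of what search-time
merging over a stack could look like. It assumes each layer (the base segment
plus each stacked segment) exposes a sorted array of doc IDs for a term and
that the reader must present their union. The class and method names are
hypothetical, and the patch may well do this differently. The point is only
that every doc ID pulled from the merge costs a heap operation that grows
with log(stack depth), which a single already-merged segment would not pay.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration; not code from the LUCENE-4258 patch.
final class StackedPostingsMergeSketch {

    // Cursor over one layer's sorted doc IDs.
    private static final class Cursor {
        final int[] docs;
        int pos;

        Cursor(int[] docs) {
            this.docs = docs;
        }

        int current() {
            return docs[pos];
        }

        boolean advance() {
            return ++pos < docs.length;
        }
    }

    // Returns the sorted union of the doc IDs of all layers (base + stacked).
    static int[] merge(List<int[]> layers) {
        PriorityQueue<Cursor> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.current(), b.current()));
        for (int[] docs : layers) {
            if (docs.length > 0) {
                queue.add(new Cursor(docs));
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor top = queue.poll();          // O(log k) for k layers
            int doc = top.current();
            // Skip doc IDs already emitted by another layer.
            if (merged.isEmpty() || merged.get(merged.size() - 1) != doc) {
                merged.add(doc);
            }
            if (top.advance()) {
                queue.add(top);                 // O(log k) again
            }
        }
        int[] result = new int[merged.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = merged.get(i);
        }
        return result;
    }
}
```

With one layer the queue degenerates to a plain scan; each extra stacked
segment adds another cursor that takes part in every comparison.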

{quote}
bq. It seems like we don't invert the document updates until the updates are 
applied? ...

I went for the simple solution trying to introduce as few new concepts as 
possible (and still the patch size is >7000 lines). Your proposal should 
certainly be considered and maybe tested. I need to make sure I do the RAM 
calculations right: the added documents must be reflected in the RAM 
consumption of the deletions queue.
{quote}

OK that makes sense; we should definitely do whatever's
easiest/fastest to get to a dirt path.

We should think through the tradeoffs.  I think it may confuse apps
that the Field is not "consumed" after IW.updateFields returns, but
rather cached and processed later.  This means you cannot reuse
fields, you have to be careful with pre-tokenized fields (can't reuse
the TokenStream), etc.

It also means NRT reopen is unexpectedly costly, because only on flush
will we invert & index the documents, and it's a single-threaded
operation during reopen (vs per-thread if we invert up front).

Still it makes sense to do this for starters ... it's simpler.
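
The field-reuse hazard above can be sketched with a toy model. Nothing here
is Lucene's API: MutableField and DeferredUpdateWriter are hypothetical
stand-ins, assuming only that the writer keeps a reference to the field and
reads its value at flush time rather than when updateFields is called.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a reusable, mutable field object.
final class MutableField {
    final String name;
    String value;

    MutableField(String name, String value) {
        this.name = name;
        this.value = value;
    }

    void setValue(String value) {
        this.value = value;
    }
}

// Hypothetical writer that buffers updates and only processes them on flush.
final class DeferredUpdateWriter {
    private final List<MutableField> pending = new ArrayList<>();
    private final List<String> indexed = new ArrayList<>();

    // Keeps a reference to the field; the value is NOT read here.
    void updateFields(MutableField field) {
        pending.add(field);
    }

    // The field values are read only now, long after updateFields returned.
    void flush() {
        for (MutableField field : pending) {
            indexed.add(field.name + "=" + field.value);
        }
        pending.clear();
    }

    List<String> indexed() {
        return indexed;
    }

    public static void main(String[] args) {
        DeferredUpdateWriter writer = new DeferredUpdateWriter();
        MutableField body = new MutableField("body", "first");
        writer.updateFields(body);
        body.setValue("second");   // reusing the instance, as apps often do
        writer.updateFields(body);
        writer.flush();
        System.out.println(writer.indexed());
    }
}
```

Both buffered updates end up with the value "second", because the first
value was overwritten before the flush ever looked at it. Inverting up front
would consume the value inside updateFields and avoid this.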

{quote}
bq. Why does StoredFieldsReader.visitDocument need a Set for ignored fields?

When fetching stored fields from a segment with replacements, it is possible 
that all contents of a certain field for the base and first n stacked segments 
should be ignored. Therefore, the implementation starts the visiting from the 
most recent updates. If we encounter at some stage a field replacement, that 
field name is added to the Set of ignored fields, and later the content of that 
field in the stacked segments we encounter (which are older updates) is ignored.
{quote}

Ahhh right.
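
For reference, a minimal sketch of the newest-first visiting order described
in the quote, using plain collections. StoredFieldUpdate and
StackedStoredFieldsSketch are hypothetical names, not classes from the patch,
and the sketch assumes a replacement hides that field in all older layers
while an add does not.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of one stored-field entry for a single document.
final class StoredFieldUpdate {
    final String field;
    final String value;
    final boolean replace; // true: replaces older content; false: adds to it

    StoredFieldUpdate(String field, String value, boolean replace) {
        this.field = field;
        this.value = value;
        this.replace = replace;
    }
}

final class StackedStoredFieldsSketch {

    // layers.get(0) is the base segment; higher indexes are newer stacked
    // segments. Returns the visible "field=value" pairs, newest first.
    static List<String> visitDocument(List<List<StoredFieldUpdate>> layers) {
        Set<String> ignored = new HashSet<>();
        List<String> visible = new ArrayList<>();
        for (int i = layers.size() - 1; i >= 0; i--) {
            Set<String> replacedHere = new HashSet<>();
            for (StoredFieldUpdate update : layers.get(i)) {
                if (ignored.contains(update.field)) {
                    continue; // hidden by a newer replacement
                }
                visible.add(update.field + "=" + update.value);
                if (update.replace) {
                    replacedHere.add(update.field);
                }
            }
            // Only older layers are affected by replacements seen in this one.
            ignored.addAll(replacedHere);
        }
        return visible;
    }
}
```

For a base segment holding title=old and tag=a, a first stacked segment that
adds tag=b, and a second that replaces title with new, the visit yields
title=new, tag=b, tag=a: the old title is skipped because "title" was put in
the ignored set before the base segment was reached.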

Are stored fields now sparse?  Meaning if I have a segment w/ many
docs, and I update stored fields on one doc, in that tiny stacked
segment will the stored fields files also be tiny?

                
> Incremental Field Updates through Stacked Segments
> --------------------------------------------------
>
>                 Key: LUCENE-4258
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4258
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Sivan Yogev
>         Attachments: IncrementalFieldUpdates.odp, 
> LUCENE-4258-API-changes.patch, LUCENE-4258.r1410593.patch, 
> LUCENE-4258.r1412262.patch, LUCENE-4258.r1416438.patch, 
> LUCENE-4258.r1416617.patch
>
>   Original Estimate: 2,520h
>  Remaining Estimate: 2,520h
>
> Shai and I would like to start working on the proposal to Incremental Field 
> Updates outlined here (http://markmail.org/message/zhrdxxpfk6qvdaex).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
