[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments

Michael McCandless (JIRA) Thu, 20 Dec 2012 04:47:15 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537000#comment-13537000 ]


Michael McCandless commented on LUCENE-4258:
--------------------------------------------


{quote}
After rethinking the point-of-inversion issue, seems like the right time to do 
it is ASAP - not to hold the added fields and invert them later, but rather 
invert them immediately and save their inverted version. 3 reasons for that:
1. Take out the constraint I inserted into the API, so update fields can be 
reused and contain Reader/TokenStream,
2. NRT support: we cannot search until we invert, and if we invert earlier NRT 
support will be less complicated, probably some variation on multi-reader to 
view uncommitted updates,
3. You are correct that we currently do not account for the RAM usage of the 
FieldsUpdate, since I thought using RAMUsageEstimator will be too costly. It 
will probably be more efficient to calculate RAM usage of the inverted fields, 
maybe even during inversion?
{quote}

+1

I would also add "4. Inversion of updates is currently single-threaded",
i.e. once we move inversion into .updateFields it will be multi-threaded again.

bq. So my question in that regard is how can I invert a document and hold its 
inverted form to be used by NRT and later inserted into stacked segment? Should 
I create a temporary Directory and invert into it? Is there another way to do 
this?

I think we should somehow re-use the existing code that inverts (eg
FreqProxTermsWriter)?  Ie, invert into an in-RAM segment, with
"temporary" docIDs, and then when it's time to apply the updates, you
need to rewrite the postings to disk with the re-mapped docIDs.
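
The re-mapping step can be sketched in plain Java. This is only an illustration under stated assumptions: TempDocIdRemapper, remap and the array-based tempToReal mapping are hypothetical names, not Lucene APIs, and real postings (freqs, positions, FreqProxTermsWriter's buffers) are reduced here to a term -> docID list.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; none of these names are Lucene APIs. */
final class TempDocIdRemapper {

    /**
     * Rewrites in-RAM postings so that every temporary docID assigned at
     * inversion time is replaced by the docID it targets in the base segment.
     *
     * @param tempPostings term -> temporary docIDs (assumed layout)
     * @param tempToReal   tempToReal[tempDocId] = docID in the base segment
     */
    static Map<String, int[]> remap(Map<String, int[]> tempPostings, int[] tempToReal) {
        Map<String, int[]> remapped = new TreeMap<>();
        for (Map.Entry<String, int[]> entry : tempPostings.entrySet()) {
            int[] temp = entry.getValue();
            int[] real = new int[temp.length];
            for (int i = 0; i < temp.length; i++) {
                real[i] = tempToReal[temp[i]];
            }
            // The mapping need not be monotonic, and two updates may target the
            // same document, so restore ascending order and drop duplicates.
            remapped.put(entry.getKey(), Arrays.stream(real).sorted().distinct().toArray());
        }
        return remapped;
    }
}
```

For example, with tempToReal = {42, 7, 3}, a term seen in temporary docs 0 and 2 ends up with the postings list [3, 42] once the updates are applied.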

I wouldn't do anything special for NRT for starters, meaning, from
NRT's standpoint, it opens these stacked segments from disk as it
would if a new non-NRT reader was being opened.  So I would leave that
TODO in SegmentReader as a TODO for now :)  Later, we can optimize
this and have updates carry in RAM like we do for deletes, but I
wouldn't start with that ...

{quote}
bq. Merging is very important. Hmm, are we able to just merge all updates down 
to a single update? Ie, without merging the base segment? We can't express that 
today from MergePolicy right? In an NRT setting this seems very important (ie 
it'd be best bang (= improved search performance) for the buck (= merge cost)).

Shai is helping in creation of a benchmark to test performance in various 
scenarios. I will start adding updates aspects to the merge policy. I am not 
sure if merging just updates of a segment is feasible. In what cases would it 
be better than collapsing all updates into the base segment?
{quote}

Imagine a huge segment that's accumulating updates ... say it has 20
stacked segments.  First off, those stacked segments are each tying up
N file descriptors on open, right?  (Well, only one if it's CFS).  But
second off, I would expect search perf with 1 base + 20 stacked is
worse than 1 base + 1 stacked?  We need to test if that's true
... it's likely that most of the perf loss comes from going from no
stacked segments to 1 stacked segment ... and then going from 1 to 20
stacked segments doesn't hurt "that much".  We have to test and see.

Simply merging that big base segment with its 20 stacked segments is
going to be too costly to do very often.
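
To make the "merge only the updates" idea concrete, here is a rough sketch of collapsing N stacked update layers into one without touching the base segment. It assumes each stacked segment can be modelled as docID -> (fieldName -> value) and that a later layer replaces an earlier value for the same field; StackedUpdateCollapser and collapse are hypothetical names, and nothing here is expressible through MergePolicy today.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; none of these names are Lucene APIs. */
final class StackedUpdateCollapser {

    /**
     * Collapses stacked update layers (oldest first) into a single layer.
     * The base segment is never read or rewritten.
     *
     * @param layers each layer maps docID -> (fieldName -> new value)
     */
    static Map<Integer, Map<String, String>> collapse(
            List<Map<Integer, Map<String, String>>> layers) {
        Map<Integer, Map<String, String>> merged = new TreeMap<>();
        for (Map<Integer, Map<String, String>> layer : layers) {
            for (Map.Entry<Integer, Map<String, String>> doc : layer.entrySet()) {
                // Assumed "replace" semantics: a newer layer wins per field.
                merged.computeIfAbsent(doc.getKey(), k -> new TreeMap<>())
                      .putAll(doc.getValue());
            }
        }
        return merged;
    }
}
```

The cost of this is proportional to the size of the updates only, which is why it could run far more often than a merge that rewrites the big base segment.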

{quote}
bq. I think we need a test that indexes a known (randomly generated) set of 
documents, randomly sometimes using add and sometimes using update/replace 
field, mixing in deletes (just like TestField.addDocuments()), for the first 
index, and for the second index only using addDocument on the "surviving" 
documents, and then we assertIndexEquals(...) in the end? Maybe we can factor 
out code from TestDuelingCodecs or TestStressIndexing2.

TestFieldReplacements already had a test which randomly adds documents, 
replaces documents, adds fields and replaces fields. I refactored it to enable 
using a seed, and created a "clean" version with only addDocument(...) calls. 
However, the FieldInfos of the "clean" version do not include things that the 
"full" version includes because in the full version fields possessing certain 
field traits were added and then deleted. I will look at the other suggestions.
{quote}

It should be fine if the FieldInfos don't match?  Ie, when comparing
the two indices we should not compare field numbers?  We should be
comparing by only external things like fieldName, which id we had
indexed, etc.
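
A minimal sketch of such a comparison, assuming each index can be dumped to id -> (fieldName -> field). ExternalIndexComparison, IndexedField and sameExternally are hypothetical names for illustration, not existing test utilities; the point is only that fieldNumber is never consulted.

```java
import java.util.Map;

/** Hypothetical illustration only; none of these names are Lucene APIs. */
final class ExternalIndexComparison {

    /** A field as one index sees it. */
    static final class IndexedField {
        final int fieldNumber; // internal; may legitimately differ between indices
        final String value;    // external; must match

        IndexedField(int fieldNumber, String value) {
            this.fieldNumber = fieldNumber;
            this.value = value;
        }
    }

    /** Compares two indices by document id, field name and value only. */
    static boolean sameExternally(Map<String, Map<String, IndexedField>> a,
                                  Map<String, Map<String, IndexedField>> b) {
        if (!a.keySet().equals(b.keySet())) {
            return false; // different sets of surviving document ids
        }
        for (String id : a.keySet()) {
            Map<String, IndexedField> fieldsA = a.get(id);
            Map<String, IndexedField> fieldsB = b.get(id);
            if (!fieldsA.keySet().equals(fieldsB.keySet())) {
                return false; // different field names for this document
            }
            for (String name : fieldsA.keySet()) {
                if (!fieldsA.get(name).value.equals(fieldsB.get(name).value)) {
                    return false; // same field name, different content
                }
            }
        }
        return true;
    }
}
```

With this shape, a "full" index where a field ended up as number 3 and a "clean" index where it is number 0 still compare as equal, as long as ids, field names and values agree.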

                
> Incremental Field Updates through Stacked Segments
> --------------------------------------------------
>
>                 Key: LUCENE-4258
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4258
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Sivan Yogev
>         Attachments: IncrementalFieldUpdates.odp, 
> LUCENE-4258-API-changes.patch, LUCENE-4258.r1410593.patch, 
> LUCENE-4258.r1412262.patch, LUCENE-4258.r1416438.patch, 
> LUCENE-4258.r1416617.patch, LUCENE-4258.r1422495.patch, 
> LUCENE-4258.r1423010.patch
>
>   Original Estimate: 2,520h
>  Remaining Estimate: 2,520h
>
> Shai and I would like to start working on the proposal for Incremental Field 
> Updates outlined here (http://markmail.org/message/zhrdxxpfk6qvdaex).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
