[jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields

Michael McCandless (Commented) (JIRA) Thu, 01 Mar 2012 10:24:25 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220211#comment-13220211 ]


Michael McCandless commented on LUCENE-3837:
--------------------------------------------

I think the "wrong yet consistent stats" approach is good for scoring (just 
like deletes)?

So, an update would affect scoring (e.g., after an update the field has 4 
occurrences of python vs. only 1 occurrence before, so it now gets a better 
score), but the scores will not precisely match the scores I'd get from a full 
re-index instead of an update.
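
To make the point concrete, here is a minimal sketch. It assumes a simplified
classic TF-IDF formula (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)))
and ignores norms and boosts; the class name and all the numbers are
hypothetical, not taken from the issue.

```java
/** Hypothetical helper: simplified classic TF-IDF, not Lucene's full scoring. */
final class StaleStatsScoring {

    /**
     * Scores one term in one document.
     *
     * @param termFreq occurrences of the term in the (possibly updated) field
     * @param docFreq  number of documents containing the term, as seen by the reader
     * @param numDocs  number of documents, as seen by the reader
     */
    static double score(int termFreq, int docFreq, int numDocs) {
        double tf = Math.sqrt(termFreq);
        double idf = 1.0 + Math.log((double) numDocs / (docFreq + 1));
        return tf * idf;
    }
}
```

With the same (assumed stale) stats, `score(4, 3, 10)` is higher than
`score(1, 3, 10)`, so the update does improve the score. If a full re-index
would have produced, say, docFreq = 2, then `score(4, 2, 10)` differs from
`score(4, 3, 10)`: the score is "wrong", but every document in the index is
scored against the same stats.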
                
> A modest proposal for updateable fields
> ---------------------------------------
>
>                 Key: LUCENE-3837
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3837
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: core/index
>    Affects Versions: 4.0
>            Reporter: Andrzej Bialecki 
>
> I'd like to propose a simple design for implementing updateable fields in 
> Lucene. This design has some limitations, so I'm not claiming it will be 
> appropriate for every use case, and it's obvious it has some performance 
> consequences, but at least it's a start...
> This proposal uses a concept of "overlays" or "stacked updates", where the 
> original data is not removed but instead it's overlaid with the new data. I 
> propose to reuse as much of the existing APIs as possible, and represent 
> updates as an IndexReader. Updates to documents in a specific segment would 
> be collected in an "overlay" index specific to that segment, i.e. there would 
> be as many overlay indexes as there are segments in the primary index. 
> A field update would be represented as a new document in the overlay index. 
> The document would consist of just the updated fields, plus a field that 
> records the id in the primary segment of the document affected by the update. 
> These updates would be processed as usual via secondary IndexWriter-s, as 
> many as there are primary segments, so the same analysis chains would be 
> used, the same field types, etc.
> On opening a segment with updates the SegmentReader (see also LUCENE-3836) 
> would check for the presence of the "overlay" index, and if so it would open 
> it first (as an AtomicReader? or it would open individual codec format 
> readers? perhaps it should load the whole thing into memory?), and it would 
> construct an in-memory map between the primary's docId-s and the overlay's 
> docId-s. And finally it would wrap the original format readers with "overlay 
> readers", initialized also with the id map.
> Now, when consumers of the 4D API would ask for specific data, the "overlay 
> readers" would first re-map the primary's docId to the overlay's docId, and 
> check whether overlay data exists for that docId and this type of data (e.g. 
> postings, stored fields, vectors) and return this data instead of the 
> original. Otherwise they would return the original data.
> One obvious performance issue with this approach is that the sequential 
> access to primary data would translate into random access to the overlay 
> data. This could be solved by sorting the overlay index so that at least the 
> overlay ids increase monotonically as primary ids do.
> Updates to the primary index would be handled as usual, i.e. segment merges 
> (since the segments with updates would pretend to have no overlays) would just 
> work as usual; only the overlay index would have to be deleted once the 
> primary segment is deleted after merge.
> Updates to the existing documents that already had some fields updated would 
> be again handled as usual, only underneath they would open an IndexWriter on 
> the overlay index for a specific segment.
> That's the broad idea. Feel free to pipe in - I started some coding at the 
> codec level but got stuck using the approach in LUCENE-3836. The approach 
> that uses a modified SegmentReader seems more promising.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

