A modest proposal for updateable fields
---------------------------------------

                 Key: LUCENE-3837
                 URL: https://issues.apache.org/jira/browse/LUCENE-3837
             Project: Lucene - Java
          Issue Type: New Feature
          Components: core/index
    Affects Versions: 4.0
            Reporter: Andrzej Bialecki 


I'd like to propose a simple design for implementing updateable fields in 
Lucene. This design has some limitations, so I'm not claiming it will be 
appropriate for every use case, and it's obvious it has some performance 
consequences, but at least it's a start...

This proposal uses a concept of "overlays" or "stacked updates", where the 
original data is not removed but instead it's overlaid with the new data. I 
propose to reuse as much of the existing APIs as possible, and represent 
updates as an IndexReader. Updates to documents in a specific segment would be 
collected in an "overlay" index specific to that segment, i.e. there would be 
as many overlay indexes as there are segments in the primary index. 

A field update would be represented as a new document in the overlay index. 
The document would consist of just the updated fields, plus a field that 
records the id in the primary segment of the document affected by the update. 
These updates would be processed as usual via secondary IndexWriter-s, as many 
as there are primary segments, so the same analysis chains would be used, the 
same field types, etc.
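
To make the shape of such an update document concrete, here is a minimal sketch in plain Java. It deliberately avoids Lucene classes: the document is modelled as a map, and the class name and the field name "_primaryDocId" are made up for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of an overlay "update document"; not a Lucene class. */
final class OverlayUpdateSketch {
    /** Assumed name of the field that records the primary segment's docId. */
    static final String PRIMARY_ID_FIELD = "_primaryDocId";

    /** Builds a document holding only the updated fields plus the primary docId. */
    static Map<String, String> updateDoc(int primaryDocId, Map<String, String> updatedFields) {
        Map<String, String> doc = new LinkedHashMap<>(updatedFields);
        doc.put(PRIMARY_ID_FIELD, Integer.toString(primaryDocId));
        return doc;
    }
}
```

In the real design this document would go through the overlay's IndexWriter, so the updated fields would get the same analysis as in the primary index.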

On opening a segment with updates the SegmentReader (see also LUCENE-3836) 
would check for the presence of the "overlay" index, and if one exists it would 
open it first (as an AtomicReader? or it would open individual codec format 
readers? perhaps it should load the whole thing into memory?) and construct an 
in-memory map between the primary's docId-s and the overlay's docId-s. Finally, 
it would wrap the original format readers with "overlay readers", which would 
also be initialized with the id map.
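
The id map could be as simple as an int array indexed by primary docId. The sketch below assumes that the primary id recorded in each overlay document has already been read into an array; the class name, the -1 sentinel and the "latest update wins" rule are my assumptions, not part of the proposal.

```java
import java.util.Arrays;

/** Hypothetical sketch: maps primary docIds to overlay docIds for one segment. */
final class OverlayIdMapSketch {
    /** Sentinel meaning "this primary document has no overlay document". */
    static final int NO_OVERLAY = -1;

    private final int[] primaryToOverlay;

    /**
     * @param primaryMaxDoc         number of documents in the primary segment
     * @param primaryIdOfOverlayDoc for each overlay docId, the primary docId it updates
     */
    OverlayIdMapSketch(int primaryMaxDoc, int[] primaryIdOfOverlayDoc) {
        primaryToOverlay = new int[primaryMaxDoc];
        Arrays.fill(primaryToOverlay, NO_OVERLAY);
        for (int overlayDoc = 0; overlayDoc < primaryIdOfOverlayDoc.length; overlayDoc++) {
            // Assumption: if several overlay docs target one primary doc, the latest wins.
            primaryToOverlay[primaryIdOfOverlayDoc[overlayDoc]] = overlayDoc;
        }
    }

    /** Returns the overlay docId for a primary docId, or NO_OVERLAY. */
    int overlayDoc(int primaryDoc) {
        return primaryToOverlay[primaryDoc];
    }

    boolean hasOverlay(int primaryDoc) {
        return primaryToOverlay[primaryDoc] != NO_OVERLAY;
    }
}
```

A dense array costs one int per primary document; for segments with few updates a sparse map might be the better trade-off.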

Now, when consumers of the 4D API ask for specific data, the "overlay 
readers" would first re-map the primary's docId to the overlay's docId, and 
check whether overlay data exists for that docId and this type of data (e.g. 
postings, stored fields, vectors). If it does, they would return this data 
instead of the original; otherwise they would return the original data.
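
For stored fields, the lookup-with-fallback could be sketched roughly as follows. This is plain Java with documents modelled as maps, not the real codec reader API; the class name is hypothetical, and falling back per field (rather than per document) is my assumption based on the overlay document holding only the updated fields.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of an "overlay reader" for stored fields; not a Lucene class. */
final class OverlayStoredFieldsSketch {
    private final List<Map<String, String>> primaryDocs; // indexed by primary docId
    private final List<Map<String, String>> overlayDocs; // indexed by overlay docId
    private final int[] primaryToOverlay;                // -1 means "no overlay document"

    OverlayStoredFieldsSketch(List<Map<String, String>> primaryDocs,
                              List<Map<String, String>> overlayDocs,
                              int[] primaryToOverlay) {
        this.primaryDocs = primaryDocs;
        this.overlayDocs = overlayDocs;
        this.primaryToOverlay = primaryToOverlay;
    }

    /** Returns the overlay value of a field if one exists, else the original value. */
    String get(int primaryDoc, String field) {
        int overlayDoc = primaryToOverlay[primaryDoc]; // re-map primary id to overlay id
        if (overlayDoc >= 0) {
            String updated = overlayDocs.get(overlayDoc).get(field);
            if (updated != null) {
                return updated;
            }
        }
        return primaryDocs.get(primaryDoc).get(field);
    }
}
```

Postings and term vectors would need the same kind of wrapper, although merging overlaid postings into a term's enumeration is clearly more involved than this per-document lookup.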

One obvious performance issue with this approach is that the sequential access 
to primary data would translate into random access to the overlay data. This 
could be solved by sorting the overlay index so that at least the overlay ids 
increase monotonically as primary ids do.
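
As a rough illustration of that sorting step, the sketch below computes the order in which overlay documents would have to be rewritten so that their ids increase with the primary ids. It only computes the permutation; actually rewriting the overlay index in that order is not shown, and the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch: orders overlay docs by the primary docId they update. */
final class OverlaySortSketch {
    /**
     * @param primaryIdOfOverlayDoc for each current overlay docId, its target primary docId
     * @return current overlay docIds listed in increasing order of primary docId
     */
    static int[] sortedOrder(int[] primaryIdOfOverlayDoc) {
        Integer[] order = new Integer[primaryIdOfOverlayDoc.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Arrays.sort on object arrays is stable, so ties keep their original order.
        Arrays.sort(order, (Integer a, Integer b) ->
                Integer.compare(primaryIdOfOverlayDoc[a], primaryIdOfOverlayDoc[b]));
        int[] result = new int[order.length];
        for (int i = 0; i < order.length; i++) {
            result[i] = order[i];
        }
        return result;
    }
}
```

Position i of the result names the current overlay document that would become overlay document i after sorting.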

Updates to the primary index would be handled as usual, i.e. segment merges 
would just work as usual (since the segments with updates would pretend to have 
no overlays); only the overlay index would have to be deleted once the primary 
segment is deleted after merge.

Updates to the existing documents that already had some fields updated would be 
again handled as usual, only underneath they would open an IndexWriter on the 
overlay index for a specific segment.

That's the broad idea. Feel free to chime in - I started some coding at the 
codec level but got stuck using the approach in LUCENE-3836. The approach that 
uses a modified SegmentReader seems more promising.
