[jira] [Commented] (LUCENE-3837) A modest proposal for updateable fields

Robert Muir (Commented) (JIRA) Thu, 01 Mar 2012 10:24:21 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220208#comment-13220208
 ]


Robert Muir commented on LUCENE-3837:
-------------------------------------

{quote}
Ad 1. I don't think it's such a big deal, we already return approximate stats 
(too high counts) in presence of deletes. I think we should go all the way, at 
least initially, and ignore stats from an overlay completely, unless the data 
is present only in the overlay - e.g. for terms not present in the main index.
{quote}

I disagree: it may not be a big deal for DefaultSimilarity, but it's important 
for other scoring implementations. It's extremely important that we get this
stuff right from the start, before committing anything!

Large problems can result when the statistics are inconsistent with what is 
'discovered' in the docsenum. This is because many scoring models expect
certain relationships to hold true: for example, that a single doc's tf value 
won't exceed totalTermFreq. We already had to do significant work to ensure
consistency, though in some cases the problems could not be totally solved 
(BasicModelD, BasicModelP, BasicModelBE+NormalizationH3, etc) and we
unfortunately had to resort to leaving only warnings in the javadocs.
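
As an illustration only, the relationships in question can be sketched as a 
small check. StatsCheck is a hypothetical helper, not a Lucene class, and the 
parameter names simply mirror the statistics discussed here. It shows how a tf 
read from an overlay can break the relationship when the stats still come from 
the main index alone:

```java
// Hypothetical helper, not part of Lucene: checks relationships that
// scoring models commonly assume between term and field statistics.
final class StatsCheck {
    static boolean consistent(long tf, long docFreq, long totalTermFreq,
                              long docCount, long maxDoc) {
        return tf <= totalTermFreq        // one doc's tf cannot exceed the term's total
            && docFreq <= totalTermFreq   // every matching doc contributes at least 1
            && docFreq <= docCount        // docs with the term are a subset of docs with the field
            && docCount <= maxDoc;        // docs with the field are a subset of all docs
    }
}
```

For example, if the main index reports totalTermFreq=3 for a term but an 
overlay document returns tf=5 from the docsenum, the first condition fails 
even though each number looks plausible on its own.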

I'm fairly certain that in all cases we avoid things like NaN or negative 
scores, but it's awful too when the function 'inverts relevance'.

So I think we need a consistent model for stats: that's why I lean towards 
maxDoc(field), which is consistent in every way with how we handle
deletes, and it won't yield any surprises.
                
> A modest proposal for updateable fields
> ---------------------------------------
>
>                 Key: LUCENE-3837
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3837
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: core/index
>    Affects Versions: 4.0
>            Reporter: Andrzej Bialecki 
>
> I'd like to propose a simple design for implementing updateable fields in 
> Lucene. This design has some limitations, so I'm not claiming it will be 
> appropriate for every use case, and it's obvious it has some performance 
> consequences, but at least it's a start...
> This proposal uses a concept of "overlays" or "stacked updates", where the 
> original data is not removed but instead it's overlaid with the new data. I 
> propose to reuse as much of the existing APIs as possible, and represent 
> updates as an IndexReader. Updates to documents in a specific segment would 
> be collected in an "overlay" index specific to that segment, i.e. there would 
> be as many overlay indexes as there are segments in the primary index. 
> A field update would be represented as a new document in the overlay index. 
> The document would consist of just the updated fields, plus a field that 
> records the id in the primary segment of the document affected by the update. 
> These updates would be processed as usual via secondary IndexWriter-s, as 
> many as there are primary segments, so the same analysis chains would be 
> used, the same field types, etc.
> On opening a segment with updates the SegmentReader (see also LUCENE-3836) 
> would check for the presence of the "overlay" index, and if so it would open 
> it first (as an AtomicReader? or it would open individual codec format 
> readers? perhaps it should load the whole thing into memory?), and it would 
> construct an in-memory map between the primary's docId-s and the overlay's 
> docId-s. And finally it would wrap the original format readers with "overlay 
> readers", initialized also with the id map.
> Now, when consumers of the 4D API would ask for specific data, the "overlay 
> readers" would first re-map the primary's docId to the overlay's docId, and 
> check whether overlay data exists for that docId and this type of data (e.g. 
> postings, stored fields, vectors) and return this data instead of the 
> original. Otherwise they would return the original data.
> One obvious performance issue with this approach is that the sequential 
> access to primary data would translate into random access to the overlay 
> data. This could be solved by sorting the overlay index so that at least the 
> overlay ids increase monotonically as primary ids do.
> Updates to the primary index would be handled as usual, i.e. segment merges 
> (since the segments with updates would pretend to have no overlays) would just 
> work as usual, only the overlay index would have to be deleted once the 
> primary segment is deleted after merge.
> Updates to the existing documents that already had some fields updated would 
> be again handled as usual, only underneath they would open an IndexWriter on 
> the overlay index for a specific segment.
> That's the broad idea. Feel free to pipe in - I started some coding at the 
> codec level but got stuck using the approach in LUCENE-3836. The approach 
> that uses a modified SegmentReader seems more promising.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
