On 11/10/09 1:57 AM, Michael McCandless wrote:

I think this is exactly what happens? I wrote a small test program that
creates a situation like the one mentioned above in the "expungeDeletes"
scenario. It ends up with a docstore containing docs from two segments, but
after expungeDeletes only one segment references the docstore. The
non-deleted docs from the other segment end up in a new segment, so they are
stored twice on disk (once orphaned in the old docstore, once in the new
segment).
Is that the desired behavior?

Right, this is what happens -- since segment C wasn't merged, it
remains as the only segment still referencing the shared doc stores,
and, yes, this does result in duplicate storage for some docs (until C
is merged away).  IFD keeps track of whether a given set of doc stores
is still referenced.
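
To make the scenario concrete, here is a toy Python model of it. This is not Lucene code: Segment, merge and referenced_stores are made-up names, and the doc ids are invented. It only illustrates the bookkeeping described above: a shared doc store stays on disk as long as any segment references it, so live docs copied into a new segment are stored twice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str
    doc_store: str        # name of the doc store this segment reads from
    live_docs: frozenset  # ids of the non-deleted docs in this segment


def referenced_stores(segments):
    """Doc stores that at least one segment still points at."""
    return {seg.doc_store for seg in segments}


def merge(segments, to_merge, new_name):
    """Rewrite `to_merge` into one new segment with its own doc store."""
    live = frozenset().union(*(s.live_docs for s in to_merge))
    merged = Segment(new_name, doc_store=new_name, live_docs=live)
    return [s for s in segments if s not in to_merge] + [merged]


# Shared doc store "_0" physically holds docs 0..3 from segments B and C.
store_contents = {"_0": {0, 1, 2, 3}}
b = Segment("B", "_0", frozenset({0}))     # doc 1 is deleted
c = Segment("C", "_0", frozenset({2, 3}))  # no deletes, so not merged

# Expunging deletes rewrites only B; its live docs go to a new store "D".
after = merge([b, c], [b], "D")
store_contents["D"] = set(after[-1].live_docs)

still_referenced = referenced_stores(after)
duplicated = store_contents["_0"] & store_contents["D"]
print(still_referenced, duplicated)
```

In this model "_0" survives because C still references it, and doc 0 exists both in "_0" and in "D" until C itself is merged away.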


OK, thanks for clarifying!

I think in practice this should not result in too much duplication.
If C is large, it's likely to have accumulated deletes as well.  If C
is small, it's likely to get merged away in the course of normal
merging.


I agree - it shouldn't happen very often. I just wasn't sure what the current behavior in this corner case was and wanted to understand it.

But, if we are really concerned with it, we could modify the merge
policy to bias its selection on this ("remove stores that are wasting
too much space") basis.
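
A rough sketch of what such a bias could look like, again as a toy Python model rather than an actual MergePolicy. The names wasted_fraction and segments_to_bias, the 0.5 threshold, and the definition of "wasted" (docs in a store that no referencing segment still needs) are all assumptions made for illustration.

```python
def wasted_fraction(store_docs, live_doc_sets):
    """Fraction of docs in a store that no referencing segment still needs."""
    needed = set()
    for live_docs in live_doc_sets:
        needed |= live_docs
    return 1.0 - len(needed & store_docs) / len(store_docs)


def segments_to_bias(stores, segments, threshold=0.5):
    """Names of segments whose doc store wastes more than `threshold`."""
    picked = []
    for name, (store, _live) in segments.items():
        users = [live for (s, live) in segments.values() if s == store]
        if wasted_fraction(stores[store], users) > threshold:
            picked.append(name)
    return sorted(picked)


# Store "_0" holds docs 0..5, but only C (docs 4 and 5) still references it.
stores = {"_0": {0, 1, 2, 3, 4, 5}, "D": {0, 3}}
segments = {"C": ("_0", {4, 5}), "D": ("D", {0, 3})}

candidates = segments_to_bias(stores, segments)
print(candidates)
```

Here two thirds of "_0" is dead weight, so C would be favored for merging, while D's private store wastes nothing.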

I'm not too concerned, because I also don't think this should happen very often.

I think this makes the parallel index job simpler, right?  Ie, how
the segments are sharing the stores within their own index does not
restrict what merging is done.


Yes exactly. It won't prevent us from keeping the parallel indexes independent in this regard.

Then the compound (.cfx and .cfs) files are rather orthogonal to this. I talked to Marvin at ApacheCon; in Lucy he wants to have all the compound file support in the store package, separate from the indexer. I think that would make sense in Lucene too; there's no real need to have it tightly integrated in the IndexWriter and SegmentMerger. We can generalize the compound file concept further, so that with parallel indexes the files can be selected in either direction for inclusion in a compound file.

E.g. if we separated the inverted index and store, so that they are logically two parallel index components, then the .cfx file as it works now would contain files from two parallel index components (term vectors from inverted index, stored fields from the store). This is fine if you don't want to update those components individually and can remain this way for the default IndexWriter implementation. But if we generalize the compound concept, then people can alter this behavior to better suit their update requirements.

I think this would actually be a very clean design (even though it might sound complicated here).
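
As a small illustration of the idea, the grouping of files into compound files could be a pluggable function. The sketch below is a toy Python model, not a proposed API: COMPONENT_OF, default_grouping, per_component_grouping and build_compounds are hypothetical names, and the assignment of extensions to components just follows the split described above (term vectors with the inverted index, stored fields with the store).

```python
# Hypothetical mapping from file extension to parallel index component.
COMPONENT_OF = {
    "fdt": "store", "fdx": "store",                        # stored fields
    "tvx": "inverted", "tvd": "inverted", "tvf": "inverted",  # term vectors
    "tis": "inverted", "tii": "inverted",                  # term dictionary
    "frq": "inverted", "prx": "inverted",                  # postings
}

# Files that go into the shared doc store compound file today.
DOC_STORE_EXTENSIONS = {"fdt", "fdx", "tvx", "tvd", "tvf"}


def default_grouping(ext):
    """Current layout: doc store files in .cfx, everything else in .cfs."""
    return "cfx" if ext in DOC_STORE_EXTENSIONS else "cfs"


def per_component_grouping(ext):
    """Alternative: one compound file per parallel index component."""
    return COMPONENT_OF[ext]


def build_compounds(extensions, grouping):
    """Group file extensions into compound files using `grouping`."""
    compounds = {}
    for ext in sorted(extensions):
        compounds.setdefault(grouping(ext), []).append(ext)
    return compounds


default_layout = build_compounds(COMPONENT_OF, default_grouping)
component_layout = build_compounds(COMPONENT_OF, per_component_grouping)
print(default_layout)
print(component_layout)
```

With the default grouping the .cfx mixes files from both components; with the per-component grouping the store can be replaced without touching the inverted index files.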

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



