On 11/10/09 1:57 AM, Michael McCandless wrote:
I think this is exactly what happens? I wrote a small test program that
creates a situation like the one mentioned above in the "expungeDeletes" scenario. It
ends up with a docstore containing docs from two segments, but after
expungeDeletes only one segment references the docstore. The non-deleted
docs from the other segment end up in a new segment, so they are stored twice on
disk (once orphaned in the old docstore, once in the new segment).
Is that the desired behavior?
Right this is what happens -- since segment C wasn't merged, it
remains as the only segment still referencing the shared doc stores,
and, yes, this does result in duplicate storage for some docs (until C
is merged away). IFD keeps track of whether a given set of doc stores
is still referenced.
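
To make the reference tracking concrete, here is a minimal toy sketch of the idea. It is not Lucene's actual IndexFileDeleter code; the class, method, segment, and store names are all hypothetical. A shared doc store only becomes deletable once the last segment referencing it is gone, which is why the docs are stored twice in the meantime.

```python
from collections import defaultdict


class DocStoreRefs:
    """Toy tracker (hypothetical): which segments still reference which doc store."""

    def __init__(self):
        self._refs = defaultdict(set)

    def add_segment(self, segment, store):
        self._refs[store].add(segment)

    def remove_segment(self, segment):
        """Drop a segment and return the stores that are no longer referenced."""
        freed = []
        for store, segments in list(self._refs.items()):
            segments.discard(segment)
            if not segments:
                del self._refs[store]
                freed.append(store)
        return freed

    def referenced_stores(self):
        return set(self._refs)


refs = DocStoreRefs()

# Two segments share one doc store.
refs.add_segment("B", "store_0")
refs.add_segment("C", "store_0")

# expungeDeletes rewrites B into a new segment D with its own store.
refs.add_segment("D", "store_1")
freed_after_expunge = refs.remove_segment("B")

# C still references store_0, so B's surviving docs exist in both stores.
# Only once C is merged away (here into E) can store_0 be removed.
refs.add_segment("E", "store_2")
freed_after_c_merged = refs.remove_segment("C")
```

In this sketch nothing is freed by the expunge itself; store_0 is only released when C goes away.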
OK, thanks for clarifying!
I think in practice this should not result in too much duplication.
If C is large, it's likely to have accumulated deletes as well. If C
is small, it's likely to get merged away in the course of normal
merging.
I agree - it shouldn't happen very often. I just wasn't sure what the
current behavior in this corner case was and wanted to understand it.
But, if we are really concerned with it, we could modify the merge
policy to bias its selection on this ("remove stores that are wasting
too much space") basis.
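
One way such a bias could look, as a rough sketch only: estimate how much of each shared doc store is no longer referenced by any live segment, and prefer merging the segments that keep a mostly wasted store alive. The data layout, the function names, and the 0.5 threshold are assumptions for illustration, not anything in the current merge policy.

```python
def wasted_fraction(store_doc_count, referencing_segments):
    """Fraction of a doc store's docs that no live segment references."""
    if store_doc_count == 0:
        return 0.0
    live = sum(seg["doc_count"] for seg in referencing_segments)
    return 1.0 - live / store_doc_count


def segments_to_bias(stores, threshold=0.5):
    """Names of segments that keep a mostly wasted doc store alive."""
    picked = []
    for store in stores:
        if wasted_fraction(store["doc_count"], store["segments"]) > threshold:
            picked.extend(seg["name"] for seg in store["segments"])
    return picked


# Hypothetical state: store_0 once held 1000 docs, but only C's 200 docs
# still point into it; store_1 is fully used by D.
stores = [
    {"name": "store_0", "doc_count": 1000,
     "segments": [{"name": "C", "doc_count": 200}]},
    {"name": "store_1", "doc_count": 800,
     "segments": [{"name": "D", "doc_count": 800}]},
]
candidates = segments_to_bias(stores)
```

Here only C would be favored for merging, since merging it away frees the mostly orphaned store.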
I'm not too concerned, because I also don't think this should happen
very often.
I think this makes the parallel index's job simpler, right? I.e., how
the segments are sharing the stores within their own index does not
restrict what merging is done.
Yes exactly. It won't prevent us from keeping the parallel indexes
independent in this regard.
Then the compound (.cfx and .cfs) files are rather orthogonal to this. I
talked to Marvin at ApacheCon; in Lucy he wants to have all the compound
file support in the store package, separately from the indexer. I think
that would make sense in Lucene too; there's no real need to have
it tightly integrated in the IndexWriter and SegmentMerger. We can
generalize the compound file concept further, so that with parallel
indexes the files can be selected in either direction for inclusion in a
compound file.
E.g. if we separated the inverted index and store, so that they are
logically two parallel index components, then the .cfx file as it works
now would contain files from two parallel index components (term vectors
from inverted index, stored fields from the store). This is fine if you
don't want to update those components individually and can remain this
way for the default IndexWriter implementation. But if we generalize the
compound concept, then people can alter this behavior to better suit
their update requirements.
I think this would actually be a very clean design (even though it might
sound complicated here).
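
A small sketch of what "selecting files in either direction" might mean, under the assumption that the inverted index and the store are treated as two parallel components. The assignment of file extensions to components and the function names are hypothetical; only the grouping idea is the point.

```python
from collections import defaultdict

# Hypothetical assignment of file extensions to two parallel components.
COMPONENT_OF = {
    "tis": "inverted", "tii": "inverted", "frq": "inverted", "prx": "inverted",
    "tvx": "inverted", "tvd": "inverted", "tvf": "inverted",
    "fdt": "store", "fdx": "store",
}

# Files bundled together today: stored fields plus term vectors.
DOC_STORE_EXTS = {"fdt", "fdx", "tvx", "tvd", "tvf"}


def extension(file_name):
    return file_name.rsplit(".", 1)[1]


def group_like_cfx(files):
    """Current-style grouping: doc store files in one bundle, the rest in another."""
    groups = defaultdict(list)
    for name in sorted(files):
        key = "cfx" if extension(name) in DOC_STORE_EXTS else "cfs"
        groups[key].append(name)
    return dict(groups)


def group_by_component(files):
    """Alternative grouping: one bundle per parallel index component."""
    groups = defaultdict(list)
    for name in sorted(files):
        groups[COMPONENT_OF[extension(name)]].append(name)
    return dict(groups)


files = ["_0.tis", "_0.frq", "_0.tvx", "_0.fdt", "_0.fdx"]
by_cfx = group_like_cfx(files)
by_component = group_by_component(files)
```

With the first grouping the term vectors travel with the stored fields; with the second they stay with the inverted index, so each component could be updated individually.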
Mike
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org