Re: Questions about doc store files (.cfx)

Michael Busch Tue, 10 Nov 2009 10:19:10 -0800

On 11/10/09 1:57 AM, Michael McCandless wrote:

I think this is exactly what happens? I wrote a small test program that
creates a situation like mentioned above in the "expungeDelete" scenario. It
ends up with a docstore containing docs from two segments, but after
expungeDeletes only one segment references the docstore. The non-deleted
docs from the other segment end up in a new segment, so they are twice on
disk (once orphaned in the old docstore, once in the new segment).
Is that the desired behavior?

Right this is what happens -- since segment C wasn't merged, it
remains as the only segment still referencing the shared doc stores,
and, yes, this does result in duplicate storage for some docs (until C
is merged away).  IFD keeps track of whether a given set of doc stores
is still referenced.


OK, thanks for clarifying!
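
To make the scenario concrete, here is a minimal sketch of the bookkeeping described above: a shared doc store stays on disk as long as at least one segment still references it. This is not IndexFileDeleter's actual code; the class name, method names, segment names and the store name "_0" are all hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not a Lucene class.
class DocStoreRefSketch {
    // doc store name -> names of the segments that still reference it
    private final Map<String, Set<String>> refs = new HashMap<>();

    void addSegment(String segment, String docStore) {
        refs.computeIfAbsent(docStore, k -> new TreeSet<>()).add(segment);
    }

    // Drops a segment (e.g. because it was merged away).
    // Returns true if the doc store became unreferenced and could be deleted.
    boolean removeSegment(String segment, String docStore) {
        Set<String> users = refs.get(docStore);
        if (users == null) {
            return false;
        }
        users.remove(segment);
        if (users.isEmpty()) {
            refs.remove(docStore);
            return true;
        }
        return false;
    }

    boolean isReferenced(String docStore) {
        return refs.containsKey(docStore);
    }

    public static void main(String[] args) {
        DocStoreRefSketch tracker = new DocStoreRefSketch();
        // Segments B and C share one doc store.
        tracker.addSegment("B", "_0");
        tracker.addSegment("C", "_0");
        // B had deletes and is rewritten into a new segment with its own store.
        boolean deletable = tracker.removeSegment("B", "_0");
        System.out.println("deletable after B is rewritten: " + deletable);
        // Only once C is merged away does the shared store become unreferenced.
        deletable = tracker.removeSegment("C", "_0");
        System.out.println("deletable after C is merged away: " + deletable);
    }
}
```

In this sketch the non-deleted docs of B exist twice (in the old shared store and in the new segment) for exactly as long as C keeps the old store referenced.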

I think in practice this should not result in too much duplication.
If C is large, it's likely to have accumulated deletes as well.  If C
is small, it's likely to get merged away in the course of normal
merging.

I agree - it shouldn't happen very often. I just wasn't sure what the
current behavior in this corner case was and wanted to understand it.

But, if we are really concerned with it, we could modify the merge
policy to bias its selection on this ("remove stores that are wasting
too much space") basis.

I'm not too concerned, because I also don't think this should happen
very often.

I think this makes the parallel index job simpler, right?  Ie, how
the segments are sharing the stores within their own index does not
restrict what merging is done.

Yes exactly. It won't prevent us from keeping the parallel indexes
independent in this regard.

Then the compound (.cfx and .cfs) files are rather orthogonal to this. I
talked to Marvin at ApacheCon; in Lucy he wants to have all the compound
file support in the store package, separately from the indexer. I think
that would make sense in Lucene too; there's not really a need to have it
tightly integrated in the IndexWriter and SegmentMerger. We can
generalize the compound file concept further, so that with parallel
indexes the files can be selected in either direction for inclusion in a
compound file.

E.g. if we separated the inverted index and store, so that they are
logically two parallel index components, then the .cfx file as it works
now would contain files from two parallel index components (term vectors
from inverted index, stored fields from the store). This is fine if you
don't want to update those components individually and can remain this
way for the default IndexWriter implementation. But if we generalize the
compound concept, then people can alter this behavior to better suit
their update requirements.

I think this would actually be a very clean design (even though it might
sound complicated here).
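
As a rough sketch of what "generalizing the compound concept" could look like, the grouping of files into compound files can be expressed as a pluggable policy that maps file extensions to a target compound file. Everything here is an assumption made for illustration: the class, the `group` method and the compound names "store" and "inverted" are hypothetical and not part of Lucene. Only the extensions (.fdt/.fdx for stored fields, .tvx/.tvd/.tvf for term vectors) are the real doc store file types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not a Lucene class.
class CompoundGroupingSketch {
    // Groups files by the compound file the policy assigns to their extension.
    // Files whose extension is not in the policy end up under "(standalone)".
    static Map<String, List<String>> group(List<String> files,
                                           Map<String, String> policy) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String file : files) {
            int dot = file.lastIndexOf('.');
            String ext = dot < 0 ? "" : file.substring(dot + 1);
            String target = policy.getOrDefault(ext, "(standalone)");
            grouped.computeIfAbsent(target, k -> new ArrayList<>()).add(file);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> files =
            Arrays.asList("_0.fdt", "_0.fdx", "_0.tvx", "_0.tvd", "_0.tvf");

        // Behavior as it works now: stored fields and term vectors share one .cfx.
        Map<String, String> current = new HashMap<>();
        for (String ext : Arrays.asList("fdt", "fdx", "tvx", "tvd", "tvf")) {
            current.put(ext, "cfx");
        }

        // Alternative: one compound file per parallel index component.
        Map<String, String> perComponent = new HashMap<>();
        for (String ext : Arrays.asList("fdt", "fdx")) {
            perComponent.put(ext, "store");
        }
        for (String ext : Arrays.asList("tvx", "tvd", "tvf")) {
            perComponent.put(ext, "inverted");
        }

        System.out.println(group(files, current));
        System.out.println(group(files, perComponent));
    }
}
```

With the second policy, the stored fields and the term vectors land in separate compound files, so each component could be replaced without touching the other.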

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org


