I'm particularly interested in whether Mr. McCandless has any opinions here.

I admit it took some work, but I can create an index that never merges
and is 80% deleted documents using TieredMergePolicy.

I'm trying to understand how indexes "in the wild" can wind up with
> 30% deleted documents. I think the root issue is that
TieredMergePolicy doesn't consider for merging any segment whose
non-deleted documents occupy more than 50% of maxMergedSegmentMB.
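
For reference, maxMergedSegmentMB is the knob on TieredMergePolicy
itself. A minimal setup looks something like this (a sketch, not my
test code; the index path is a made-up placeholder):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.FSDirectory;

public class TmpSetup {
  public static void main(String[] args) throws Exception {
    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setMaxMergedSegmentMB(5 * 1024);   // 5G, the default
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setMergePolicy(tmp);
    try (IndexWriter w = new IndexWriter(FSDirectory.open(Paths.get("/some/index/dir")), iwc)) {
      // index/delete/commit as usual; merge selection is up to tmp
    }
  }
}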

Let's say I have segments at the default 5G max. For the sake of
argument, it takes exactly 5,000,000 identically-sized documents to
fill the segment to exactly 5G.

IIUC, as long as the segment has more than 2,500,000 non-deleted
documents in it, it'll never be eligible for merging. The only way to
force deleted docs to be purged is expungeDeletes or optimize, neither
of which is recommended.
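
To make that arithmetic concrete, here's a little stand-alone sketch
of the eligibility rule as I understand it (my paraphrase of the
behavior, not the actual TieredMergePolicy code):

public class EligibilitySketch {
  // My understanding: a segment is skipped for normal merging once its
  // *live* bytes exceed half of the max merged segment size.
  static boolean eligibleForMerging(long segmentBytes, double deletedFraction,
                                    long maxMergedSegmentBytes) {
    long liveBytes = (long) (segmentBytes * (1.0 - deletedFraction));
    return liveBytes <= maxMergedSegmentBytes / 2;
  }

  public static void main(String[] args) {
    long maxMerged = 5L * 1024 * 1024 * 1024;   // the 5G default
    long segBytes  = maxMerged;                 // a segment filled by 5,000,000 identical docs
    double deleted = 2_400_000 / 5_000_000.0;   // 48% of its docs deleted
    System.out.println(eligibleForMerging(segBytes, deleted, maxMerged));
    // prints false: ~2,600,000 live docs is still more than 2,500,000
    // (over 2.5G live), so the segment is never picked up even at
    // nearly 50% deletes
  }
}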

The condition I created was highly artificial but illustrative:
- I set my max segment size to 20M
- Through experimentation I found that each segment would hold roughly
160K synthetic docs.
- I set my ramBuffer to 1G.
- Then I'd index 500K docs, delete 400K of them, and commit. This
produces a single segment occupying roughly 80M of disk space, 15M or
so of it "live" documents, the rest deleted.
- Rinse and repeat with a disjoint set of doc IDs.

The number of segments continues to grow forever, each one consisting
of 80% deleted documents.
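
In case it helps anyone reproduce this, the loop was essentially the
following (a sketch from memory, not my exact code; the "id" and
"body" field names and the index path are placeholders):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.FSDirectory;

public class NeverMerges {
  public static void main(String[] args) throws Exception {
    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setMaxMergedSegmentMB(20);        // tiny max segment size
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setMergePolicy(tmp);
    iwc.setRAMBufferSizeMB(1024);         // 1G ram buffer

    try (IndexWriter w = new IndexWriter(FSDirectory.open(Paths.get("/some/index/dir")), iwc)) {
      for (int pass = 0; pass < 50; pass++) {   // run as many passes as you have patience for
        int base = pass * 500_000;              // disjoint doc IDs each pass
        for (int i = 0; i < 500_000; i++) {
          Document doc = new Document();
          doc.add(new StringField("id", Integer.toString(base + i), Field.Store.YES));
          doc.add(new TextField("body", "synthetic text for doc " + (base + i), Field.Store.NO));
          w.addDocument(doc);
        }
        for (int i = 0; i < 400_000; i++) {     // delete 80% of what was just added
          w.deleteDocuments(new Term("id", Integer.toString(base + i)));
        }
        w.commit();   // each pass leaves behind a segment that's ~80% deleted docs
      }
    }
  }
}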

This artificial setup just let me watch how the segments merged.
Without such artificial constraints, I suspect the percentage of
deleted documents would be capped at 50% in theory and somewhat less
in practice, although I have seen 35% or so deleted documents in the
wild.

So at the end of the day I have a couple of questions:

1> Is my understanding close to correct? This is really the first time
I've had to dive into the guts of merging.

2> Is there a way I've missed to slim down an index other than
expungeDeletes or optimize/forceMerge?

It seems to me that eventually, with large indexes, every segment at
the max allowed size is going to have to go over 50% deletes before
being merged, and there will have to be at least two of them. I don't
see a clean way to fix this; any algorithm would likely be far too
expensive to be part of regular merging. I suppose we could merge
segments of different sizes if the combined size was < max segment
size. On a quick glance it doesn't seem like the log merge policies
address this kind of case either, but I haven't dug into them much.
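
If that idea were worth pursuing, the selection pass might look
something like this (purely a sketch of the idea, nothing like actual
merge-policy code; SegInfo is a made-up stand-in for the per-segment
info a real policy sees):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OversizedPairing {
  // Made-up stand-in for the per-segment information a merge policy sees.
  static final class SegInfo {
    final String name;
    final long liveBytes;   // bytes attributable to non-deleted docs
    SegInfo(String name, long liveBytes) { this.name = name; this.liveBytes = liveBytes; }
  }

  // Greedy idea: among segments that are individually "too big to merge",
  // pair any two whose combined live bytes still fit under the cap.
  static List<List<SegInfo>> pickOversizedMerges(List<SegInfo> oversized, long maxMergedBytes) {
    List<SegInfo> sorted = new ArrayList<>(oversized);
    sorted.sort((a, b) -> Long.compare(b.liveBytes, a.liveBytes));   // largest first
    boolean[] used = new boolean[sorted.size()];
    List<List<SegInfo>> merges = new ArrayList<>();
    for (int i = 0; i < sorted.size(); i++) {
      if (used[i]) continue;
      for (int j = i + 1; j < sorted.size(); j++) {
        if (!used[j] && sorted.get(i).liveBytes + sorted.get(j).liveBytes <= maxMergedBytes) {
          merges.add(Arrays.asList(sorted.get(i), sorted.get(j)));
          used[i] = used[j] = true;
          break;
        }
      }
    }
    return merges;
  }
}

Even something that simple would have to decide when to run and how
much I/O it's worth, which is exactly the "too expensive" part I'm
worried about.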

Thanks!
Erick
