That makes sense.

I guess the alternative would be to occasionally roll the dice and decide to 
merge that kind of segment.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 28, 2017, at 1:28 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> I don't think jitter would help. As long as a segment has more than 50% of the max
> segment size in "live" docs, it's forever ineligible for merging (outside of
> optimize or expungeDeletes commands). So the "zone" is anything over
> 50%.
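> 
> Roughly, the rule I'm describing amounts to something like this (a sketch
> with made-up names, not the actual TieredMergePolicy code):
> 
>     long maxMergedSegmentBytes = 5L << 30; // default max segment size, ~5GB
> 
>     // A segment whose "live" (un-deleted) bytes exceed half the max segment
>     // size is never selected by normal merging again.
>     boolean eligibleForNaturalMerge(long segmentBytes, double pctDeleted) {
>       long liveBytes = (long) (segmentBytes * (1.0 - pctDeleted));
>       return liveBytes <= maxMergedSegmentBytes / 2;
>     }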
> 
> Or I missed your point.
> 
> Erick
> 
> On Mon, Aug 28, 2017 at 12:50 PM, Walter Underwood
> <wun...@wunderwood.org> wrote:
>> If this happens in a precise zone, how about adding some random jitter to
>> the threshold? That tends to get this kind of lock-up unstuck.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>> On Aug 28, 2017, at 12:44 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>> 
>> And one more thought (not very well thought out).
>> 
>> A parameter on TMP (or whatever) that did <3>, something like:
>> 
>> - a parameter <autoCompactTime>
>> - a parameter <autoCompactPct>
>> - on startup TMP takes the current timestamp
>> - every minute (or whatever) it checks the current timestamp and, if
>>   <autoCompactTime> falls between the last check time and now, does <2>
>> - set the last checked time to the timestamp from the previous check
>> 
>> 
>> Taking the current timestamp at startup would keep from kicking off the
>> compaction right away, so we wouldn't need to keep stateful information
>> across restarts and wouldn't go into a compaction cycle every time a node starts.
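>> 
>> A rough sketch of that check (illustrative only; the class and method names
>> are made up, this isn't an existing TMP API):
>> 
>>     import java.time.LocalDateTime;
>>     import java.time.LocalTime;
>> 
>>     class AutoCompactScheduler {
>>       private final LocalTime autoCompactTime;  // e.g. 01:00
>>       private final double autoCompactPct;      // deleted-docs threshold for <2>
>>       private LocalDateTime lastChecked = LocalDateTime.now(); // taken at startup
>> 
>>       AutoCompactScheduler(LocalTime time, double pct) {
>>         this.autoCompactTime = time;
>>         this.autoCompactPct = pct;
>>       }
>> 
>>       // Called every minute (or whatever).
>>       void maybeCompact() {
>>         LocalDateTime now = LocalDateTime.now();
>>         LocalDateTime scheduled = now.toLocalDate().atTime(autoCompactTime);
>>         // Fire only if the scheduled time fell between the last check and now,
>>         // so a restart never immediately kicks off a compaction.
>>         if (scheduled.isAfter(lastChecked) && !scheduled.isAfter(now)) {
>>           compactSegmentsOver(autoCompactPct); // i.e. do <2>
>>         }
>>         lastChecked = now;
>>       }
>> 
>>       private void compactSegmentsOver(double pctDeleted) {
>>         // rewrite/merge segments whose deleted-doc percentage exceeds pctDeleted
>>       }
>>     }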
>> 
>> Erick
>> 
>> On Sun, Aug 27, 2017 at 11:31 AM, Erick Erickson
>> <erickerick...@gmail.com> wrote:
>> 
>> I've been thinking about this a little more. Since this is an outlier,
>> I'm loath to change the core TMP merge selection process. Say the max
>> segment size is 5G. You'd be doing an awful lot of I/O to merge a
>> segment with 4.75G of "live" docs with one with 0.25G. Plus that doesn't
>> really allow users who issue the tempting "optimize" command to
>> recover; that one huge segment can hang around for a _very_ long time,
>> accumulating lots of deleted docs. Same with expungeDeletes.
>> 
>> I can think of several approaches:
>> 
>> 1> Despite my comment above, a flag that says something like "if a
>> segment has > X% deleted docs, merge it with a smaller segment anyway,
>> respecting the max segment size. I know, I know, this will affect
>> indexing throughput; do it anyway".
>> 
>> 2> A special op (or perhaps a flag on expungeDeletes) that would
>> behave like <1> but on-demand rather than part of standard merging.
>> 
>> In both of these cases, if a segment had > X% deleted docs but the
>> live doc size for that segment was > the max seg size, rewrite it into
>> a single new segment, removing deleted docs (there's a rough sketch
>> after this list).
>> 
>> 3> some way to do the above on a schedule. My notion is something like
>> a maintenance window at 1:00 AM. You'd still have a live collection,
>> but (presumably) a way to purge the day's accumulation of deleted
>> documents during off hours.
>> 
>> 4> ???
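>> 
>> A rough sketch of the shape <1>/<2> might take (pctDeleted(), liveBytes(),
>> and pickPartnersUpTo() are hypothetical helpers, not existing Lucene APIs;
>> uses java.util.* and org.apache.lucene.index.SegmentCommitInfo):
>> 
>>     List<List<SegmentCommitInfo>> selectCompactions(List<SegmentCommitInfo> segments,
>>                                                     double maxPctDeleted,
>>                                                     long maxSegmentBytes) {
>>       List<List<SegmentCommitInfo>> merges = new ArrayList<>();
>>       for (SegmentCommitInfo si : segments) {
>>         if (pctDeleted(si) < maxPctDeleted) {
>>           continue; // not enough deletes to bother
>>         }
>>         long liveBytes = liveBytes(si);
>>         if (liveBytes > maxSegmentBytes) {
>>           // Live docs alone exceed the max (e.g. after an optimize):
>>           // rewrite the segment by itself to drop the deleted docs.
>>           merges.add(Collections.singletonList(si));
>>         } else {
>>           // Otherwise pair it with smaller segments as long as the combined
>>           // live size stays under the max segment size.
>>           merges.add(pickPartnersUpTo(si, segments, maxSegmentBytes - liveBytes));
>>         }
>>       }
>>       return merges;
>>     }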
>> 
>> I probably like <2> best so far. I don't see this condition in the
>> wild very often; it usually occurs during heavy re-indexing operations,
>> and often after an optimize or expungeDeletes has happened. <1> could
>> get horribly pathological if the threshold was 1% or something.
>> 
>> WDYT?
>> 
>> 
>> On Wed, Aug 9, 2017 at 2:40 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>> 
>> Thanks Mike:
>> 
>> bq: Or are you saying that each segment's 20% of not-deleted docs is
>> still greater than 1/2 of the max segment size, and so TMP considers
>> them ineligible?
>> 
>> Exactly.
>> 
>> Hadn't seen the blog, thanks for that. Added to my list of things to refer
>> to.
>> 
>> The problem we're seeing is that "in the wild" there are cases where
>> people can now get satisfactory performance from huge numbers of
>> documents, as in close to 2B (there was a question on the users' list
>> about that recently). So allowing up to 60% deleted documents is
>> dangerous in that situation.
>> 
>> And the situation is exacerbated by optimizing (I know, "don't do that").
>> 
>> Ah, well, the joys of people using this open source thing and pushing
>> its limits.
>> 
>> Thanks again,
>> Erick
>> 
>> On Tue, Aug 8, 2017 at 3:49 PM, Michael McCandless
>> <luc...@mikemccandless.com> wrote:
>> 
>> Hi Erick,
>> 
>> Some questions/answers below:
>> 
>> On Sun, Aug 6, 2017 at 8:22 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>> 
>> 
>> Particularly interested if Mr. McCandless has any opinions here.
>> 
>> I admit it took some work, but I can create an index that never merges
>> and is 80% deleted documents using TieredMergePolicy.
>> 
>> I'm trying to understand how indexes "in the wild" can have > 30%
>> deleted documents. I think the root issue here is that
>> TieredMergePolicy doesn't consider for merging any segments > 50% of
>> maxMergedSegmentMB of non-deleted documents.
>> 
>> Let's say I have segments at the default 5G max. For the sake of
>> argument, it takes exactly 5,000,000 identically-sized documents to
>> fill the segment to exactly 5G.
>> 
>> IIUC, as long as the segment has more than 2,500,000 live documents in
>> it, it'll never be eligible for merging.
>> 
>> 
>> 
>> That's right.
>> 
>> 
>> The only way to force deleted
>> docs to be purged is to expungeDeletes or optimize, neither of which
>> is recommended.
>> 
>> 
>> 
>> +1
>> 
>> The condition I created was highly artificial but illustrative:
>> - I set my max segment size to 20M
>> - Through experimentation I found that each segment would hold roughly
>> 160K synthetic docs.
>> - I set my ramBuffer to 1G.
>> - Then I'd index 500K docs, then delete 400K of them, and commit. This
>> produces a single segment occupying (roughly) 80M of disk space, 15M
>> or so of it "live" documents, the rest deleted.
>> - Rinse and repeat with a disjoint set of doc IDs.
>> 
>> The number of segments continues to grow forever, each one consisting
>> of 80% deleted documents.
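>> 
>> Roughly the loop I used (a sketch; it assumes a Directory named dir and the
>> stock StandardAnalyzer, and the field name is made up):
>> 
>>     import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>     import org.apache.lucene.document.*;
>>     import org.apache.lucene.index.*;
>> 
>>     IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
>>     iwc.setRAMBufferSizeMB(1024);            // 1G ramBuffer
>>     TieredMergePolicy tmp = new TieredMergePolicy();
>>     tmp.setMaxMergedSegmentMB(20);           // tiny max segment for the test
>>     iwc.setMergePolicy(tmp);
>> 
>>     try (IndexWriter writer = new IndexWriter(dir, iwc)) {
>>       for (int batch = 0; batch < 100; batch++) {
>>         int base = batch * 500_000;          // disjoint doc IDs per batch
>>         for (int i = 0; i < 500_000; i++) {
>>           Document doc = new Document();
>>           doc.add(new StringField("id", Integer.toString(base + i), Field.Store.NO));
>>           writer.addDocument(doc);
>>         }
>>         for (int i = 0; i < 400_000; i++) {  // delete 80% of the batch
>>           writer.deleteDocuments(new Term("id", Integer.toString(base + i)));
>>         }
>>         writer.commit();
>>       }
>>     }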
>> 
>> 
>> 
>> But wouldn't TMP at some point merge these segments?  Or are you saying that
>> each segment's 20% of not-deleted docs is still greater than 1/2 of the max
>> segment size, and so TMP considers them ineligible?
>> 
>> This is indeed a rather pathological case, and you're right TMP would never
>> merge them (if my logic above is right).  Maybe we could tweak TMP for
>> situations like this, though I'm not sure they happen in practice.  Normally
>> the max segment size is quite a bit larger than the initially flushed
>> segment sizes.
>> 
>> 
>> This artificial situation just allowed me to see how the segments
>> merged. Without such artificial constraints, I suspect the limit for
>> deleted documents would be capped at 50% theoretically, and in practice
>> be less than that, although I have seen 35% or so deleted documents in the
>> wild.
>> 
>> 
>> 
>> Yeah I think so too.  I wrote this blog post about deletions:
>> https://www.elastic.co/blog/lucenes-handling-of-deleted-documents
>> 
>> It has a fun chart showing how the percentage of deleted docs bounces around.
>> 
>> 
>> So at the end of the day I have a couple of questions:
>> 
>> 1> Is my understanding close to correct? This is really the first time
>> I've had to dive into the guts of merging.
>> 
>> 
>> 
>> Yes!
>> 
>> 
>> 2> Is there a way I've missed to slim down an index other than
>> expungeDeletes or optimize/forceMerge?
>> 
>> 
>> 
>> No.
>> 
>> It seems to me like eventually, with large indexes, every segment that
>> is at the max allowed size is going to have to go over 50% deletes before
>> being merged, and there will have to be at least two of them. I don't
>> see a clean way to fix this; any algorithm would likely be far too
>> expensive to be part of regular merging. I suppose we could merge
>> segments of different sizes if the combined size was < the max segment
>> size. On a quick glance it doesn't seem like the log merge policies
>> address this kind of case either, but I haven't dug into them much.
>> 
>> 
>> 
>> TMP should be able to merge one max-sized segment (that has eked just over
>> 50% deleted docs) with smaller segments.  It would not prefer this
>> merge, since merging substantially different segment sizes performs poorly
>> vs. merging equally sized segments, but it does have a bias for
>> removing deleted docs that would offset that.
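>> 
>> Loosely, the trade-off being described (illustrative only, not TMP's actual
>> scoring code; lower score means a more attractive merge; uses java.util.Arrays):
>> 
>>     double mergeScore(long[] liveBytesPerSegment, double pctDeletedInCandidate) {
>>       long largest = Arrays.stream(liveBytesPerSegment).max().getAsLong();
>>       long total = Arrays.stream(liveBytesPerSegment).sum();
>>       double skew = (double) largest / total;        // near 1.0 = lopsided merge, bad
>>       double reclaim = 1.0 - pctDeletedInCandidate;  // more deletes = lower = better
>>       return skew * reclaim;
>>     }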
>> 
>> 
>> Thanks!
>> 
>> 
>> 
>> You're welcome!
>> 
>> Mike McCandless
>> 
>> http://blog.mikemccandless.com
>> 
>> 
>> 
> 
