If this happens in a precise zone, how about adding some random jitter to the threshold? That tends to get this kind of lock-up unstuck.
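
Here is a rough sketch of what I mean, in Python rather than Lucene's Java. It is not TMP's actual code: the function names and the jitter_fraction parameter are made up for illustration, and I'm assuming the threshold in question is the "live size > 50% of max segment size" eligibility cutoff from the thread below. Each check perturbs the cutoff a little, so a segment sitting just above the fixed threshold gets picked up sooner or later instead of staying ineligible forever.

```python
import random


def eligibility_cutoff(max_segment_bytes, jitter_fraction, rng):
    """Live-size cutoff: half the max segment size, nudged by a random
    relative amount in [-jitter_fraction, +jitter_fraction]."""
    base = max_segment_bytes / 2
    jitter = rng.uniform(-jitter_fraction, jitter_fraction)
    return base * (1.0 + jitter)


def is_merge_eligible(live_bytes, max_segment_bytes,
                      jitter_fraction=0.0, rng=None):
    """Hypothetical eligibility check: a segment is considered for merging
    when its live (non-deleted) size is at or below the jittered cutoff."""
    rng = rng or random.Random()
    cutoff = eligibility_cutoff(max_segment_bytes, jitter_fraction, rng)
    return live_bytes <= cutoff


# Usage: a segment whose live data is 51% of a 5G max segment size.
MAX_SEGMENT_BYTES = 5 * 1024**3
stuck_live_bytes = int(0.51 * MAX_SEGMENT_BYTES)

# With no jitter the segment is never eligible.
never = is_merge_eligible(stuck_live_bytes, MAX_SEGMENT_BYTES)

# With 10% jitter, repeated checks will sometimes find it eligible.
rng = random.Random()
sometimes = [is_merge_eligible(stuck_live_bytes, MAX_SEGMENT_BYTES, 0.1, rng)
             for _ in range(20)]
```

The jitter only helps segments within jitter_fraction of the cutoff, so it would not rescue the pathological case where every segment is far above it. It is meant for the "stuck right at the edge" situation.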

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 28, 2017, at 12:44 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> And one more thought (not very well thought out).
>
> A parameter on TMP (or whatever) that did <3> something like:
>> a parameter <autoCompactTime>
>> a parameter <autoCompactPct>
>> On startup TMP takes the current timestamp
> *> Every minute (or whatever) it checks the current timestamp and if
> <autoCompactTime> is in between the last check time and now, do <2>.
>> set the last checked time to the value from * above.
>
> Taking the current timestamp would keep from kicking off the compaction
> on startup, so we wouldn't need to keep some stateful information
> across restarts and wouldn't go into a compact cycle on startup.
>
> Erick
>
> On Sun, Aug 27, 2017 at 11:31 AM, Erick Erickson
> <erickerick...@gmail.com> wrote:
>> I've been thinking about this a little more. Since this is an outlier,
>> I'm loath to change the core TMP merge selection process. Say the max
>> segment size is 5G. You'd be doing an awful lot of I/O to merge a
>> segment with 4.75G "live" docs with one with 0.25G. Plus that doesn't
>> really allow users who issue the tempting "optimize" command to
>> recover; that one huge segment can hang around for a _very_ long time,
>> accumulating lots of deleted docs. Same with expungeDeletes.
>>
>> I can think of several approaches:
>>
>> 1> despite my comment above, a flag that says something like "if a
>> segment has > X% deleted docs, merge it with a smaller segment anyway
>> respecting the max segment size. I know, I know this will affect
>> indexing throughput, do it anyway".
>>
>> 2> A special op (or perhaps a flag on expungeDeletes) that would
>> behave like <1> but on-demand rather than part of standard merging.
>>
>> In both of these cases, if a segment had > X% deleted docs but the
>> live doc size for that segment was > the max seg size, rewrite it into
>> a single new segment removing deleted docs.
>>
>> 3> some way to do the above on a schedule. My notion is something like
>> a maintenance window at 1:00 AM. You'd still have a live collection,
>> but (presumably) a way to purge the day's accumulation of deleted
>> documents during off hours.
>>
>> 4> ???
>>
>> I probably like <2> best so far. I don't see this condition in the
>> wild very often; it usually occurs during heavy re-indexing operations
>> and often after an optimize or expungeDeletes has happened. <1> could
>> get horribly pathological if the threshold was 1% or something.
>>
>> WDYT?
>>
>>
>> On Wed, Aug 9, 2017 at 2:40 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>> Thanks Mike:
>>>
>>> bq: Or are you saying that each segment's 20% of not-deleted docs is
>>> still greater than 1/2 of the max segment size, and so TMP considers
>>> them ineligible?
>>>
>>> Exactly.
>>>
>>> Hadn't seen the blog, thanks for that. Added to my list of things to refer
>>> to.
>>>
>>> The problem we're seeing is that "in the wild" there are cases where
>>> people can now get satisfactory performance from huge numbers of
>>> documents, as in close to 2B (there was a question on the user's list
>>> about that recently). So allowing up to 60% deleted documents is
>>> dangerous in that situation.
>>>
>>> And the situation is exacerbated by optimizing (I know, "don't do that").
>>>
>>> Ah, well, the joys of people using this open source thing and pushing
>>> its limits.
>>>
>>> Thanks again,
>>> Erick
>>>
>>> On Tue, Aug 8, 2017 at 3:49 PM, Michael McCandless
>>> <luc...@mikemccandless.com> wrote:
>>>> Hi Erick,
>>>>
>>>> Some questions/answers below:
>>>>
>>>> On Sun, Aug 6, 2017 at 8:22 PM, Erick Erickson <erickerick...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Particularly interested if Mr. McCandless has any opinions here.
>>>>>
>>>>> I admit it took some work, but I can create an index that never merges
>>>>> and is 80% deleted documents using TieredMergePolicy.
>>>>>
>>>>> I'm trying to understand how indexes "in the wild" can have > 30%
>>>>> deleted documents. I think the root issue here is that
>>>>> TieredMergePolicy doesn't consider for merging any segments > 50% of
>>>>> maxMergedSegmentMB of non-deleted documents.
>>>>>
>>>>> Let's say I have segments at the default 5G max. For the sake of
>>>>> argument, it takes exactly 5,000,000 identically-sized documents to
>>>>> fill the segment to exactly 5G.
>>>>>
>>>>> IIUC, as long as the segment has more than 2,500,000 documents in it
>>>>> it'll never be eligible for merging.
>>>>
>>>>
>>>> That's right.
>>>>
>>>>>
>>>>> The only way to force deleted
>>>>> docs to be purged is to expungeDeletes or optimize, neither of which
>>>>> is recommended.
>>>>
>>>>
>>>> +1
>>>>
>>>>> The condition I created was highly artificial but illustrative:
>>>>> - I set my max segment size to 20M
>>>>> - Through experimentation I found that each segment would hold roughly
>>>>> 160K synthetic docs.
>>>>> - I set my ramBuffer to 1G.
>>>>> - Then I'd index 500K docs, then delete 400K of them, and commit. This
>>>>> produces a single segment occupying (roughly) 80M of disk space, 15M
>>>>> or so of it "live" documents the rest deleted.
>>>>> - rinse, repeat with a disjoint set of doc IDs.
>>>>>
>>>>> The number of segments continues to grow forever, each one consisting
>>>>> of 80% deleted documents.
>>>>
>>>>
>>>> But wouldn't TMP at some point merge these segments? Or are you saying
>>>> that each segment's 20% of not-deleted docs is still greater than 1/2
>>>> of the max segment size, and so TMP considers them ineligible?
>>>>
>>>> This is indeed a rather pathological case, and you're right TMP would never
>>>> merge them (if my logic above is right).
>>>> Maybe we could tweak TMP for situations like this, though I'm not
>>>> sure they happen in practice. Normally the max segment size is quite
>>>> a bit larger than the initially flushed segment sizes.
>>>>
>>>>>
>>>>> This artificial situation just allowed me to see how the segments
>>>>> merged. Without such artificial constraints I suspect the limit for
>>>>> deleted documents would be capped at 50% theoretically and in practice
>>>>> less than that although I have seen 35% or so deleted documents in the
>>>>> wild.
>>>>
>>>>
>>>> Yeah I think so too. I wrote this blog post about deletions:
>>>> https://www.elastic.co/blog/lucenes-handling-of-deleted-documents
>>>>
>>>> It has a fun chart showing how the % deleted docs bounces around.
>>>>
>>>>>
>>>>> So at the end of the day I have a couple of questions:
>>>>>
>>>>> 1> Is my understanding close to correct? This is really the first time
>>>>> I've had to dive into the guts of merging.
>>>>
>>>>
>>>> Yes!
>>>>
>>>>>
>>>>> 2> Is there a way I've missed to slim down an index other than
>>>>> expungeDeletes or optimize/forceMerge?
>>>>
>>>>
>>>> No.
>>>>
>>>>> It seems to me like eventually, with large indexes, every segment that
>>>>> is the max size allowed is going to have to go over 50% deletes before
>>>>> being merged and there will have to be at least two of them. I don't
>>>>> see a clean way to fix this, any algorithm would likely be far too
>>>>> expensive to be part of regular merging. I suppose we could merge
>>>>> segments of different sizes if the combined size was < max segment
>>>>> size. On a quick glance it doesn't seem like the log merge policies
>>>>> address this kind of case either, but haven't dug into them much.
>>>>
>>>>
>>>> TMP should be able to merge one max sized segment (that has eked just over
>>>> 50% deleted docs) with smaller sized segments. It would not prefer this
>>>> merge, since merging substantially different segment sizes is poor
>>>> performance vs.
>>>> merging equally sized segments, but it does have a bias for
>>>> removing deleted docs that would offset that.
>>>>
>>>>>
>>>>> Thanks!
>>>>
>>>>
>>>> You're welcome!
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>