If this happens in a precise zone, how about adding some random jitter to the 
threshold? That tends to get this kind of lock-up unstuck.
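
A rough sketch of the jitter idea, not TMP's actual code. It assumes the eligibility test is "live size <= 50% of the max segment size", as described further down the thread. The function names, the +/-10% jitter width, and the injectable rng are assumptions made up for illustration.

```python
import random


def jittered_threshold(base_fraction, jitter, rng):
    """Return base_fraction shifted by a uniform random amount in [-jitter, +jitter]."""
    return base_fraction + rng.uniform(-jitter, jitter)


def is_merge_eligible(live_bytes, max_segment_bytes,
                      base_fraction=0.5, jitter=0.1, rng=None):
    """Hypothetical eligibility check: a segment qualifies when its live
    (non-deleted) size is at or below a jittered fraction of the max size."""
    rng = rng or random.Random()
    threshold = jittered_threshold(base_fraction, jitter, rng)
    return live_bytes <= threshold * max_segment_bytes
```

With a fixed 50% cutoff, a segment sitting at 52% live is stuck until more deletes arrive. With jitter, it becomes eligible on some fraction of the checks, while segments far from the cutoff behave as before.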

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 28, 2017, at 12:44 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> And one more thought (not very well thought out).
> 
> A parameter on TMP (or whatever) that did <3>, something like:
> - a parameter <autoCompactTime>
> - a parameter <autoCompactPct>
> - On startup TMP takes the current timestamp.
> - (*) Every minute (or whatever) it checks the current timestamp and, if
>   <autoCompactTime> is in between the last check time and now, does <2>.
> - Set the last checked time to the timestamp taken in (*) above.
> 
> Taking the current timestamp would keep from kicking off the compaction
> on startup, so we wouldn't need to keep some stateful information
> across restarts and wouldn't go into a compact cycle on startup.
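
The periodic check described above could be sketched like this. It is an illustration under assumptions, not a proposed patch: CompactionScheduler, tick, and next_occurrence are hypothetical names, <autoCompactTime> is treated as a time of day, and <autoCompactPct> is left out.

```python
from datetime import datetime, time, timedelta


def next_occurrence(after, time_of_day):
    """First datetime strictly after `after` that falls on `time_of_day`."""
    candidate = datetime.combine(after.date(), time_of_day)
    if candidate <= after:
        candidate += timedelta(days=1)
    return candidate


class CompactionScheduler:
    """Hypothetical stand-in for the check TMP would run every minute or so."""

    def __init__(self, auto_compact_time, started_at):
        self.auto_compact_time = auto_compact_time
        # Seeding with the startup time means a restart never triggers a
        # compaction by itself, and no state has to survive restarts.
        self.last_check = started_at

    def tick(self, now):
        """Return True if auto_compact_time fell in (last_check, now]."""
        due = next_occurrence(self.last_check, self.auto_compact_time) <= now
        self.last_check = now
        return due
```

Starting at 00:59:30 with a 1:00 AM compact time, the first tick at 01:00:30 reports the compaction as due and later ticks that night do not. Starting at 01:00:30 reports nothing until the following night.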
> 
> Erick
> 
> On Sun, Aug 27, 2017 at 11:31 AM, Erick Erickson
> <erickerick...@gmail.com> wrote:
>> I've been thinking about this a little more. Since this is an outlier,
>> I'm loath to change the core TMP merge selection process. Say the max
>> segment size is 5G. You'd be doing an awful lot of I/O to merge a
>> segment with 4.75G "live" docs with one with 0.25G. Plus that doesn't
>> really allow users who issue the tempting "optimize" command to
>> recover; that one huge segment can hang around for a _very_ long time,
>> accumulating lots of deleted docs. Same with expungeDeletes.
>> 
>> I can think of several approaches:
>> 
>> 1> despite my comment above, a flag that says something like "if a
>> segment has > X% deleted docs, merge it with a smaller segment anyway
>> respecting the max segment size. I know, I know this will affect
>> indexing throughput, do it anyway".
>> 
>> 2> A special op (or perhaps a flag on expungeDeletes) that would
>> behave like <1> but on-demand rather than part of standard merging.
>> 
>> In both of these cases, if a segment had > X% deleted docs but the
>> live doc size for that segment was > the max seg size, rewrite it into
>> a single new segment removing deleted docs.
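
To make <1>/<2> and the rewrite rule concrete, here is a minimal sketch of one possible selection pass. Everything in it is an assumption for illustration: the Segment type, the greedy smallest-first packing, and the plan_deletes_merges name are invented, sizes are in arbitrary units, and none of this is how TMP actually scores merges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str
    total_bytes: int  # size on disk, deleted docs included (assumed > 0)
    live_bytes: int   # size of the non-deleted docs only

    @property
    def deleted_pct(self):
        return 100.0 * (self.total_bytes - self.live_bytes) / self.total_bytes


def plan_deletes_merges(segments, max_segment_bytes, deleted_pct_threshold):
    """Return a list of (action, [segment names]) for segments over the threshold."""
    plans, used = [], set()
    # Visit the segments with the most live data first.
    for seg in sorted(segments, key=lambda s: s.live_bytes, reverse=True):
        if seg.name in used or seg.deleted_pct <= deleted_pct_threshold:
            continue
        used.add(seg.name)
        if seg.live_bytes > max_segment_bytes:
            # Too big to combine with anything: rewrite it alone, dropping deletes.
            plans.append(("rewrite", [seg.name]))
            continue
        group, budget = [seg.name], max_segment_bytes - seg.live_bytes
        # Greedily add smaller segments while the result respects the max size.
        for other in sorted(segments, key=lambda s: s.live_bytes):
            if other.name not in used and other.live_bytes <= budget:
                group.append(other.name)
                used.add(other.name)
                budget -= other.live_bytes
        plans.append(("merge" if len(group) > 1 else "rewrite", group))
    return plans
```

A very low threshold would make nearly every segment a candidate on every pass, which is the pathological behavior that makes <1> worrying as part of standard merging.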
>> 
>> 3> some way to do the above on a schedule. My notion is something like
>> a maintenance window at 1:00 AM. You'd still have a live collection,
>> but (presumably) a way to purge the day's accumulation of deleted
>> documents during off hours.
>> 
>> 4> ???
>> 
>> I probably like <2> best so far. I don't see this condition in the
>> wild very often; it usually occurs during heavy re-indexing operations
>> and often after an optimize or expungeDeletes has happened. <1> could
>> get horribly pathological if the threshold was 1% or something.
>> 
>> WDYT?
>> 
>> 
>> On Wed, Aug 9, 2017 at 2:40 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>> Thanks Mike:
>>> 
>>> bq: Or are you saying that each segment's 20% of not-deleted docs is
>>> still greater than 1/2 of the max segment size, and so TMP considers
>>> them ineligible?
>>> 
>>> Exactly.
>>> 
>>> Hadn't seen the blog, thanks for that. Added to my list of things to
>>> refer to.
>>> 
>>> The problem we're seeing is that "in the wild" there are cases where
>>> people can now get satisfactory performance from huge numbers of
>>> documents, as in close to 2B (there was a question on the users' list
>>> about that recently). So allowing up to 60% deleted documents is
>>> dangerous in that situation.
>>> 
>>> And the situation is exacerbated by optimizing (I know, "don't do that").
>>> 
>>> Ah, well, the joys of people using this open source thing and pushing
>>> its limits.
>>> 
>>> Thanks again,
>>> Erick
>>> 
>>> On Tue, Aug 8, 2017 at 3:49 PM, Michael McCandless
>>> <luc...@mikemccandless.com> wrote:
>>>> Hi Erick,
>>>> 
>>>> Some questions/answers below:
>>>> 
>>>> On Sun, Aug 6, 2017 at 8:22 PM, Erick Erickson <erickerick...@gmail.com>
>>>> wrote:
>>>>> 
>>>>> Particularly interested if Mr. McCandless has any opinions here.
>>>>> 
>>>>> I admit it took some work, but I can create an index that never merges
>>>>> and is 80% deleted documents using TieredMergePolicy.
>>>>> 
>>>>> I'm trying to understand how indexes "in the wild" can have > 30%
>>>>> deleted documents. I think the root issue here is that
>>>>> TieredMergePolicy doesn't consider for merging any segments > 50% of
>>>>> maxMergedSegmentMB of non-deleted documents.
>>>>> 
>>>>> Let's say I have segments at the default 5G max. For the sake of
>>>>> argument, it takes exactly 5,000,000 identically-sized documents to
>>>>> fill the segment to exactly 5G.
>>>>> 
>>>>> IIUC, as long as the segment has more than 2,500,000 documents in it
>>>>> it'll never be eligible for merging.
>>>> 
>>>> 
>>>> That's right.
>>>> 
>>>>> 
>>>>> The only way to force deleted
>>>>> docs to be purged is to expungeDeletes or optimize, neither of which
>>>>> is recommended.
>>>> 
>>>> 
>>>> +1
>>>> 
>>>>> The condition I created was highly artificial but illustrative:
>>>>> - I set my max segment size to 20M
>>>>> - Through experimentation I found that each segment would hold roughly
>>>>> 160K synthetic docs.
>>>>> - I set my ramBuffer to 1G.
>>>>> - Then I'd index 500K docs, then delete 400K of them, and commit. This
>>>>> produces a single segment occupying (roughly) 80M of disk space, 15M
>>>>> or so of it "live" documents, the rest deleted.
>>>>> - rinse, repeat with a disjoint set of doc IDs.
>>>>> 
>>>>> The number of segments continues to grow forever, each one consisting
>>>>> of 80% deleted documents.
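
The arithmetic behind this experiment can be checked with a small simulation. It is only a sketch: it assumes identically sized docs (which gives 16M live per segment rather than the measured "15M or so") and the "live size <= 50% of max segment size" eligibility rule discussed in this thread. The function names are made up.

```python
MB = 1024 * 1024


def eligible(live_bytes, max_segment_bytes):
    """Assumed rule: only segments at or below half the max size may merge."""
    return live_bytes <= max_segment_bytes // 2


def simulate(rounds, docs_indexed=500_000, docs_deleted=400_000,
             segment_bytes=80 * MB, max_segment_bytes=20 * MB):
    """Each round flushes one segment, as in the index/delete/commit loop above."""
    segments = []
    for _ in range(rounds):
        live_docs = docs_indexed - docs_deleted
        # Identically sized docs: live bytes scale with the live doc count.
        live_bytes = segment_bytes * live_docs // docs_indexed
        segments.append((docs_indexed, live_docs, live_bytes))
    eligible_count = sum(eligible(lb, max_segment_bytes) for _, _, lb in segments)
    total_docs = sum(d for d, _, _ in segments)
    live_docs_total = sum(l for _, l, _ in segments)
    deleted_pct = 100 * (total_docs - live_docs_total) / total_docs
    return len(segments), eligible_count, deleted_pct
```

Every segment carries about 16M of live data against a 10M cutoff, so none is ever eligible and the index stays at 80% deleted no matter how many rounds run.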
>>>> 
>>>> 
>>>> But wouldn't TMP at some point merge these segments?  Or are you saying
>>>> that each segment's 20% of not-deleted docs is still greater than 1/2 of
>>>> the max segment size, and so TMP considers them ineligible?
>>>> 
>>>> This is indeed a rather pathological case, and you're right TMP would never
>>>> merge them (if my logic above is right).  Maybe we could tweak TMP for
>>>> situations like this, though I'm not sure they happen in practice.
>>>> Normally the max segment size is quite a bit larger than the initially
>>>> flushed segment sizes.
>>>> 
>>>>> 
>>>>> This artificial situation just allowed me to see how the segments
>>>>> merged. Without such artificial constraints I suspect the limit for
>>>>> deleted documents would be capped at 50% theoretically, and in practice
>>>>> less than that, although I have seen 35% or so deleted documents in the
>>>>> wild.
>>>> 
>>>> 
>>>> Yeah I think so too.  I wrote this blog post about deletions:
>>>> https://www.elastic.co/blog/lucenes-handling-of-deleted-documents
>>>> 
>>>> It has a fun chart showing how the percentage of deleted docs bounces around.
>>>> 
>>>>> 
>>>>> So at the end of the day I have a couple of questions:
>>>>> 
>>>>> 1> Is my understanding close to correct? This is really the first time
>>>>> I've had to dive into the guts of merging.
>>>> 
>>>> 
>>>> Yes!
>>>> 
>>>>> 
>>>>> 2> Is there a way I've missed to slim down an index other than
>>>>> expungeDeletes or optimize/forceMerge?
>>>> 
>>>> 
>>>> No.
>>>> 
>>>>> It seems to me like eventually, with large indexes, every segment that
>>>>> is the max size allowed is going to have to go over 50% deletes before
>>>>> being merged, and there will have to be at least two of them. I don't
>>>>> see a clean way to fix this; any algorithm would likely be far too
>>>>> expensive to be part of regular merging. I suppose we could merge
>>>>> segments of different sizes if the combined size was < max segment
>>>>> size. On a quick glance it doesn't seem like the log merge policies
>>>>> address this kind of case either, but haven't dug into them much.
>>>> 
>>>> 
>>>> TMP should be able to merge one max sized segment (that has eked just over
>>>> 50% deleted docs) with smaller sized segments.  It would not prefer this
>>>> merge, since merging substantially different segment sizes is poor
>>>> performance vs. merging equally sized segments, but it does have a bias for
>>>> removing deleted docs that would offset that.
>>>> 
>>>>> 
>>>>> Thanks!
>>>> 
>>>> 
>>>> You're welcome!
>>>> 
>>>> Mike McCandless
>>>> 
>>>> http://blog.mikemccandless.com
>>>> 
> 
> 
