And one more thought (not very well thought out). A parameter on TMP
(or whatever) that did <3>, something like:

> a parameter <autoCompactTime>
> a parameter <autoCompactPct>
> On startup TMP takes the current timestamp (*)
> Every minute (or whatever) it checks the current timestamp, and if
>   <autoCompactTime> falls between the last check time and now, do <2>.
> Set the last checked time to the value from (*) above.
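Very rough sketch of the bookkeeping I have in mind. To be clear, the
parameter names (autoCompactTime, autoCompactPct) are invented here and
nothing like this exists in TMP today; completely untested:

import java.time.LocalDateTime;
import java.time.LocalTime;

class AutoCompactClock {
  private final LocalTime autoCompactTime; // e.g. 01:00 for a 1 AM window
  private LocalDateTime lastCheck = LocalDateTime.now(); // taken at startup (*)

  AutoCompactClock(LocalTime autoCompactTime) {
    this.autoCompactTime = autoCompactTime;
  }

  // Called every minute (or whatever). Fires at most once per day: the
  // first check where autoCompactTime falls between lastCheck and now.
  // <autoCompactPct> would then be handed to the <2>-style purge as the
  // deleted-docs threshold. (A check gap that spans midnight could skip
  // a trigger; good enough for a sketch.)
  synchronized boolean shouldCompactNow() {
    LocalDateTime now = LocalDateTime.now();
    LocalDateTime trigger = now.toLocalDate().atTime(autoCompactTime);
    boolean fire = trigger.isAfter(lastCheck) && !trigger.isAfter(now);
    lastCheck = now; // the value from (*) above, refreshed each check
    return fire;
  }
}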
Taking the current timestamp on startup would keep from kicking off the
compaction right away, so we wouldn't need to keep any stateful
information across restarts and wouldn't go into a compact cycle on
startup.

Erick

On Sun, Aug 27, 2017 at 11:31 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> I've been thinking about this a little more. Since this is an outlier,
> I'm loathe to change the core TMP merge selection process. Say the max
> segment size is 5G. You'd be doing an awful lot of I/O to merge a
> segment with 4.75G of "live" docs with one with 0.25G. Plus that
> doesn't really allow users who issue the tempting "optimize" command
> to recover; that one huge segment can hang around for a _very_ long
> time, accumulating lots of deleted docs. Same with expungeDeletes.
>
> I can think of several approaches:
>
> 1> despite my comment above, a flag that says something like "if a
> segment has > X% deleted docs, merge it with a smaller segment anyway,
> respecting the max segment size. I know, I know this will affect
> indexing throughput, do it anyway".
>
> 2> a special op (or perhaps a flag on expungeDeletes) that would
> behave like <1> but on-demand rather than as part of standard merging.
>
> In both of these cases, if a segment had > X% deleted docs but the
> live doc size for that segment was > the max seg size, rewrite it into
> a single new segment, removing deleted docs.
>
> 3> some way to do the above on a schedule. My notion is something like
> a maintenance window at 1:00 AM. You'd still have a live collection,
> but (presumably) a way to purge the day's accumulation of deleted
> documents during off hours.
>
> 4> ???
>
> I probably like <2> best so far. I don't see this condition in the
> wild very often; it usually occurs during heavy re-indexing
> operations, and often after an optimize or expungeDeletes has
> happened. <1> could get horribly pathological if the threshold was 1%
> or something.
>
> WDYT?
>
>
> On Wed, Aug 9, 2017 at 2:40 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> Thanks Mike:
>>
>> bq: Or are you saying that each segment's 20% of not-deleted docs is
>> still greater than 1/2 of the max segment size, and so TMP considers
>> them ineligible?
>>
>> Exactly.
>>
>> Hadn't seen the blog, thanks for that. Added to my list of things to
>> refer to.
>>
>> The problem we're seeing is that "in the wild" there are cases where
>> people can now get satisfactory performance from huge numbers of
>> documents, as in close to 2B (there was a question on the user's list
>> about that recently). So allowing up to 60% deleted documents is
>> dangerous in that situation.
>>
>> And the situation is exacerbated by optimizing (I know, "don't do
>> that").
>>
>> Ah, well, the joys of people using this open source thing and pushing
>> its limits.
>>
>> Thanks again,
>> Erick
>>
>> On Tue, Aug 8, 2017 at 3:49 PM, Michael McCandless
>> <luc...@mikemccandless.com> wrote:
>>> Hi Erick,
>>>
>>> Some questions/answers below:
>>>
>>> On Sun, Aug 6, 2017 at 8:22 PM, Erick Erickson
>>> <erickerick...@gmail.com> wrote:
>>>>
>>>> Particularly interested if Mr. McCandless has any opinions here.
>>>>
>>>> I admit it took some work, but I can create an index that never
>>>> merges and is 80% deleted documents using TieredMergePolicy.
>>>>
>>>> I'm trying to understand how indexes "in the wild" can have > 30%
>>>> deleted documents. I think the root issue here is that
>>>> TieredMergePolicy doesn't consider for merging any segment with
>>>> > 50% of maxMergedSegmentMB of non-deleted documents.
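(Interjecting an aside into the quoted thread: to watch that rule play
out on a real index, a quick report of per-segment live/deleted counts
via the leaf readers does the job. Untested as pasted, and the index
path comes from the command line:)

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.FSDirectory;

public class DeletedPctReport {
  public static void main(String[] args) throws Exception {
    try (DirectoryReader reader =
             DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
      for (LeafReaderContext ctx : reader.leaves()) {
        int maxDoc = ctx.reader().maxDoc();   // live + deleted
        int numDocs = ctx.reader().numDocs(); // live only
        System.out.printf("%s: %d/%d live (%.1f%% deleted)%n",
            ctx.reader(), numDocs, maxDoc,
            100.0 * (maxDoc - numDocs) / maxDoc);
      }
    }
  }
}

Segments that sit in that report forever with a huge deleted percentage
are exactly the ones the 50% rule has made ineligible.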
>>>>
>>>> Let's say I have segments at the default 5G max. For the sake of
>>>> argument, it takes exactly 5,000,000 identically-sized documents to
>>>> fill the segment to exactly 5G.
>>>>
>>>> IIUC, as long as the segment has more than 2,500,000 live documents
>>>> in it, it'll never be eligible for merging.
>>>
>>> That's right.
>>>
>>>> The only way to force deleted docs to be purged is to
>>>> expungeDeletes or optimize, neither of which is recommended.
>>>
>>> +1
>>>
>>>> The condition I created was highly artificial but illustrative:
>>>> - I set my max segment size to 20M.
>>>> - Through experimentation I found that each segment would hold
>>>>   roughly 160K synthetic docs.
>>>> - I set my ramBuffer to 1G.
>>>> - Then I'd index 500K docs, then delete 400K of them, and commit.
>>>>   This produces a single segment occupying (roughly) 80M of disk
>>>>   space, 15M or so of it "live" documents, the rest deleted.
>>>> - Rinse, repeat with a disjoint set of doc IDs.
>>>>
>>>> The number of segments continues to grow forever, each one
>>>> consisting of 80% deleted documents.
>>>
>>> But wouldn't TMP at some point merge these segments? Or are you
>>> saying that each segment's 20% of not-deleted docs is still greater
>>> than 1/2 of the max segment size, and so TMP considers them
>>> ineligible?
>>>
>>> This is indeed a rather pathological case, and you're right, TMP
>>> would never merge them (if my logic above is right). Maybe we could
>>> tweak TMP for situations like this, though I'm not sure they happen
>>> in practice. Normally the max segment size is quite a bit larger
>>> than the initially flushed segment sizes.
>>>
>>>> This artificial situation just allowed me to see how the segments
>>>> merged. Without such artificial constraints, I suspect the limit
>>>> for deleted documents would be capped at 50% theoretically, and in
>>>> practice less than that, although I have seen 35% or so deleted
>>>> documents in the wild.
>>>
>>> Yeah, I think so too. I wrote this blog post about deletions:
>>> https://www.elastic.co/blog/lucenes-handling-of-deleted-documents
>>>
>>> It has a fun chart showing how the percentage of deleted docs
>>> bounces around.
>>>
>>>> So at the end of the day I have a couple of questions:
>>>>
>>>> 1> Is my understanding close to correct? This is really the first
>>>> time I've had to dive into the guts of merging.
>>>
>>> Yes!
>>>
>>>> 2> Is there a way I've missed to slim down an index other than
>>>> expungeDeletes or optimize/forceMerge?
>>>
>>> No.
>>>
>>>> It seems to me like eventually, with large indexes, every segment
>>>> that is the max size allowed is going to have to go over 50%
>>>> deletes before being merged, and there will have to be at least two
>>>> of them. I don't see a clean way to fix this; any algorithm would
>>>> likely be far too expensive to be part of regular merging. I
>>>> suppose we could merge segments of different sizes if the combined
>>>> size was < max segment size. On a quick glance it doesn't seem like
>>>> the log merge policies address this kind of case either, but I
>>>> haven't dug into them much.
>>>
>>> TMP should be able to merge one max-sized segment (that has eked
>>> just over 50% deleted docs) with smaller segments. It would not
>>> prefer this merge, since merging substantially different segment
>>> sizes is poor performance vs. merging equally sized segments, but it
>>> does have a bias for removing deleted docs that would offset that.
>>>
>>>> Thanks!
>>>
>>> You're welcome!
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
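Popping back out of the quoted thread: the artificial setup Erick
describes above (20M max segment, index 500K docs, delete 400K, commit,
repeat with disjoint IDs) looks roughly like this as a harness. A
sketch only, untested as pasted; the index path and field names are
placeholders, and the docs would need more body text to hit the exact
sizes quoted above:

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.FSDirectory;

public class DeletePileupRepro {
  public static void main(String[] args) throws Exception {
    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setMaxMergedSegmentMB(20); // tiny max segment so the test runs fast
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer())
        .setRAMBufferSizeMB(1024)  // the 1G ramBuffer from the recipe
        .setMergePolicy(tmp);
    try (IndexWriter w =
             new IndexWriter(FSDirectory.open(Paths.get("repro-index")), iwc)) {
      for (int pass = 0; pass < 20; pass++) { // disjoint doc IDs per pass
        int base = pass * 500_000;
        for (int i = 0; i < 500_000; i++) {
          Document doc = new Document();
          doc.add(new StringField("id", Integer.toString(base + i), Field.Store.NO));
          doc.add(new TextField("body", "synthetic filler " + i, Field.Store.NO));
          w.addDocument(doc);
        }
        for (int i = 0; i < 400_000; i++) { // delete 80% of this pass
          w.deleteDocuments(new Term("id", Integer.toString(base + i)));
        }
        w.commit(); // flushes one segment that is ~80% deleted docs
      }
    }
  }
}

Every flushed segment's live size lands above half the max segment
size, so per the rule discussed above, none of them is ever picked for
a natural merge and the segment count just grows.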