liuxiaohui1221 commented on pull request #10861:
URL: https://github.com/apache/druid/pull/10861#issuecomment-774811930

> > Hi @liuxiaohui1221, thank you for your contribution. I'm wondering how this change can reduce the compaction task failures due to lock contention. Here is what should happen when two or more tasks try to lock overlapping intervals.
> >
> > ```
> > * A high priority task is submitted while a low priority task is running. In this case, the high priority task revokes the lock of the low priority task. The low priority task stops with the `FAILED` state.
> >
> > * A low or equal priority task is submitted while a high or equal priority task is running. In this case, the second task waits (in the `WAITING` state) until the first task releases the lock.
> > ```
> >
> > Compaction tasks can fail in the first case above when there is lock contention. However, the new overlord API (`getNonLockIntervalSnapshots()`) can be useful only for the second case by preventing the coordinator from submitting compaction tasks that can lead to lock contention. The `skipOffsetFromLatest` in the auto compaction config should be enough to avoid both cases unless data frequently arrives late. Or am I missing something?
>
> @jihoonson Thank you for your comments. Yes, our data frequently arrives late. We collected statistics, and the specific scenario is as follows: around `90%` of the data arrives on the same day, but the remaining `10%` is distributed over the last 3 months, so I would need to set `skipOffsetFromLatest` large enough to avoid frequent compaction task failures.
>
> I want the compaction of many small files to succeed as often as possible. For this reason, I also want to add an iteration strategy different from `NewestSegmentFirstIterator`: `HighScoreSegmentFirstIterator`. It considers three factors (interval, segmentsNum, and segmentsSize) to calculate a score: first, use min-max normalization to rescale these columns, then compute `intervalScore = w1 * intervalEndTime + w2 * segmentsSize + w3 * segmentsNum`. Intervals with a higher score get their compaction tasks submitted first.

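To make the proposed scoring concrete, here is a minimal sketch in Python. It is only an illustration of the formula quoted above, not code from this PR: `IntervalStats`, `min_max`, `rank_intervals`, and the default weights are all hypothetical, and an actual `HighScoreSegmentFirstIterator` would be written in Java against Druid's own types.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class IntervalStats:
    """Hypothetical per-interval summary; not a Druid class."""
    name: str
    end_time_ms: int    # interval end time (epoch millis)
    segments_size: int  # total size of the segments in the interval
    segments_num: int   # number of segments in the interval


def min_max(values: List[float]) -> List[float]:
    """Min-max normalization to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_intervals(
    intervals: List[IntervalStats],
    w1: float = 0.5,   # weight for intervalEndTime (assumed value)
    w2: float = 0.25,  # weight for segmentsSize (assumed value)
    w3: float = 0.25,  # weight for segmentsNum (assumed value)
) -> List[Tuple[float, IntervalStats]]:
    """Return (score, interval) pairs, highest score first."""
    if not intervals:
        return []
    ends = min_max([i.end_time_ms for i in intervals])
    sizes = min_max([i.segments_size for i in intervals])
    nums = min_max([i.segments_num for i in intervals])
    scored = [
        (w1 * e + w2 * s + w3 * n, i)
        for e, s, n, i in zip(ends, sizes, nums, intervals)
    ]
    # Intervals with a higher score would be compacted first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

With these assumed weights, an older interval that holds many segments can outrank the newest interval, which is the difference from a purely newest-first ordering. The weights themselves would need tuning against real segment statistics.
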
In fact, about 90% of the data arrives in real time. Of the remainder, the vast majority arrives within the last month, and very little arrives a month or more late.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
