paul-rogers commented on PR #13852: URL: https://github.com/apache/druid/pull/13852#issuecomment-1455001687
I’m a newbie to this code, so I took a deep dive to understand how it works. The compaction scheduling algorithm is rather crude: it isn’t well designed for a large, busy, distributed system. @suneet-s ’s fix is a worthwhile short-term fix, but this code really needs a redesign along the lines outlined by @imply-cheddar and @kfaraz.

For anyone else new to this area, here’s an overview of the logic. Please forgive any newbie misunderstandings. At the highest level:

* The coordinator provides a set of duties which are called one after another on a schedule.
* Duties have Guice-injected global state, but no local state across calls.
* Each duty is called from the main duty loop and thus must complete quickly.
* The compaction scheduler is a duty. As such, each call is independent of any previous call, and the scheduler must make its decisions quickly.

The challenge, then, is to design the compaction scheduler to be stateless and fast. On each call it:

* Determines the number of available slots, accounting for running tasks.
* Creates a big iterator over all datasources with compaction specs and their segments, sorted by time chunk, minus any segments locked by existing compaction or ingestion tasks.
* Launches a set of compaction tasks from the head of the iterator until the slots are full.

In short, each call to the scheduler duty builds a large list of candidate actions: “an iterator based on the latest segment metadata available”. The scheduler then starts working down that list from the head, stopping when the worker slots are filled. If “the compact tasks that were scheduled ran into any issues, like task lock contention, or an interval which can not be compacted because of a bug”, then the next invocation of the scheduler will launch new tasks for those same datasources and time chunks. It will do so because the scheduler maintains no state that would tell it that those very time chunks just failed. The new tasks may also fail.
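To make the failure mode concrete, here is a minimal sketch of one stateless scheduler pass as described above. This is not Druid's actual `CompactSegments` code; all names (`Candidate`, `runOnePass`, the slot arithmetic) are illustrative assumptions:

```java
import java.util.Iterator;
import java.util.List;

class CompactionPassSketch {
    /** One candidate time chunk for a datasource (illustrative only). */
    record Candidate(String dataSource, String interval) {}

    /**
     * One duty invocation: compute free slots, rebuild the candidate
     * iterator from scratch, and launch tasks from its head.
     */
    static int runOnePass(int totalSlots, int runningTasks, List<Candidate> candidates) {
        int available = Math.max(0, totalSlots - runningTasks);
        Iterator<Candidate> it = candidates.iterator(); // rebuilt on every call
        int launched = 0;
        while (launched < available && it.hasNext()) {
            Candidate c = it.next();
            // A real scheduler would submit a compaction task for c here.
            // If that task later fails, nothing records the failure, so the
            // next invocation rebuilds the iterator and retries the same head.
            launched++;
        }
        return launched;
    }
}
```

Because the iterator is rebuilt from the latest metadata on every pass, a chronically failing time chunk at the head of the list is retried on every single invocation, which is the "stuck" behavior described above.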
The result is “auto-compaction would be stuck on the cluster.”

The design works well if the number of datasources and segments is small relative to the number of available compaction tasks. That assumption is not valid on larger systems. The design also assumes that new ingestion tasks will seldom cancel compaction tasks, so that a start-from-the-top approach will make progress. That assumption is not valid on a system with late-arriving data. Yet late-arriving data is a fact of life in many systems.

Again, this code needs a redesign. But that is a major project. So, what short-term solution can we apply instead? The obvious solution is to allow the list (the iterator) to persist, and to have the scheduler work its way through the entire iterator before starting over again. Doing so requires the scheduler to maintain state. Suneet's fix places the iterator state not on the scheduler itself, but rather on a `CompactionSegmentSearchPolicy` instance, which is a bit of a back-door solution: the coordinator duty does not maintain state, but a Guice-injected global dependency does.

The original code built the iterator anew on each scheduler invocation. The new code hedges its bets: it does not build the iterator on each call, but rather after some amount of time has passed. There has been much discussion about the configuration of this refresh period. No doubt it would be quite hard for anyone to come up with a good number.

Perhaps one simple solution is to omit the refresh period. Instead, the scheduler works its way through one entire list (iterator) before looping back to the start. Let's call this "one cycle". The full-cycle approach ensures that all time chunks have a shot at compaction. If a task fails during one cycle, that chunk will be tried again in the next cycle, perhaps after any active ingestion has completed. Failures in compaction will occur once per cycle, not once per scheduler invocation. Compaction will no longer become "stuck."
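The "one cycle" idea above could be sketched as follows. This is a hypothetical illustration, not Druid's actual `CompactionSegmentSearchPolicy` API: the scheduler keeps the iterator across invocations and rebuilds a fresh snapshot only once the current one is exhausted:

```java
import java.util.Iterator;
import java.util.function.Supplier;

class FullCycleSchedulerSketch {
    private final Supplier<Iterator<String>> snapshotFactory;
    private Iterator<String> current; // persists across duty invocations

    FullCycleSchedulerSketch(Supplier<Iterator<String>> snapshotFactory) {
        this.snapshotFactory = snapshotFactory;
    }

    /**
     * One duty invocation: drain up to availableSlots candidates from the
     * persisted iterator; start a new cycle only when the old one is done.
     */
    int runOnePass(int availableSlots) {
        if (current == null || !current.hasNext()) {
            current = snapshotFactory.get(); // begin a new cycle
        }
        int launched = 0;
        while (launched < availableSlots && current.hasNext()) {
            String candidate = current.next();
            // A real scheduler would submit a compaction task for candidate
            // here; a failing chunk is retried once per cycle, not once per
            // invocation, because the cursor keeps advancing.
            launched++;
        }
        return launched;
    }
}
```

With no refresh-period configuration at all, cycle length adapts naturally: a small cluster completes a cycle in one invocation, while a large one spreads a cycle over many invocations.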
A risk, of course, is that the information in the iterator becomes stale. The system is distributed, so it should already handle that case: race conditions are to be expected. Another risk is that new data arrives faster than a cycle can run. In that case, no scheduling algorithm can solve the problem: more compaction resources are the only answer. So, one question is: would the full-cycle idea work in practice?

The current PR is clearly an improvement. A redesign, as outlined by @imply-cheddar and @kfaraz, would be a project, made more of a challenge because tests in this area appear to be sparse. The code is essentially untestable without a cluster because it is non-modular and tightly coupled with other parts of the system. So, a second question is: can we accept this short-term fix, or would we rather wait indefinitely for someone to tackle the sorely needed redesign?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
