paul-rogers commented on PR #13852:
URL: https://github.com/apache/druid/pull/13852#issuecomment-1455001687

   I’m a newbie to this code, so I took a deep dive to understand how it works. 
The compaction scheduling algorithm is rather crude: it isn’t well designed for 
a large, busy, distributed system. @suneet-s's change is a worthwhile short-term 
fix, but this code really needs a redesign along the lines outlined by 
@imply-cheddar and @kfaraz.
   
   For anyone else new to this area, here’s an overview of the logic. Please 
forgive any newbie misunderstandings. At the highest level:
   
   * The coordinator provides a set of duties which are called one after 
another on a schedule.
   * Duties have Guice-injected global state, but no local state across calls.
   * Each duty is called from the main duty loop and thus must complete quickly.
   * The compaction scheduler is a duty. As such, each call is independent from 
any previous call, and the scheduler must make its decisions quickly.
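   To make the duty model concrete, here is a minimal sketch in Java. This is 
not the actual Druid API; the interface, record, and class names are all 
invented for illustration:

```java
import java.util.List;

// Hypothetical sketch of the coordinator duty pattern: a duty is a stateless
// callback invoked on a fixed schedule by the main duty loop.
interface CoordinatorDutySketch {
    // Each call is independent: the duty sees a fresh snapshot of cluster
    // state and must return quickly so later duties in the loop are not delayed.
    void run(ClusterSnapshot snapshot);
}

// Minimal stand-in for the cluster state a duty would examine.
record ClusterSnapshot(int availableTaskSlots) {}

class DutyLoop {
    private final List<CoordinatorDutySketch> duties;

    DutyLoop(List<CoordinatorDutySketch> duties) {
        this.duties = duties;
    }

    // One scheduled tick: run every duty in order against the same snapshot.
    void runOnce(ClusterSnapshot snapshot) {
        for (CoordinatorDutySketch duty : duties) {
            duty.run(snapshot);
        }
    }
}
```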
   
   The challenge, then, is to design the compaction scheduler to be stateless 
and fast. On each call it:
   
   * Determines the number of available slots, accounting for running tasks.
   * Creates a big iterator over all datasources with compaction specs and 
their segments sorted by time chunk, minus any segments locked by existing 
compaction or ingestion tasks.
   * Launches a set of compaction tasks from the head of the iterator until 
slots are full.
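   In sketch form (again with invented names, not the real classes), a single 
invocation amounts to something like this:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: one scheduler invocation works down the candidate
// iterator from the head until the available slots are full.
class CompactionRunSketch {
    // A candidate time chunk of one datasource, eligible for compaction.
    record Candidate(String datasource, String timeChunk) {}

    static List<Candidate> launch(Iterator<Candidate> candidates,
                                  int totalSlots,
                                  int runningTasks,
                                  Set<Candidate> locked) {
        // Available slots account for compaction tasks already running.
        int available = Math.max(0, totalSlots - runningTasks);
        List<Candidate> launched = new ArrayList<>();
        while (available > 0 && candidates.hasNext()) {
            Candidate c = candidates.next();
            if (locked.contains(c)) {
                continue; // skip chunks locked by existing compaction or ingestion tasks
            }
            launched.add(c); // stand-in for submitting a real compaction task
            available--;
        }
        return launched;
    }
}
```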
   
   In short, each call to the scheduler duty builds a large list of candidate 
actions: “an iterator based on the latest segment metadata available”. The 
scheduler then starts working down that list from the head, stopping when the 
worker slots are filled. If “the compact tasks that were scheduled ran into any 
issues, like task lock contention, or an interval which can not be compacted 
because of a bug” then the next invocation of scheduler will launch new tasks 
for those same datasources and time chunks. It will do so because the scheduler 
maintains no state that would tell it that those very time chunks just failed. 
The new tasks may also fail. The result is “auto-compaction would be stuck on 
the cluster.”
   
   The design works well if the number of datasources and segments is small 
relative to the number of available compaction task slots. That assumption does 
not hold on larger systems.
   
   The design also assumes that new ingestion tasks will seldom cancel 
compaction tasks, and hence that a start-from-the-top approach will make 
progress. That assumption is not valid on a system with late-arriving data. Yet 
late-arriving data is a fact of life in many systems.
   
   Again, this code needs a redesign. But, that is a major project. So, what 
short-term solution can we apply instead?
   
   The obvious solution is to allow the list (the iterator) to persist, and to 
have the scheduler work its way through the entire iterator before starting 
over again. Doing so requires the scheduler to maintain state.
   
   Suneet's fix places the iterator state not on the scheduler itself, but 
rather on a `CompactionSegmentSearchPolicy` instance, which is a bit of a 
back-door solution: the coordinator duty does not maintain state, but a 
Guice-injected global dependency does.
   
   The original code built the iterator anew on each scheduler invocation. The 
new code hedges its bets: it does not rebuild the iterator on each call, but 
only after a configured refresh period has elapsed. There has been much 
discussion about how to configure this refresh period, and no doubt it would be 
quite hard for anyone to come up with a good number.
   
   Perhaps one simple solution is to omit the refresh period. Instead, the 
scheduler works its way through one entire list (iterator) before looping back 
to the start. Let's call this "one cycle". The full-cycle approach ensures all 
time chunks have a shot at compaction. If a task fails during one cycle, that 
chunk will be tried again in the next cycle, perhaps after any active ingestion 
has completed. Failures in compaction will occur once per cycle, not once per 
scheduler invocation. Compaction will no longer become "stuck."
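   The full-cycle idea can be sketched in a few lines of Java. This is purely 
illustrative (the class name and structure are invented); the point is that the 
only state the scheduler keeps is the current iterator, rebuilt from fresh 
segment metadata when a cycle completes:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: keep the candidate iterator across duty invocations and
// rebuild it (from fresh segment metadata) only once it is exhausted, so every
// time chunk gets a shot at compaction before any failed chunk is retried.
class FullCycleSchedulerSketch<T> {
    private final Supplier<Iterator<T>> rebuild; // builds a fresh candidate iterator
    private Iterator<T> current;                 // the state carried between duty runs

    FullCycleSchedulerSketch(Supplier<Iterator<T>> rebuild) {
        this.rebuild = rebuild;
        this.current = rebuild.get();
    }

    // One duty invocation: take up to `slots` candidates; loop back to a fresh
    // iterator (the next cycle) only when the current one is finished.
    List<T> takeForSlots(int slots) {
        List<T> taken = new ArrayList<>();
        while (taken.size() < slots) {
            if (!current.hasNext()) {
                current = rebuild.get();       // start the next cycle
                if (!current.hasNext()) break; // nothing at all to compact
            }
            taken.add(current.next());
        }
        return taken;
    }
}
```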
   
   A risk, of course, is that the information in the iterator becomes stale. 
But the system is distributed, so race conditions are expected, and the code 
should already handle that case. Another risk is that new data arrives faster 
than a cycle can complete. In that case, no scheduling algorithm can solve the 
problem: more compaction resources are the only answer.
   
   So, one question is, would the full-cycle idea work in practice?
   
   The current PR is clearly an improvement. A redesign, as outlined by 
@imply-cheddar and @kfaraz, would be a project, made more of a challenge by the 
fact that tests in this area appear to be sparse. The code is essentially 
untestable without a cluster because it is non-modular and tightly coupled with 
other parts of the system. So, a second question is: can we accept this 
short-term fix, or would we rather wait indefinitely for someone to tackle the 
sorely-needed redesign?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

