dw-genesys opened a new issue, #18827:
URL: https://github.com/apache/druid/issues/18827

   ### Description
   Currently (at least in Druid 28; I'm unsure whether this has changed since), 
when an invalid ingestion spec is received, the existing supervisor is deleted. 
This means a bug in ingestion spec generation or submission can quickly turn 
into a considerable outage. Instead, when an invalid supervisor spec is 
submitted, Druid should keep the supervisor running with its previous config 
and log the error.
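
To make the requested behavior concrete, here is a minimal sketch of "validate first, replace only on success". It is not Druid code: `SupervisorRegistry`, `validate_spec`, and `InvalidSpecError` are hypothetical names, the spec is a flat dict for brevity, and the only check is the `taskCountMin`/`taskCountMax` rule from this issue.

```python
import logging

logger = logging.getLogger("supervisor_registry")


class InvalidSpecError(ValueError):
    """Raised by the hypothetical validator when a spec is rejected."""


def validate_spec(spec: dict) -> None:
    # Hypothetical check modeled on the failure described in this issue.
    task_min = spec.get("taskCountMin")
    task_max = spec.get("taskCountMax")
    if task_min is not None and task_max is not None and task_min > task_max:
        raise InvalidSpecError(
            f"taskCountMin ({task_min}) is greater than taskCountMax ({task_max})"
        )


class SupervisorRegistry:
    """Toy in-memory stand-in for wherever supervisor specs are stored."""

    def __init__(self) -> None:
        self._specs: dict[str, dict] = {}

    def submit(self, supervisor_id: str, new_spec: dict) -> bool:
        """Replace the stored spec only if the new one validates."""
        try:
            validate_spec(new_spec)
        except InvalidSpecError as exc:
            # Keep the previous spec in place and log the rejection.
            logger.error("Rejected spec for %s: %s", supervisor_id, exc)
            return False
        self._specs[supervisor_id] = new_spec
        return True

    def get(self, supervisor_id: str):
        return self._specs.get(supervisor_id)
```

With this shape, a rejected update leaves `get(supervisor_id)` returning the last spec that passed validation, so the supervisor never ends up without a config.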
   
   ### Motivation
   I accidentally caused an overnight outage of our Druid ingestion due to a 
bug in supervisor config generation. It submitted a config that did not pass 
validation (taskCountMin was greater than taskCountMax). Druid caught the 
error, but then deleted the supervisor because it had no valid configuration. 
Our alerting didn't catch the failure because the entire supervisor was gone, 
so nothing was reporting that it was down. This outage would have been 
prevented if Druid didn't destroy supervisors on bad config updates.
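
Until the behavior changes, the alerting gap above could be narrowed by alerting when a supervisor is absent, not only when it reports bad status. A minimal sketch, assuming the list of active supervisor IDs has already been fetched (for example from the Overlord's supervisor listing endpoint); `missing_supervisors` and the example IDs are hypothetical:

```python
def missing_supervisors(expected_ids, active_ids):
    """Return the expected supervisor IDs that are not currently active."""
    active = set(active_ids)
    # Sorted so alerts are stable across runs.
    return sorted(sid for sid in expected_ids if sid not in active)


# Example: "clicks" should exist but is not in the active list.
print(missing_supervisors(["clicks", "orders"], ["orders"]))
```

A non-empty result would be the signal to page someone, since a deleted supervisor emits no data of its own.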


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

