abhishekrb19 commented on PR #19091:
URL: https://github.com/apache/druid/pull/19091#issuecomment-4015453698

   @kfaraz, thank you for fixing these ingestion errors:
`Inconsistency between stored metadata state [KafkaDataSourceMetadata]`.
We’ve seen these recur from time to time in our production clusters as well,
causing transient bursts of lag. A few questions:
   
   1. Are this fix and https://github.com/apache/druid/pull/19034 applicable 
only to auto-scaling, or do they apply more broadly whenever tasks roll over 
with task replicas enabled? We don’t have auto-scalers set up but still see 
these errors, so I’m wondering whether these fixes would actually help us.
   
   2. We hardcoded `MAX_RETRIES = 15` in `TransactionalSegmentPublisher` to 
account for slower publish times and the occasional spurious task failures 
caused by these metadata inconsistencies. It has been running in production 
for some time now. Some of our tables ingest a large volume of complex JSON 
objects, and we had noticed slower handoff times; increasing the retry count 
seemed to help to an extent.
   
   Would it be worth making this value configurable rather than hardcoding 
it to 13 in this patch? Alternatively, maybe it could be derived dynamically 
as a function of `completionTimeout`?
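
   To make the `completionTimeout` idea concrete, here is a rough sketch of 
one way the retry count could be derived. Everything in it is an assumption 
for illustration: the class `PublishRetryBudget`, the method `maxRetriesFor`, 
the backoff constants (1s base, doubling, capped at 60s, no jitter), and the 
floor and ceiling on retries are all hypothetical and are not taken from 
Druid or from this patch.

```java
import java.time.Duration;

/**
 * Hypothetical helper, not part of Druid: picks a publish retry count so that
 * the total backoff sleep fits within the supervisor's completionTimeout.
 */
final class PublishRetryBudget
{
  // Assumed backoff parameters, for illustration only.
  static final long BASE_SLEEP_MILLIS = 1_000;
  static final long MAX_SLEEP_MILLIS = 60_000;

  // Assumed bounds so the derived value never becomes too small or unbounded.
  static final int MIN_RETRIES = 5;
  static final int MAX_RETRIES_CAP = 50;

  static int maxRetriesFor(Duration completionTimeout)
  {
    final long budgetMillis = completionTimeout.toMillis();
    long elapsedMillis = 0;
    int retries = 0;

    while (retries < MAX_RETRIES_CAP) {
      // Exponential backoff: base * 2^retries, capped at MAX_SLEEP_MILLIS.
      // The shift is clamped to avoid overflow for large retry counts.
      final long sleepMillis =
          Math.min(MAX_SLEEP_MILLIS, BASE_SLEEP_MILLIS << Math.min(retries, 16));
      if (elapsedMillis + sleepMillis > budgetMillis) {
        break;
      }
      elapsedMillis += sleepMillis;
      retries++;
    }
    return Math.max(MIN_RETRIES, retries);
  }
}
```

   Under these assumed constants, a 30-minute `completionTimeout` would yield 
34 retries and a 1-minute timeout would fall back to the floor of 5. The 
sketch only budgets sleep time between attempts; it ignores how long each 
publish attempt itself takes, so a real implementation would probably want a 
more conservative budget.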


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

