abhishekrb19 commented on PR #19091: URL: https://github.com/apache/druid/pull/19091#issuecomment-4015453698
@kfaraz, thank you for fixing these ingestion errors: `Inconsistency between stored metadata state [KafkaDataSourceMetadata]`. We’ve seen them recur from time to time in our production clusters as well, causing transient lag bursts.

A few questions:

1. Are this fix and https://github.com/apache/druid/pull/19034 only applicable to auto-scaling, or are they broadly applicable when tasks roll over with task replicas enabled? We don’t have auto-scalers set up but still see these errors, so I’m wondering if these fixes would actually help us.

2. We hardcoded `MAX_RETRIES = 15` in `TransactionalSegmentPublisher` to account for slower publish times and occasional spurious task failures caused by these metadata inconsistencies. It has been running in production for some time now. Some of the tables ingest a large volume of complex JSON objects, so we’d noticed slower handoff times, and increasing the retry count seemed to help to an extent. Is it worth making this property configurable rather than hardcoding it to 13 in this patch? Alternatively, maybe it could be derived dynamically as a function of `completionTimeout`?
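
To make the `completionTimeout` idea a bit more concrete, here is a rough sketch of what I mean. It is not taken from the patch: the class `RetryBudget`, the method `maxRetriesFor`, and the backoff parameters (1 s base sleep doubling up to a 60 s cap) are all hypothetical, and the sketch ignores jitter and the time each publish attempt itself takes. It just counts how many backoff sleeps fit inside the timeout.

```java
import java.time.Duration;

// Hypothetical helper, not part of Druid: derives a retry count from a time budget.
final class RetryBudget {

    // Counts how many exponentially growing sleeps (baseSleep, 2x, 4x, ... capped at maxSleep)
    // fit within completionTimeout. Jitter and per-attempt publish time are ignored.
    static int maxRetriesFor(Duration completionTimeout, Duration baseSleep, Duration maxSleep) {
        long budgetMillis = completionTimeout.toMillis();
        long capMillis = maxSleep.toMillis();
        long sleepMillis = Math.min(baseSleep.toMillis(), capMillis);
        if (sleepMillis <= 0) {
            throw new IllegalArgumentException("baseSleep and maxSleep must be positive");
        }

        int retries = 0;
        long elapsedMillis = 0;
        while (elapsedMillis + sleepMillis <= budgetMillis) {
            elapsedMillis += sleepMillis;
            retries++;
            // Double the sleep for the next retry, never exceeding the cap.
            sleepMillis = Math.min(sleepMillis * 2, capMillis);
        }
        return retries;
    }

    public static void main(String[] args) {
        // Assumed values for illustration only: 30 min timeout, 1 s base sleep, 60 s cap.
        int retries = maxRetriesFor(
            Duration.ofMinutes(30), Duration.ofSeconds(1), Duration.ofSeconds(60));
        System.out.println("retries that fit in the budget: " + retries);
    }
}
```

With something along these lines, clusters with a longer `completionTimeout` would automatically get more publish retries without needing a separate config property, though a configurable value would also work for us.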
