aruraghuwanshi opened a new issue, #19038: URL: https://github.com/apache/druid/issues/19038
### Description

Add an optional mechanism to automatically sample (drop a configurable fraction of) incoming rows during streaming ingestion when the ingestion rate exceeds a configurable threshold. When the rate is below the threshold, all rows would be ingested as normal; when it exceeds the threshold, the system would apply sampling to reduce load. This would allow users to maintain ingestion stability during traffic spikes or bursts, trading some data completeness for system stability when scaling out (via the existing task autoscaler) is not sufficient or desirable.

### Motivation

**Use case:** Streaming ingestion from Kafka or Kinesis can experience sudden spikes in event volume. When the incoming rate exceeds what the cluster can sustainably process, lag accumulates, tasks may fail, backpressure builds, or the system becomes unstable. Today, Druid addresses this primarily by scaling out (adding more ingestion tasks via the autoscaler), but there are scenarios where:

- Additional task capacity is not immediately available (e.g., worker slots are constrained)
- Users prefer to sample during bursts rather than risk ingestion failures or growing lag
- The use case tolerates approximate data (e.g., metrics, sampling-friendly analytics)

**Rationale:** Providing a built-in, configurable way to sample when the rate exceeds a threshold would give operators a predictable, observable fallback when the stream overwhelms available capacity. It would complement (not replace) the existing autoscaling approach: users could configure both and have sampling kick in only when scaling alone is insufficient.

**Benefit:** Users would have a declarative way to prioritize ingestion stability over data completeness during overload, with sampled-out events tracked in existing metrics (e.g., `ingest/events/thrownAway` with a distinct reason) for visibility.

### Implementation considerations

Any implementation would likely build on existing building blocks:

- the row-level filtering applied during streaming ingestion (e.g., `InputRowFilter`, `FilteringCloseableInputRowIterator`)
- the throughput and thrown-away metrics already exposed per task (`RowIngestionMeters`, `DropwizardRowIngestionMeters` with its moving averages)
- the streaming tuning config as a home for new parameters (`SeekableStreamIndexTaskTuningConfig`)

The cost-based and lag-based autoscalers already consume task-level stats for scaling decisions; similar metrics could inform when sampling is warranted.
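As a rough illustration of the row-level hook this could use, below is a minimal sketch of a rate-threshold sampling predicate that could be composed with the existing row filtering (e.g., wherever `FilteringCloseableInputRowIterator` applies its predicate). The class name and the `maxRowsPerSecond` / `sampleRatio` parameters are hypothetical, not existing Druid APIs, and a real implementation would presumably reuse the moving averages already tracked by `RowIngestionMeters` rather than the simplistic one-second window shown here.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Predicate;

/**
 * Hypothetical sketch of a rate-threshold sampler. Rows are passed through
 * unchanged while the observed rate stays under maxRowsPerSecond; once the
 * threshold is exceeded within the current window, only sampleRatio of the
 * remaining rows are kept. Concurrency handling is deliberately simplified.
 */
public class RateThresholdSampler<T> implements Predicate<T>
{
  private final long maxRowsPerSecond;  // threshold above which sampling kicks in
  private final double sampleRatio;     // fraction of rows to keep while over the threshold

  private final AtomicLong windowStartMillis = new AtomicLong(System.currentTimeMillis());
  private final AtomicLong rowsInWindow = new AtomicLong(0);

  public RateThresholdSampler(long maxRowsPerSecond, double sampleRatio)
  {
    this.maxRowsPerSecond = maxRowsPerSecond;
    this.sampleRatio = sampleRatio;
  }

  @Override
  public boolean test(T row)
  {
    final long now = System.currentTimeMillis();
    // Reset the one-second window when it elapses; a production version would
    // likely derive the rate from the task's existing ingestion meters instead.
    if (now - windowStartMillis.get() >= 1000) {
      windowStartMillis.set(now);
      rowsInWindow.set(0);
    }
    final long seen = rowsInWindow.incrementAndGet();
    if (seen <= maxRowsPerSecond) {
      return true; // under the threshold: ingest everything
    }
    // Over the threshold: probabilistically keep sampleRatio of the excess rows.
    return ThreadLocalRandom.current().nextDouble() < sampleRatio;
  }
}
```

Rows rejected by such a predicate could then be counted through the task's ingestion meters (e.g., as thrown-away events with a distinct reason, as suggested above), and the threshold and ratio could be exposed as optional fields on the streaming tuning config so that sampling stays disabled unless explicitly configured.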
