In conclusion, I think the current design of the load/drop rules is flexible and can handle almost all scenarios. However, the current `PeriodDropRule` is impractical because it always drops recent data. So people who want to `retain 30 days` of data cannot use `PeriodDropRule`; instead they have to configure: load 30 days, drop forever. And because they use the `drop forever` rule, the following occurs:

> 2. The user loads some data from slightly in the future (maybe some clocks are running a bit fast or slow) using streaming ingestion. This creates a segment with an interval that is in the future.
> 3. The coordinator disables the segment immediately upon noticing it (since it is not within the last 30 days).
> 4. The Kafka tasks time out during handoff (because the segments are never loaded).
> 5. And after that timeout, the data that was slightly in the future is still not available!
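To make the quoted scenario concrete, here is a rough Python model of first-match rule evaluation. It is only a sketch: the helper names (`overlaps_last`, `contained_in_last`, `first_match`) are made up for illustration, the reference date is arbitrary, and the interval checks are my simplified reading of how the period rules behave, not the actual coordinator code.

```python
from datetime import datetime, timedelta

def overlaps_last(days, start, end, now):
    # Simplified period load rule: segment overlaps [now - days, now).
    return start < now and end > now - timedelta(days=days)

def contained_in_last(days, start, end, now):
    # Simplified period drop rule: segment lies entirely inside [now - days, now).
    return start >= now - timedelta(days=days) and end <= now

def first_match(rules, start, end, now):
    # Rules are checked in order; the first rule that applies decides.
    for action, applies in rules:
        if applies(start, end, now):
            return action
    return None

now = datetime(2018, 6, 15)  # arbitrary reference time (assumption)
hour = timedelta(hours=1)
recent = (now - 2 * hour, now - hour)   # segment from the last few hours
future = (now + hour, now + 2 * hour)   # segment slightly in the future

# "load 30 days, drop forever"
retain_30_days = [
    ("load", lambda s, e, n: overlaps_last(30, s, e, n)),
    ("drop", lambda s, e, n: True),
]

print(contained_in_last(30, *recent, now))        # True: a period drop rule would drop recent data
print(first_match(retain_30_days, *recent, now))  # load
print(first_match(retain_30_days, *future, now))  # drop
```

In this model the future segment does not overlap the last 30 days, so it falls through to `drop forever`, which matches steps 2 and 3 of the quote.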
I think there are two ways to solve this:

1. Make period load rules include the future by default.
2. Add a new drop rule, or modify the current `PeriodDropRule`, to support `drop before a period`. Then people who want to `retain 30 days` of data can configure: drop before 30 days, load forever.

I prefer the second way, and I would rather modify the current `PeriodDropRule` than add a new one, because the current one is very impractical; IMO nobody would want to use such a drop rule.

[ Full content available at: https://github.com/apache/incubator-druid/issues/5869 ]
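As a rough illustration of the second option, the sketch below models a "drop before a period" rule followed by load forever. The names `ends_before` and `decide` are hypothetical, the boundary handling (a segment ending exactly at the cutoff is dropped) is an assumption, and the reference date is arbitrary; none of this is taken from existing code.

```python
from datetime import datetime, timedelta

def ends_before(days, end, now):
    # Hypothetical "drop before a period" check: segment ends at or before now - days.
    return end <= now - timedelta(days=days)

def decide(start, end, now, retain_days=30):
    # Rule chain "drop before 30 days, load forever", first match wins.
    if ends_before(retain_days, end, now):
        return "drop"
    return "load"

now = datetime(2018, 6, 15)  # arbitrary reference time (assumption)
hour = timedelta(hours=1)
old = (now - timedelta(days=60), now - timedelta(days=59))
recent = (now - 2 * hour, now - hour)
future = (now + hour, now + 2 * hour)

print(decide(*old, now))     # drop
print(decide(*recent, now))  # load
print(decide(*future, now))  # load
```

With this rule chain, only segments older than the retention window are dropped, so a segment slightly in the future is loaded and handoff can complete.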
