gianm commented on issue #12929: URL: https://github.com/apache/druid/issues/12929#issuecomment-1227920128
This kind of approach is definitely great for performance. The main reason we haven't done it yet is that it can be too fiddly for most users. Key problems: what happens if the producer doesn't compute the hash codes correctly? What if we want to add partitions to the Kafka topic? We would want to build something that the average user could use easily and correctly.

Something I had thought of in the past, but never implemented, was to do something like you suggest, but add an extra safety step that detects whether the Kafka producer has partitioned the data correctly. If it has, we write shard specs that are pruneable; otherwise we write ones that are not.

Another option would be to use Bloom filters for pruning rather than a hash function; that way we are flexible about however the producer has partitioned the data. Bloom filters could be too big to stuff into a shard spec, though.

I'm wondering if you have thoughts on how this could be done in a way that is robust to the user doing partitioning "wrong", adding partitions, etc.?

Btw, I'm not sure if you are doing this already, but after handoff you can use reindexing / compaction to rewrite data with `range` partitioning and get pruneable shard specs. We do this in production for some datasources that originate from Kafka.
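For the compaction route, the rewrite is driven by the `partitionsSpec` in the compaction task's tuning config; roughly along these lines (a sketch, not a complete spec — dimension names and row targets here are placeholders, so check the Druid docs for your version):

```
"tuningConfig": {
  "partitionsSpec": {
    "type": "range",
    "partitionDimensions": ["someDim"],
    "targetRowsPerSegment": 5000000
  }
}
```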
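To make the "safety step" idea concrete, here is a minimal sketch of the check (Python, purely illustrative: the function and field names are made up, and a real implementation would live in Druid's Kafka indexing task and use the producer's actual partitioner, e.g. Kafka's murmur2-based default):

```python
def is_consistently_partitioned(records, num_partitions, hash_fn):
    """Return True only if every observed record landed on the partition
    its key hashes to under the assumed producer hash function.

    records: iterable of (key, partition) pairs seen during ingestion.
    If this holds, the task could safely write hash-based, pruneable
    shard specs; otherwise it falls back to non-pruneable ones.
    """
    return all(hash_fn(key) % num_partitions == partition
               for key, partition in records)

# Illustrative use with a stand-in hash (not Kafka's real partitioner):
h = lambda k: sum(k.encode())
good = [("user-1", h("user-1") % 4), ("user-2", h("user-2") % 4)]
bad = good + [("user-3", (h("user-3") + 1) % 4)]  # one mis-routed record
assert is_consistently_partitioned(good, 4, h)
assert not is_consistently_partitioned(bad, 4, h)
```

The check is cheap because it only compares each record's observed partition against the expected one; a single mismatch is enough to disable pruning for the affected segments.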
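The Bloom-filter alternative would have each segment carry a small filter over its partition-key values, so that at query time any segment whose filter definitely excludes the queried key can be pruned, regardless of how the producer partitioned. A minimal sketch (illustrative only; the class, sizes, and hash choices are arbitrary, and this ignores the size concern noted above):

```python
import hashlib

class TinyBloom:
    """Minimal Bloom filter: m bits, k hash positions derived from SHA-256."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits >> pos & 1 for pos in self._positions(item))

# Per-segment filters built at ingestion; pruning at query time:
segments = {"seg-a": TinyBloom(), "seg-b": TinyBloom()}
segments["seg-a"].add("user-1")
segments["seg-b"].add("user-2")
candidates = [name for name, bf in segments.items()
              if bf.might_contain("user-1")]
assert "seg-a" in candidates  # seg-b is pruned unless a false positive occurs
```

False positives only cost extra segment scans, never wrong results, which is what makes this scheme tolerant of arbitrary producer partitioning.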
