gianm commented on issue #12929:
URL: https://github.com/apache/druid/issues/12929#issuecomment-1227920128

   This kind of approach is definitely great for performance. The main reason 
we haven't done it yet is that it can be too fiddly for most users. A key 
problem is: what happens if the producer doesn't compute the hash codes 
correctly? What if we want to add partitions to the Kafka topic? We would want 
to build something that the average user could use easily and correctly.
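   
   To make the add-partitions problem concrete, here's a toy illustration. It 
assumes simple modulo-on-hash routing, which is not necessarily what any given 
producer uses: adding a partition remaps most keys, so pruneable shard specs 
written under the old partition count would describe the wrong locations.
   
   ```python
   # Toy illustration (assumed: producer routes records via hash % partition_count).
   # Adding a partition changes the mapping for many keys, so pruneable shard
   # specs written before the change no longer describe where the data lives.
   keys = list(range(12))

   before = [k % 3 for k in keys]  # topic originally has 3 partitions
   after = [k % 4 for k in keys]   # one partition added later

   # Count how many keys now route to a different partition than before.
   moved = sum(b != a for b, a in zip(before, after))
   print(moved, "of", len(keys), "keys changed partitions")
   ```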
   
   Something I had thought of in the past, but never implemented, was to do 
something like you suggest, but add an extra safety step that detects whether 
the Kafka producer has partitioned the data correctly. If it has, we write 
shard specs that are pruneable; otherwise we write ones that are not. Another 
option would be to use Bloom filters for pruning rather than a hash function; 
that way we are flexible with respect to however the producer has partitioned 
the data. Bloom filters could be too big to stuff into a shard spec, though.
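   
   The safety step could look something like this sketch. The `hash_fn` here 
is a placeholder standing in for whatever hash function the shard specs would 
assume (Druid's actual hash and the Kafka default partitioner's hash are not 
modeled here); the record shape is also a simplifying assumption:
   
   ```python
   # Hypothetical safety check run during ingestion: verify that every observed
   # record landed on the partition the expected hash function would assign.
   # If any record disagrees, fall back to writing non-pruneable shard specs.
   def partitioning_is_consistent(records, num_partitions, hash_fn):
       """records: iterable of (key, observed_partition) pairs."""
       for key, observed_partition in records:
           if hash_fn(key) % num_partitions != observed_partition:
               return False  # producer used some other partitioning scheme
       return True

   # Example with a stand-in hash function:
   hash_fn = lambda key: sum(key.encode())
   good = [("a", hash_fn("a") % 4), ("b", hash_fn("b") % 4)]
   bad = good + [("c", (hash_fn("c") + 1) % 4)]
   print(partitioning_is_consistent(good, 4, hash_fn))  # True
   print(partitioning_is_consistent(bad, 4, hash_fn))   # False
   ```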
   
   I'm wondering if you have thoughts on how this could be done in a way 
that is robust to the user doing partitioning "wrong", adding partitions, etc.?
   
   Btw, I'm not sure if you are doing this already, but after handoff you can 
use reindexing / compaction to rewrite data with `range` partitioning and get 
pruneable shard specs. We do this in production for some datasources that 
originate from Kafka.
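   
   For reference, a compaction task along those lines might look like the 
following (shown as a Python dict mirroring the task JSON; the field names 
follow Druid's documented `range` partitionsSpec, but the datasource name, 
interval, dimension, and sizing are placeholder assumptions):
   
   ```python
   # Sketch of a compaction task that rewrites segments with `range`
   # partitioning, so queries filtered on the partition dimensions can
   # prune shard specs. Datasource, interval, and sizing are placeholders.
   compaction_task = {
       "type": "compact",
       "dataSource": "my_kafka_datasource",
       "ioConfig": {
           "type": "compact",
           "inputSpec": {"type": "interval", "interval": "2022-08-01/2022-08-02"},
       },
       "tuningConfig": {
           "type": "index_parallel",
           "partitionsSpec": {
               "type": "range",
               "partitionDimensions": ["userId"],
               "targetRowsPerSegment": 5000000,
           },
           # range partitioning requires perfect rollup
           "forceGuaranteedRollup": True,
       },
   }
   ```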


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

