[GitHub] [druid] 599166320 opened a new issue, #12929: Add Kafka hash partition type to improve query performance

GitBox Sat, 20 Aug 2022 00:20:29 -0700


599166320 opened a new issue, #12929:
URL: https://github.com/apache/druid/issues/12929


   ### Motivation
   
   We use Druid to store a large amount of monitoring data. With the default 
Kafka indexing service ingestion, query performance is very poor.
   
   Our analysis shows that, by default, monitoring data is written to the 
Kafka partitions at random. The Kafka peons then consume it without regard to 
its content and build indexes to generate segments, so rows with the same 
dimension values end up spread across many segments.
   
   At query time, the broker can only prune segments by time interval; it 
cannot prune them further using the query's filter conditions. A query 
therefore has to scan many real-time nodes, historical nodes, and segments, 
which results in poor performance.
   
   
   ### Proposed changes
   
   1. Add a new partition type to the Kafka real-time indexing service: a 
`KafkaPartitionNumberedShardSpec` class that extends `NumberedShardSpec` and 
overrides `possibleInDomain` to implement hash filtering. The class adds the 
following core fields: `type = "kafka_partition"`, `kafkapartitionids`, and 
`partitiondimensions`.
   
   2. Modify the `KafkaIndexTask` class so that `kafkaindextask.newdriver` 
supports the `KafkaPartitionNumberedShardSpec` type.
   
   3. Add a `partitionfunction` field to the Kafka ingestion spec, and list in 
it all the fields that need to be hashed.
   
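   4. Add a simple hash step on the data production side:
   
   ```
   // hash(...) must return a valid partition number for the topic
   int p = hash(dim1, dim2, ...);
   // write to partition p explicitly; the record key is left null
   producer.send(new ProducerRecord<byte[], byte[]>(topic, p, null, event));
   ```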
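   
   As a rough illustration of the producer-side step, the sketch below maps a 
list of dimension values to a partition number. `DimensionPartitioner`, 
`partitionFor`, and the particular hash are hypothetical names and choices made 
for this example; they are not part of Druid or the Kafka client, and the 
proposal does not specify which hash function is used.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Druid or Kafka.
final class DimensionPartitioner {
    private DimensionPartitioner() {}

    /** Maps the given dimension values to a partition in [0, numPartitions). */
    static int partitionFor(List<String> dimensionValues, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        int h = 17;
        for (String value : dimensionValues) {
            // Hash the UTF-8 bytes so the result depends only on the values
            // and their order; a null dimension is treated as an empty string.
            byte[] bytes = (value == null ? "" : value).getBytes(StandardCharsets.UTF_8);
            h = 31 * h + Arrays.hashCode(bytes);
        }
        // floorMod keeps the result non-negative even when h is negative.
        return Math.floorMod(h, numPartitions);
    }
}
```

   The producer would pass the returned value as the partition argument of 
`ProducerRecord`. For pruning to be correct, the query side has to apply 
exactly the same function to the same dimensions in the same order.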
   
   
   ### Rationale
   Partitioning the data at write time benefits compression and sorting. At 
query time, the user's filter conditions further prune the set of segments to 
scan, which improves the performance of concurrent queries.
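   
   The query-time pruning can be sketched as follows. 
`KafkaPartitionPruningSketch` is a hypothetical stand-in for the proposed shard 
spec: it uses a simplified `possibleInDomain` signature rather than Druid's 
actual `ShardSpec` API, and its hash is a placeholder for whatever function the 
producer really uses.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the proposed shard spec; not Druid's ShardSpec API.
final class KafkaPartitionPruningSketch {
    private final Set<Integer> kafkaPartitionIds; // partitions this segment was built from
    private final int numPartitions;              // partition count of the topic

    KafkaPartitionPruningSketch(Set<Integer> kafkaPartitionIds, int numPartitions) {
        this.kafkaPartitionIds = new HashSet<>(kafkaPartitionIds);
        this.numPartitions = numPartitions;
    }

    /** Placeholder hash; assumed to be identical to the one on the producer side. */
    static int partitionFor(List<String> dimensionValues, int numPartitions) {
        return Math.floorMod(dimensionValues.hashCode(), numPartitions);
    }

    /**
     * Each element of candidateValues is one combination of partition-dimension
     * values that the query filter allows. An empty list means the filter does
     * not pin down the partition dimensions, so the segment must be kept.
     */
    boolean possibleInDomain(List<List<String>> candidateValues) {
        if (candidateValues.isEmpty()) {
            return true;
        }
        for (List<String> values : candidateValues) {
            if (kafkaPartitionIds.contains(partitionFor(values, numPartitions))) {
                return true; // a matching row could live in this segment
            }
        }
        return false; // no allowed combination hashes to this segment's partitions
    }
}
```

   With this check, a filter such as `dim1 = 'a' AND dim2 = 'b'` only needs to 
touch segments built from the single Kafka partition that the pair hashes to, 
while a filter that does not constrain the partition dimensions still scans 
every segment in the time interval.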
   


