[ https://issues.apache.org/jira/browse/DRILL-7998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Justin Chen updated DRILL-7998:
-------------------------------
Priority: Critical (was: Major)
> Drill queries for Kafka storage plugin returning incorrect/missing result set
> -----------------------------------------------------------------------------
>
> Key: DRILL-7998
> URL: https://issues.apache.org/jira/browse/DRILL-7998
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Kafka
> Affects Versions: 1.19.0
> Reporter: Justin Chen
> Priority: Critical
> Attachments: case1_1.png, case1_2.png, case2_1.png, case2_2.png,
> case2_3.png
>
>
> My team and I have encountered two scenarios in which querying Kafka returns
> an incorrect result set. I'm unsure whether they share the same root
> cause.
> *Case 1:*
> Queries with an ORDER BY clause using kafkaMsgTimestamp return incorrect
> results.
> topic_1 is a topic with 2 partitions, no log compaction, in JSON format.
>
> {code:sql}
> SELECT * FROM kafka.`topic_1` ORDER BY kafkaMsgTimestamp DESC LIMIT 10
> {code}
> Image attachment case1_1 shows that the latest kafkaMsgTimestamp was
> 1630631881114 (Fri Sep 03 2021 01:18:01 GMT+0000).
>
> However, applying a pushdown filter using kafkaMsgTimestamp with timestamp
> 1631160000000 (Thu Sep 09 2021 04:00:00 GMT+0000):
>
> {code:sql}
> SELECT * FROM kafka.`topic_1` WHERE kafkaMsgTimestamp > 1631160000000 LIMIT 10
> {code}
> Image attachment case1_2 shows that there are many messages with more recent
> timestamps. Thus, ordering on kafkaMsgTimestamp does not seem to return
> correct results.
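
The contradiction in Case 1 can be stated as a small check. The sketch below is only an illustration of the reasoning, not part of Drill: the helper names `ms_to_utc` and `results_are_consistent` are hypothetical, and the two timestamps are the ones quoted in the report. If a filter on `kafkaMsgTimestamp > T` returns rows, the top row of an `ORDER BY kafkaMsgTimestamp DESC` query must also have a timestamp greater than `T`.

```python
from datetime import datetime, timezone


def ms_to_utc(epoch_ms: int) -> datetime:
    """Convert a timestamp in milliseconds since the epoch to a UTC datetime."""
    seconds, millis = divmod(epoch_ms, 1000)
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return base.replace(microsecond=millis * 1000)


def results_are_consistent(top_ts_ms: int, threshold_ms: int,
                           filter_returned_rows: bool) -> bool:
    """Return True when both query results could be correct at once.

    top_ts_ms is the largest timestamp reported by the ORDER BY ... DESC query.
    If the filtered query (timestamp > threshold_ms) returned any rows, the
    largest timestamp in the topic must exceed the threshold as well.
    """
    if not filter_returned_rows:
        return True
    return top_ts_ms > threshold_ms


# Values quoted in the report (case1_1 and case1_2).
latest_from_order_by = 1630631881114
pushdown_threshold = 1631160000000

print(ms_to_utc(latest_from_order_by).isoformat())
print(ms_to_utc(pushdown_threshold).isoformat())
print(results_are_consistent(latest_from_order_by, pushdown_threshold, True))
```

With the reported values the check returns `False`: the two results cannot both be right, which is what the screenshots suggest.
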
>
> *Case 2:*
> Queries are not returning correct results when using a WHERE clause unless an
> exact partition id and offset are provided.
> topic_2 is a topic with > 200 partitions, using log compaction and in AVRO
> format.
>
> {code:sql}
> SELECT * FROM kafka.`topic_2`
> WHERE `topic_2`.after.id = 1 AND `topic_2`.after.shop_id = 2
> LIMIT 1
> {code}
> Image attachment case2_1 shows that no such record exists with id 1 and
> shop_id 2.
> However, we manually confirmed the record exists using a consumer and found
> its kafkaPartitionId and kafkaMsgOffset. We then added WHERE conditions on
> kafkaPartitionId and kafkaMsgTimestamp to speed up the query:
> {code:sql}
> SELECT * FROM kafka.`topic_2`
> WHERE `topic_2`.after.id = 1 AND `topic_2`.after.shop_id = 2
>   AND kafkaPartitionId = 110 AND kafkaMsgTimestamp > 1628196400000
> LIMIT 1
> {code}
> Image attachment case2_2 shows that the record still cannot be found by
> Drill.
>
> Finally, the exact kafkaMsgOffset was specified, along with kafkaPartitionId
> and kafkaMsgTimestamp:
> {code:sql}
> SELECT * FROM kafka.`topic_2`
> WHERE `topic_2`.after.id = 1 AND `topic_2`.after.shop_id = 2
>   AND kafkaPartitionId = 110 AND kafkaMsgOffset = 85785074
>   AND kafkaMsgTimestamp > 1628196400000
> LIMIT 1
> {code}
> Image attachment case2_3 shows that Drill was only able to find the record
> when an exact partition and message offset were provided.
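
The three WHERE clauses in Case 2 are nested: each one only adds conditions to the previous one. A minimal sketch of that reasoning follows, evaluated in plain Python rather than in Drill. The record fields come from the report, except the timestamp, which is a hypothetical placeholder because the report only states that it is greater than 1628196400000. The predicate names `q1`, `q2` and `q3` are also just labels for this illustration.

```python
# Record details from the report. The timestamp is an assumed value; the
# report only says it is greater than 1628196400000.
record = {
    "after": {"id": 1, "shop_id": 2},
    "kafkaPartitionId": 110,
    "kafkaMsgOffset": 85785074,
    "kafkaMsgTimestamp": 1628196400001,  # hypothetical placeholder
}


def q1(r: dict) -> bool:
    """First query: filter on the payload fields only."""
    return r["after"]["id"] == 1 and r["after"]["shop_id"] == 2


def q2(r: dict) -> bool:
    """Second query: q1 plus partition id and timestamp lower bound."""
    return (q1(r)
            and r["kafkaPartitionId"] == 110
            and r["kafkaMsgTimestamp"] > 1628196400000)


def q3(r: dict) -> bool:
    """Third query: q2 plus the exact message offset."""
    return q2(r) and r["kafkaMsgOffset"] == 85785074


matches = [q(record) for q in (q1, q2, q3)]
print(matches)
```

Because any record that satisfies the narrowest filter also satisfies the two broader ones, a row returned by the third query should make the first two queries return at least one row as well.
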
>
> Is there any explanation for this behavior, or is this a bug? Thank you!
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)