[
https://issues.apache.org/jira/browse/DRILL-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491250#comment-16491250
]
ASF GitHub Bot commented on DRILL-5977:
---------------------------------------
aravi5 commented on issue #1272: DRILL-5977: Filter Pushdown in Drill-Kafka
plugin
URL: https://github.com/apache/drill/pull/1272#issuecomment-392181900
> Yes, IMO users who apply these predicates on offsets should be aware of the offset scope per partition. So, in such cases the offset predicates without a partitionId can be ignored and it can remain a full scan, as these are invalid queries from a Kafka perspective. Please let me know your thoughts.
Thanks @akumarb2010 for clarifying. I agree with you that it is uncommon for a user to specify conditions on `offsets` alone, but DRILL will still support such queries and will return the messages that satisfy the condition (across all partitions).
My opinion is that pushdown for such queries is OKAY and a change to NOT push down may not be required, for the following reasons -
1. Changing to NOT push down would require a LOT more effort and would add a lot more complexity to the code, while providing the same result to the user. The current implementation of pushdown is such that, while parsing the condition tree, a scan spec is generated for each predicate independently and the specs are then merged based on how the predicates are joined.
For example, `kafkaMsgOffset > 1000 AND kafkaPartitionId > 3` would create a scan spec (a collection of `KafkaPartitionScanSpec`, one per partition) for each of the predicates independently, and the scan specs are later merged based on the `AND` condition (a minimal sketch of this merge follows the list below).
With this implementation, building the scan spec for `kafkaMsgOffset` does not need to know whether `kafkaPartitionId` is specified, and the same applies to `kafkaPartitionId`.
> These are not drawbacks, but my point is let's not handle these invalid queries.
2. DRILL supports such queries (though they may not make complete sense in the Kafka world), and applying pushdown for them does not have any drawback.
3. Such queries may be common during the data exploration phase. For example, a user may want to see the first few messages produced to a topic (the user may not know how the messages were distributed across partitions when they were produced).
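To make the first point concrete, here is a minimal sketch (not the actual plugin code) of how per-predicate, per-partition scan specs could be merged for an `AND` condition. The class layout, the `mergeAnd` helper, and the offset-range fields are simplified assumptions for illustration only; the real `KafkaPartitionScanSpec` in the plugin may look different.

```java
import java.util.HashMap;
import java.util.Map;

public class ScanSpecMergeSketch {

  // Simplified stand-in for the plugin's per-partition scan spec:
  // a partition id plus an offset range [startOffset, endOffset).
  static class KafkaPartitionScanSpec {
    final int partitionId;
    final long startOffset;
    final long endOffset;

    KafkaPartitionScanSpec(int partitionId, long startOffset, long endOffset) {
      this.partitionId = partitionId;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
    }

    @Override
    public String toString() {
      return "partition " + partitionId + " [" + startOffset + ", " + endOffset + ")";
    }
  }

  // AND-merge of two per-partition spec collections: keep only partitions that
  // appear on both sides and intersect their offset ranges.
  static Map<Integer, KafkaPartitionScanSpec> mergeAnd(
      Map<Integer, KafkaPartitionScanSpec> left,
      Map<Integer, KafkaPartitionScanSpec> right) {
    Map<Integer, KafkaPartitionScanSpec> merged = new HashMap<>();
    for (Map.Entry<Integer, KafkaPartitionScanSpec> e : left.entrySet()) {
      KafkaPartitionScanSpec r = right.get(e.getKey());
      if (r != null) {
        merged.put(e.getKey(),
            new KafkaPartitionScanSpec(e.getKey(),
                Math.max(e.getValue().startOffset, r.startOffset),
                Math.min(e.getValue().endOffset, r.endOffset)));
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    // Pretend the topic has partitions 0..4, each holding offsets [0, 5000).
    // Predicate A: kafkaMsgOffset > 1000  -> one spec per partition, offsets [1001, 5000)
    // Predicate B: kafkaPartitionId > 3   -> specs only for partition 4, full offset range
    Map<Integer, KafkaPartitionScanSpec> offsetSpecs = new HashMap<>();
    Map<Integer, KafkaPartitionScanSpec> partitionSpecs = new HashMap<>();
    for (int p = 0; p < 5; p++) {
      offsetSpecs.put(p, new KafkaPartitionScanSpec(p, 1001, 5000));
      if (p > 3) {
        partitionSpecs.put(p, new KafkaPartitionScanSpec(p, 0, 5000));
      }
    }

    // AND-merge: only partition 4 survives, with offsets [1001, 5000).
    System.out.println(mergeAnd(offsetSpecs, partitionSpecs).values());
  }
}
```

The point of the sketch is that each predicate builds its specs without knowing about the other; only the merge step combines them. This is why an offset-only predicate naturally falls out as "all partitions, restricted offset range" with no special handling.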
For the sake of code simplicity, and since there is no user impact in terms of functionality (just better performance), I am leaning towards applying pushdown uniformly across all partitions if `kafkaPartitionId` is not specified.
Let me know your thoughts.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> predicate pushdown support kafkaMsgOffset
> -----------------------------------------
>
> Key: DRILL-5977
> URL: https://issues.apache.org/jira/browse/DRILL-5977
> Project: Apache Drill
> Issue Type: Improvement
> Reporter: B Anil Kumar
> Assignee: Abhishek Ravi
> Priority: Major
> Fix For: 1.14.0
>
>
> As part of Kafka storage plugin review, below is the suggestion from Paul.
> {noformat}
> Does it make sense to provide a way to select a range of messages: a starting
> point or a count? Perhaps I want to run my query every five minutes, scanning
> only those messages since the previous scan. Or, I want to limit my take to,
> say, the next 1000 messages. Could we use a pseudo-column such as
> "kafkaMsgOffset" for that purpose? Maybe
> SELECT * FROM <some topic> WHERE kafkaMsgOffset > 12345
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)