[ https://issues.apache.org/jira/browse/DRILL-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491250#comment-16491250 ]

ASF GitHub Bot commented on DRILL-5977:
---------------------------------------

aravi5 commented on issue #1272: DRILL-5977: Filter Pushdown in Drill-Kafka 
plugin
URL: https://github.com/apache/drill/pull/1272#issuecomment-392181900
 
 
   >Yes, IMO the users who are applying these predicates on offsets should be 
aware of the offset scope per partition. So, in such cases, offset predicates 
without partitionId can be ignored and left as a full scan, since these are 
invalid queries from a Kafka perspective. Please let me know your thoughts.
   
   Thanks @akumarb2010 for clarifying. I agree with you that it is uncommon for 
a user to specify conditions on `offsets` alone, but DRILL will support such 
queries and will return messages that satisfy the condition (across all 
partitions). 
   
   My opinion on this is that pushdown for such queries is OKAY, and a change to 
NOT push down may not be required, for the following reasons:
   
   1. Changes to NOT push down would require a LOT more effort and would add a 
lot more complexity to the code while providing the same result to the user.
   
   The current implementation of pushdown is such that, when parsing the condition 
tree, a scan spec is generated for each predicate independently; the scan specs 
are then merged based on how the different predicates are joined. 
   For example, `kafkaMsgOffset > 1000 AND kafkaPartitionId > 3` would create a 
scan spec (a collection of `KafkaPartitionScanSpec`, one per partition) for each 
of the predicates independently, and the scan specs are later merged based on the 
`AND` condition.
   With this implementation, creating the scan spec for `kafkaMsgOffset` does 
not need to know whether `kafkaPartitionId` is specified, and the same applies 
to `kafkaPartitionId`. 
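   The merge described above can be sketched as follows. This is a simplified, 
hypothetical model, not Drill's actual implementation: here a 
`KafkaPartitionScanSpec` is reduced to a half-open `(start, end)` offset range 
keyed by partition id, and an `AND` merge intersects the ranges partition by 
partition.

```python
# Hypothetical sketch of the per-predicate scan-spec merge (not Drill code).
# Each predicate independently yields a {partition_id: (start, end)} map;
# an AND join intersects the maps partition by partition.

def and_merge(specs_a, specs_b):
    """Intersect two {partition_id: (start, end)} scan-spec maps.

    A partition survives only if both predicates select it and their
    offset ranges overlap.
    """
    merged = {}
    for pid in specs_a.keys() & specs_b.keys():
        start = max(specs_a[pid][0], specs_b[pid][0])
        end = min(specs_a[pid][1], specs_b[pid][1])
        if start < end:
            merged[pid] = (start, end)
    return merged

# Predicate 1: kafkaMsgOffset > 1000, applied to every partition independently
# (end offsets here are illustrative placeholders).
offset_pred = {pid: (1001, 10000) for pid in range(5)}
# Predicate 2: kafkaPartitionId > 3 keeps only partition 4, full offset range.
partition_pred = {4: (0, 10000)}

print(and_merge(offset_pred, partition_pred))  # {4: (1001, 10000)}
```

   Note that if the `kafkaPartitionId` predicate were absent, the offset 
predicate's map would simply survive the merge for all partitions, which is the 
uniform-pushdown behavior argued for above.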
   
   >These are not drawbacks, but my point is let's not handle these invalid 
queries.
   
   2. DRILL supports such queries (though they may not make complete sense in the 
Kafka world), and applying pushdown for such queries does not have any drawback.
   
   3. Such queries may be common during the data exploration phase. For example, 
a user may want to see the first few messages produced to a topic (the user may 
not know how the messages were distributed across partitions when they were 
produced).
   
   For the sake of code simplicity and no user impact in terms of functionality 
(just better performance), I am leaning towards applying pushdown uniformly 
across partitions when `kafkaPartitionId` is not specified.
   
   Let me know your thoughts.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> predicate pushdown support kafkaMsgOffset
> -----------------------------------------
>
>                 Key: DRILL-5977
>                 URL: https://issues.apache.org/jira/browse/DRILL-5977
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: B Anil Kumar
>            Assignee: Abhishek Ravi
>            Priority: Major
>             Fix For: 1.14.0
>
>
> As part of Kafka storage plugin review, below is the suggestion from Paul.
> {noformat}
> Does it make sense to provide a way to select a range of messages: a starting 
> point or a count? Perhaps I want to run my query every five minutes, scanning 
> only those messages since the previous scan. Or, I want to limit my take to, 
> say, the next 1000 messages. Could we use a pseudo-column such as 
> "kafkaMsgOffset" for that purpose? Maybe
> SELECT * FROM <some topic> WHERE kafkaMsgOffset > 12345
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
