[ https://issues.apache.org/jira/browse/SPARK-26841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849454#comment-16849454 ]
Tomas Bartalos commented on SPARK-26841:
----------------------------------------

Hi [~Yohan123],

Those queries are useful when you want to query data directly from Kafka. Let's say Kafka has a retention of 1 week, but you're interested in only the last 4 hours of data to create a live report. The current Spark implementation will do a "full-table" scan of the whole week and then keep only the last 4 hours in Spark, which is very inefficient. My proposal is to push the timestamp filter down to Kafka during the initial query and get back only 4 hours of data, which makes queries fast.

> Timestamp pushdown on Kafka table
> ---------------------------------
>
>                 Key: SPARK-26841
>                 URL: https://issues.apache.org/jira/browse/SPARK-26841
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.4.0
>            Reporter: Tomas Bartalos
>            Priority: Major
>              Labels: Kafka, pushdown, timestamp
>
> As a Spark user I'd like to have fast queries on a Kafka table restricted by
> timestamp.
> I'd like to have quick answers to questions like:
>  * What was inserted into Kafka in the past x minutes
>  * What was inserted into Kafka in a specified time range
> Example:
> {quote}select * from kafka_table where timestamp >
> from_unixtime(unix_timestamp() - 5 * 60, "yyyy-MM-dd HH:mm:ss")
> select * from kafka_table where timestamp > $from_time and timestamp <
> $end_time
> {quote}
> Currently timestamp restrictions are not pushed down to KafkaRelation, and
> querying by timestamp on a large Kafka topic takes forever to complete.
> *Technical solution*
> Technically it's possible to retrieve Kafka's offsets for a provided timestamp
> with the org.apache.kafka.clients.consumer.Consumer#offsetsForTimes(..) method.
> Afterwards we can query the Kafka topic by the retrieved offset ranges.
> Querying by offset range is already implemented, so this change should have
> minor impact.
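
For illustration, the lookup described under *Technical solution* can be modelled without a broker. The sketch below is not Spark or Kafka code. It is a pure-Python model of a single partition, and it assumes that record timestamps never decrease as offsets grow and that the log starts at offset 0 (neither is guaranteed on a real topic). The names `offset_for_time` and `offset_range` are hypothetical. The first function follows the documented contract of `Consumer#offsetsForTimes`: it returns the earliest offset whose timestamp is greater than or equal to the requested one, or nothing when no such record exists.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def offset_for_time(timestamps_ms: List[int], target_ms: int) -> Optional[int]:
    """Earliest offset whose timestamp is >= target_ms, or None if there is none.

    Models one partition as a list where index == offset (assumption).
    Requires timestamps_ms to be sorted in non-decreasing order (assumption).
    """
    idx = bisect_left(timestamps_ms, target_ms)
    return idx if idx < len(timestamps_ms) else None


def offset_range(timestamps_ms: List[int], from_ms: int, to_ms: int) -> Tuple[int, int]:
    """Half-open offset range [start, end) covering from_ms <= timestamp < to_ms.

    When a lookup finds no record, the end of the log is used instead,
    so a window entirely in the future yields an empty range.
    """
    end_of_log = len(timestamps_ms)
    start = offset_for_time(timestamps_ms, from_ms)
    end = offset_for_time(timestamps_ms, to_ms)
    return (
        end_of_log if start is None else start,
        end_of_log if end is None else end,
    )


# One partition with five records at offsets 0..4.
partition = [1000, 2000, 2000, 3000, 4000]
print(offset_range(partition, 2000, 4000))  # -> (1, 4)
```

Only offsets 1 to 3 would have to be read here instead of the whole partition. The range is a superset of what `timestamp > $from_time` selects, because the lower bound is inclusive, so the original filter would still need to be applied to the rows that come back.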