Hello,

I'm trying to read data from Kafka via Spark Structured Streaming, within a
specific time range:

select count(*) from kafka_table
where timestamp > cast('2019-01-23 1:00' as TIMESTAMP)
  and timestamp < cast('2019-01-23 1:01' as TIMESTAMP);
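
For context, kafka_table is registered roughly like this (a spark-shell
sketch; the broker address is a placeholder):

// spark-shell sketch of how kafka_table is set up as a batch read;
// "broker:9092" is a placeholder for the real bootstrap servers.
spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "keeper.Ticket.avro.v1---production")
  .load()
  .createOrReplaceTempView("kafka_table")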


The problem is that the timestamp filter is not pushed down to Kafka, so Spark
tries to read the whole topic from the beginning.


Explain output:

....

+- *(1) Filter ((isnotnull(timestamp#57) && (timestamp#57 > 1535148000000000)) && (timestamp#57 < 1535234400000000))
   +- Scan KafkaRelation(strategy=Subscribe[keeper.Ticket.avro.v1---production], start=EarliestOffsetRangeLimit, end=LatestOffsetRangeLimit) [key#52,value#53,topic#54,partition#55,offset#56L,timestamp#57,timestampType#58] PushedFilters: [], ReadSchema: struct<key:binary,value:binary,topic:string,partition:int,offset:bigint,timestamp:timestamp,times...


Note the empty PushedFilters and the Earliest/LatestOffsetRangeLimit: the scan
covers the entire topic, so the query takes forever to complete. Is there a
solution to this?
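
The only workaround I can think of (an untested sketch, not a definitive fix;
the broker address is a placeholder and the topic name is taken from the plan
above) is to resolve the time range to offsets myself with the consumer API's
offsetsForTimes, and hand them to the batch reader as explicit
startingOffsets/endingOffsets so the scan is bounded up front:

// spark-shell sketch: translate the time window into per-partition offsets
// with KafkaConsumer.offsetsForTimes, then give Spark an explicit offset
// range so the batch read never scans the whole topic.
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val topic     = "keeper.Ticket.avro.v1---production"
val bootstrap = "broker:9092" // placeholder
val startMs   = java.sql.Timestamp.valueOf("2019-01-23 01:00:00").getTime
val endMs     = java.sql.Timestamp.valueOf("2019-01-23 01:01:00").getTime

// Spark's startingOffsets/endingOffsets options take JSON such as
// {"topic":{"0":23,"1":-1}}.
def toJson(offsets: Map[TopicPartition, Long]): String = {
  val inner = offsets.map { case (tp, o) => s""""${tp.partition}":$o""" }.mkString(",")
  s"""{"$topic":{$inner}}"""
}

val props = new Properties()
props.put("bootstrap.servers", bootstrap)
props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)

val partitions = consumer.partitionsFor(topic).asScala
  .map(p => new TopicPartition(topic, p.partition))
val logEnd = consumer.endOffsets(partitions.asJava).asScala

// For each partition, the earliest offset whose timestamp is >= ts;
// offsetsForTimes returns null when no message is that recent, in which
// case we fall back to the log end offset.
def offsetsAt(ts: Long): Map[TopicPartition, Long] = {
  val query = partitions.map(tp => tp -> java.lang.Long.valueOf(ts)).toMap.asJava
  consumer.offsetsForTimes(query).asScala.map { case (tp, oat) =>
    tp -> (if (oat != null) oat.offset else logEnd(tp).longValue)
  }.toMap
}

val startJson = toJson(offsetsAt(startMs))
val endJson   = toJson(offsetsAt(endMs))
consumer.close()

val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrap)
  .option("subscribe", topic)
  .option("startingOffsets", startJson)
  .option("endingOffsets", endJson)
  .load()
  // keep the timestamp filter as a safety net: the offsets bound the scan,
  // the filter trims any records outside the exact window
  .where("timestamp >= '2019-01-23 01:00:00' AND timestamp < '2019-01-23 01:01:00'")

println(df.count())

That still feels like something the Kafka source should do for me, though.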

I'm using Kafka and kafka-clients version 1.1.1.


BR,

Tomas
