tomasbartalos commented on issue #23749: [SPARK-26841][SQL] Kafka timestamp pushdown
URL: https://github.com/apache/spark/pull/23749#issuecomment-473835581

> IMO, we could adopt the more generic solution. If it's just timestamp based filtering of start and end offsets maybe the timestamp offset approach proposed in #23747 looks straightforward. However if there are more cases that the filter pushdown might be able to handle we should go with that. I am assuming the filter condition can appear anywhere in the query and get pushed down to filter the rows and if so looks more generic and avoids having to add extra options to the kafka source.
>
> I am not sure providing two different options for timestamp based filtering is necessary. If we support both, the user can provide different values via the options and the filter and it gets very confusing.

@arunmahadevan, IMHO, I would divide this into 2 use cases:
- **streaming** - specifying the offsets with a DS option as proposed in #23747 makes sense, because the start & end ranges are static and they don't change during the lifetime of the app
- **sql query** - it's more straightforward and flexible to put the conditions in the "where" part of the query. If the ts ranges are static, it's possible to create a view with a static timestamp filter and fully cover the #23747 functionality. However, for use cases with dynamic ts ranges this approach is more suitable. It's not necessary to recreate a DF or a table from scratch just because we need to filter on different ts ranges.

Our specific use case (from which this PR emerged) is to see the last 30 minutes of Kafka. With timestamp pushdown and a dynamic view:
```
CREATE OR REPLACE VIEW kafka_30_min AS
SELECT value, timestamp
FROM kafka_source
WHERE timestamp > cast(from_unixtime(unix_timestamp() - 30 * 60, "yyyy-MM-dd HH:mm:ss") AS TIMESTAMP);
```
I can repeatedly query the same view and always get the up-to-date last 30 minutes of Kafka.
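
To make the idea concrete, here is a minimal sketch of what pushing a `timestamp > cutoff` predicate down to the source amounts to: instead of reading a partition from the beginning and filtering every row, the reader first resolves the cutoff to a starting offset and scans only from there. This is an illustration only, not the code in this PR. The partition is modelled as a plain list of record timestamps in milliseconds (list index = offset), the timestamps are assumed to be non-decreasing (which is not guaranteed for producer-assigned timestamps), and the function names `start_offset_for` and `scan_with_pushdown` are hypothetical.

```python
from bisect import bisect_left

WINDOW_MS = 30 * 60 * 1000  # the "last 30 minutes" window from the view above


def start_offset_for(timestamps_ms, cutoff_ms):
    """Return the first offset whose timestamp is >= cutoff_ms, or None.

    Assumes timestamps_ms is sorted in non-decreasing order (hypothetical
    simplification), so a binary search can stand in for an index lookup.
    """
    idx = bisect_left(timestamps_ms, cutoff_ms)
    return idx if idx < len(timestamps_ms) else None


def scan_with_pushdown(timestamps_ms, now_ms, window_ms=WINDOW_MS):
    """Return (offset, timestamp) pairs with timestamp > now_ms - window_ms."""
    cutoff_ms = now_ms - window_ms
    start = start_offset_for(timestamps_ms, cutoff_ms)
    if start is None:
        return []  # nothing in the partition is recent enough
    # The lookup is inclusive (>=), so the strict predicate (>) is still
    # applied to the records that are actually read.
    return [
        (offset, ts)
        for offset, ts in enumerate(timestamps_ms[start:], start)
        if ts > cutoff_ms
    ]


# One record every 10 minutes, offsets 0..6.
MINUTE_MS = 60 * 1000
partition = [m * MINUTE_MS for m in (0, 10, 20, 30, 40, 50, 60)]
recent = scan_with_pushdown(partition, now_ms=60 * MINUTE_MS)
```

Because the cutoff is recomputed from `now_ms` on every call, repeated calls behave like the dynamic view: the same query definition always yields the most recent window without rebuilding anything.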
