HeartSaVioR commented on issue #25911: [SPARK-29223][SQL][SS] Enable global timestamp per topic while specifying offset by timestamp in Kafka source
URL: https://github.com/apache/spark/pull/25911#issuecomment-537226753

I would say it helps in any case, including when the number of partitions is small. The Kafka data source is used not only for streaming applications but also for batch queries, including ad-hoc queries. For ad-hoc queries, the requirement to know the number of partitions is a real burden, given that these queries are run not only by data engineers but also by data scientists. Some of them may not even (want to) know how many partitions a topic has. (We may need to think outside of the engineers' perspective.)

I wouldn't be concerned about the usability of start/end offsets, as I guess that feature isn't used much. (Who would want to memorize or calculate an offset per partition and replay from there? It should only be used for replaying in a specific situation: the query crashed and, unfortunately, the checkpoint was lost.)

Offsets by timestamp are a different matter. The feature lets end users run a batch query against a Kafka topic over a range of timestamps, which is exactly the case where they want to forget about partitions (say, have them abstracted away). To support this we are introducing fewer than 50 lines of complexity (excluding refactoring; I'm counting only the source side, not the test side), which doesn't seem significant. (My 2 cents: we should be concerned about added complexity, not the number of new lines.)

Btw, IMHO, #23749 is the ideal approach for dealing with this case (I think it's still valid), though it seems the community wanted to handle such a case as a "source option". This patch may lose one of its major use cases if #23749 is adopted later (though I still think it also helps streaming queries).
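To make the burden concrete, here is a minimal sketch of what a user has to assemble when timestamps must be given per partition. It assumes the JSON shape documented for the Kafka source's `startingOffsetsByTimestamp` option (topic name mapped to partition number mapped to timestamp in milliseconds). The helper name, the topic name `events`, and the partition count are hypothetical and only for illustration.

```python
import json


def offsets_by_timestamp_json(topic, num_partitions, timestamp_ms):
    """Hypothetical helper: repeat one timestamp for every partition of a topic.

    The caller must already know num_partitions, which is exactly the
    knowledge an ad-hoc user may not have.
    """
    per_partition = {str(p): timestamp_ms for p in range(num_partitions)}
    return json.dumps({topic: per_partition})


# Assumed values: a topic named "events" with 3 partitions.
start_spec = offsets_by_timestamp_json("events", 3, 1569888000000)
print(start_spec)
```

The resulting string would then be passed as the option value of a batch read. With a global timestamp per topic, the user would supply only the timestamp and would not need the partition count at all.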
