Github user koeninger commented on the issue:

https://github.com/apache/spark/pull/15102

> It would be nice to be able to do something other than earliest/latest.

That's what Assign and the starting-offset arguments to the Subscribe strategies are for. The implementation was already there.

> When specifying earliest, you end up with really big partitions.

Again, spark.streaming.kafka.maxRatePerPartition and the associated implementation were already there. If you don't want the coupling to time, it's pretty straightforward to avoid. The bigger question is when / if / how you're going to do backpressure.

> One question: is it a problem if two tasks are pulling from the same topic partition in parallel? Does this break the assumptions of our caching?

It breaks a fundamental assumption of Kafka (per-topic-partition ordering) and really shouldn't be done.

> I can do a final pass over the code, but do we think we are getting close to something that we can merge and iterate on?

I think we're in much better shape than when we started, but I still honestly think this implementation made a bunch of user-visible behavioral and configuration changes relative to the DStream that have nothing to do with the inherent differences between it and Structured Streaming. This isn't just me whining about "you changed my code"; it really is going to make the source harder to explain to people and harder to maintain.
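For context, the existing DStream-side mechanisms referred to above can be sketched roughly as follows. This is a minimal illustration against the Spark 2.x streaming-kafka-0-10 API; the broker address, topic name, group id, and offset values are all hypothetical placeholders, not anything from the PR itself.

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

// Cap per-partition throughput so starting from "earliest" doesn't
// yield enormous first batches (records per second per partition).
val conf = new SparkConf()
  .setAppName("kafka-offsets-sketch")
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",           // assumed broker
  "key.deserializer"  -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"          -> "example-group"             // illustrative
)

// Starting somewhere other than earliest/latest: Assign (and Subscribe)
// accept an explicit map of starting offsets per topic-partition.
val offsets = Map(new TopicPartition("events", 0) -> 42L) // hypothetical

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Assign[String, String](offsets.keys.toList, kafkaParams, offsets)
)
```

The same offsets map can be passed as the third argument to ConsumerStrategies.Subscribe, which is the "starting offset arguments to the Subscribe strategies" mentioned above.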