Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
> It would be nice to be able to do something other than earliest/latest.

That's what Assign and the starting offset arguments to the Subscribe strategies are for. The implementation was already there.
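To be concrete, here's a rough sketch of what I mean using the existing DStream consumer strategies (the topic name, partitions, offsets, and kafka params below are all made up for illustration):

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies

// Hypothetical kafka params for the example.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group"
)

// Start each partition at an explicit offset, not just earliest/latest.
val offsets = Map(
  new TopicPartition("events", 0) -> 12345L,
  new TopicPartition("events", 1) -> 67890L
)

// Assign pins the exact partitions and takes per-partition starting offsets;
// the Subscribe strategy has an overload that accepts the same offsets map.
val strategy =
  ConsumerStrategies.Assign[String, String](offsets.keys.toList, kafkaParams, offsets)
// or: ConsumerStrategies.Subscribe[String, String](List("events"), kafkaParams, offsets)
```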
> When specifying earliest, you end up with really big partitions.

Again, spark.streaming.kafka.maxRatePerPartition and the associated implementation were already there. If you don't want the coupling to time, it's pretty straightforward. The bigger question is when / if / how you're going to do backpressure.
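For reference, capping the batch size is just a conf setting today (the rate value below is an arbitrary example):

```scala
import org.apache.spark.SparkConf

// Cap records fetched per partition per second, so starting from "earliest"
// doesn't produce one enormous first batch. 1000 is an arbitrary example rate.
val conf = new SparkConf()
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
  // Optionally let the built-in backpressure estimator adjust the rate
  // dynamically after the first batch, still bounded by the cap above.
  .set("spark.streaming.backpressure.enabled", "true")
```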
> One question, is it a problem if two tasks are pulling from the same topic partition in parallel? Does this break the assumptions of our caching?

This breaks fundamental assumptions of Kafka (per-TopicPartition ordering) and really shouldn't be done.
> I can do a final pass over the code, but do we think we are getting close to something that we can merge and iterate on?

I think we're in much better shape than when we started, but I still honestly think this implementation made a bunch of user-visible behavioral and configuration changes from the DStream that really have nothing to do with the inherent differences between it and structured streaming. This isn't just me whining about "you changed my code"; it really is going to make it harder to explain to people and harder to maintain.