GitHub user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
* This already does depend on most of the existing Kafka DStream
implementation. The fact that most of it was copied wholesale proves that. If
you're just saying that you don't want it to have a transitive dependency on
the spark-streaming-X module, I can refactor that. I would much rather take
the time to do that work personally now than deal with the maintenance problems
later.
* Users are going to change topicpartitions whether you want them to or
not, and this PR ostensibly supports SubscribePattern, so it must handle
changing topicpartitions. Kafka does not have a global order (Kafka doesn't
even have a per-partition order that's guaranteed contiguous, see SPARK-17147,
but that's another can of worms). Given an offset from topicpartition A and an
offset from topicpartition B, it is impossible to determine ordering without
some other information. I brought some of these concerns up during discussion
of the structured streaming design document, and they were handwaved away with
"we'll just make structured offset an interface and try to figure it out
later." I believe my PR does the least bad thing possible given this interface,
in that it does allow changing partitions, does not violate the hashCode
contract, and does have a stable order (if we really want to deal with hashCode
collisions, we can sort and concat all topicpartitions or something). If you
see something else wrong there,
by all means let me know. Yes, if someone publishes data to a topicpartition
right before deleting it, the stream may not consume that data, but... really,
what did they expect to happen in that case? We can talk about whether it
might be possible to use time indexing in Kafka as an alternative ordering, but
what's there right now doesn't cut it.
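To make the ordering problem above concrete, here is a minimal Python sketch.
It is not code from this PR: `compare` and `stable_key` are hypothetical names,
and a batch's offsets are modeled as a plain dict from (topic, partition) to
offset. Under those assumptions, it shows that a component-wise comparison only
yields a partial order, and that sorting the topicpartitions gives a
deterministic key that can back equality and hashing.

```python
from typing import Dict, Optional, Tuple

# Hypothetical model: (topic, partition) -> next offset to read.
TopicPartition = Tuple[str, int]
Offsets = Dict[TopicPartition, int]


def compare(a: Offsets, b: Offsets) -> Optional[int]:
    """Component-wise comparison of two offset maps.

    Returns -1, 0, or 1 when every partition agrees on the direction,
    and None when the two maps cannot be ordered from offsets alone.
    """
    if a.keys() != b.keys():
        # Partitions were added or removed: offsets from different
        # topicpartitions carry no relative ordering information.
        return None
    # Sign of (a - b) per partition, ignoring partitions that are equal.
    signs = {(a[tp] > b[tp]) - (a[tp] < b[tp]) for tp in a}
    signs.discard(0)
    if not signs:
        return 0
    if len(signs) == 1:
        return signs.pop()
    # Some partitions are ahead and others behind: incomparable.
    return None


def stable_key(offsets: Offsets) -> Tuple[Tuple[TopicPartition, int], ...]:
    """Sort the entries so equal maps always produce the same hashable key."""
    return tuple(sorted(offsets.items()))


if __name__ == "__main__":
    first = {("events", 0): 5, ("events", 1): 7}
    later = {("events", 0): 6, ("events", 1): 7}
    crossed = {("events", 0): 4, ("events", 1): 9}
    grown = {("events", 0): 6, ("events", 1): 7, ("events", 2): 0}

    print(compare(first, later))    # same partitions, one moved forward
    print(compare(first, crossed))  # one ahead, one behind
    print(compare(first, grown))    # partition set changed
```

The last two comparisons return None, which is the case an offset interface
that demands a total order has to paper over with some other rule.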
My thoughts at this point:
* It's clear from the JIRA that this shouldn't get rushed into 2.0.1; let's do
this as right as possible given the circumstances.
* How can we collaborate on a shared branch? You guys manually copying
stuff from my fork doesn't make any sense.
* @marmbrus Can you give some specific technical direction as to how users
can communicate the type for key and value, without having to map over the
stream as is done in this PR?