[GitHub] spark issue #15102: [SPARK-17346][SQL] Add Kafka source for Structured Strea...

koeninger Thu, 15 Sep 2016 07:37:10 -0700

Github user koeninger commented on the issue:

    https://github.com/apache/spark/pull/15102
  
    * This PR already depends on most of the existing Kafka DStream 
implementation.  The fact that most of it was copied wholesale proves that.  If 
you're just saying that you don't want it to have a transitive dependency on 
the spark-streaming-X module, I can refactor that.  I would much rather take 
the time to do that work personally now than deal with the maintenance 
problems later.
    
    * Users are going to change topicpartitions whether you want them to or 
not, and this PR ostensibly supports SubscribePattern, so it must handle 
changing topicpartitions.  Kafka does not have a global order (Kafka doesn't 
even have a per-partition order that's guaranteed contiguous, see SPARK-17147, 
but that's another can of worms).  Given an offset from topicpartition A and an 
offset from topicpartition B, it is impossible to determine ordering without 
some other information.  I brought up some of these concerns during review of 
the structured streaming design document, and they were handwaved away with 
"we'll just make structured offset an interface and try to figure it out 
later."  I believe my PR does the least bad thing possible given this 
interface, in that it does allow changing partitions, does not violate the 
hashCode contract, and does have a stable order (if we really want to deal 
with hashcode collisions, we can sort and concat all topicpartitions or 
something).  If you see something else wrong there, by all means let me know.  
Yes, if someone publishes data to a topicpartition right before deleting it, 
the stream may not consume that data, but... really, what did they expect to 
happen in that case?  We can talk about whether it might be possible to use 
time indexing in Kafka as an alternative ordering, but what's there right now 
doesn't cut it.
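
    To illustrate the ordering problem described above, here is a minimal Java 
sketch.  It is not the code from this PR.  The class name `PartitionOffsets`, 
the `tryCompare` method, and the "topic-partition" string labels are all 
hypothetical, chosen only to keep the example free of Kafka and Spark 
dependencies.  It assumes an offset is modeled as a map from partition label to 
position, derives `equals` and `hashCode` from that same map so the two stay 
consistent, and compares only the partitions both sides share, returning an 
empty result when no ordering can be determined.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical illustration only; not the implementation from the PR.
final class PartitionOffsets {
    // Sorted map from a "topic-partition" label to an offset in that partition.
    private final TreeMap<String, Long> offsets;

    PartitionOffsets(Map<String, Long> offsets) {
        this.offsets = new TreeMap<>(offsets);
    }

    static PartitionOffsets of(String partition, long offset) {
        return new PartitionOffsets(new TreeMap<>()).with(partition, offset);
    }

    // Returns a copy with one more partition, leaving this instance unchanged.
    PartitionOffsets with(String partition, long offset) {
        TreeMap<String, Long> copy = new TreeMap<>(offsets);
        copy.put(partition, offset);
        return new PartitionOffsets(copy);
    }

    // equals and hashCode both derive from the same map, so equal objects
    // always produce equal hash codes.
    @Override
    public boolean equals(Object other) {
        return other instanceof PartitionOffsets
            && offsets.equals(((PartitionOffsets) other).offsets);
    }

    @Override
    public int hashCode() {
        return offsets.hashCode();
    }

    // Partial order over the partitions both sides share.  Returns -1, 0 or 1
    // when the shared partitions agree on a direction, and empty when there
    // are no shared partitions or when they disagree.
    Optional<Integer> tryCompare(PartitionOffsets that) {
        int direction = 0;
        boolean shared = false;
        for (Map.Entry<String, Long> entry : offsets.entrySet()) {
            Long other = that.offsets.get(entry.getKey());
            if (other == null) {
                continue; // partition exists on one side only; skip it
            }
            shared = true;
            int sign = Integer.signum(Long.compare(entry.getValue(), other));
            if (sign != 0) {
                if (direction != 0 && direction != sign) {
                    return Optional.empty(); // partitions moved in opposite directions
                }
                direction = sign;
            }
        }
        return shared ? Optional.of(direction) : Optional.empty();
    }
}
```

    With this sketch, comparing an offset that only covers "A-0" against one 
that only covers "B-0" yields an empty result, which is the "impossible to 
determine ordering without some other information" case.  Partitions that 
appear on only one side are skipped, which is one possible way to tolerate 
added or deleted topicpartitions; a real source would need a deliberate policy 
for them.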
    
    My thoughts at this point:
    
    * It's clear from the JIRA that this shouldn't get rushed into 2.0.1; 
let's do this as right as possible given the circumstances.
    * How can we collaborate on a shared branch?  You guys manually copying 
stuff from my fork doesn't make any sense.
    * @marmbrus Can you give some specific technical direction as to how users 
can communicate the type for key and value, without having to map over the 
stream as is done in this PR?



