Github user marmbrus commented on the issue:

    https://github.com/apache/spark/pull/15102
  
    > This already does depend on most of the existing Kafka DStream 
implementation....
    
    I pushed for this code to be copied rather than refactored because I think 
this is the right direction long term.  While it is nice to minimize 
inter-project dependencies, that is not really the motivation.  While the code 
is very similar now, there a bunch of things I'd like to start changing:
     - I don't think that all the classes need to be type parameterized.  Spark 
SQL has its own type system, analyzer, and interface to the JVM's type system 
(encoders).  We should be using that.  Operators in SQL are generally not type 
parameterized.
     - To optimize performance, there are several tricks we might want to play 
eventually (maybe prefetching data during execution, etc).
    
    These are just ideas, but given that DStreams and Structured Streaming have 
significantly different models and user interfaces, I don't think that we want 
to tie ourselves to the same internals.  If we identify utilities that are 
needed by both, then we should pull those out and share them.
    
    > Users are going to change topicpartitions whether you want them to or 
not...
    
    I think we need to take a step back here and look at what is actually 
required by the `Offset` interface.  We don't need to handle the general 
problem of is kafka `Offset A` from `Topic 1` before or after kafka `Offset B` 
from `Topic 2`.  We'll only ever look at full `KafkaOffsets` that represent 
slices in time across a set of topicpartitions and were returned by the 
`Source` (possibly with other annotations the Source wants to add).  The only 
questions that the `StreamExecution` will ask today are:
     - Does `x: KafkaOffset == y: KafkaOffset` (i.e. is there new data since 
the last time I checked)?
     - Give me the data between `KafkaOffset` x and `KafkaOffset` y for all 
included topicpartitions.
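    To make those two questions concrete, here's a hedged sketch of what such an 
offset might look like (the names `KafkaOffset` and `partitionOffsets` are 
illustrative, not necessarily what this PR ends up with):
    
    ```scala
    import org.apache.kafka.common.TopicPartition
    
    // Illustrative only: a snapshot of per-partition offsets at a point in time.
    case class KafkaOffset(partitionOffsets: Map[TopicPartition, Long])
    
    // Question 1: equality, i.e. "is there new data since the last check?"
    // Case-class equality over the Map gives us this for free.
    def hasNewData(x: KafkaOffset, y: KafkaOffset): Boolean = x != y
    
    // Question 2: "give me the data between x and y" sketched as the
    // per-partition offset ranges (start exclusive, end inclusive) to read.
    def offsetRanges(x: KafkaOffset, y: KafkaOffset): Map[TopicPartition, (Long, Long)] =
      y.partitionOffsets.map { case (tp, end) =>
        tp -> (x.partitionOffsets.getOrElse(tp, 0L), end)
      }
    ```
    
    The point is that the `Source` only ever compares and slices between offsets 
it produced itself, so there is no need for a total ordering across arbitrary 
topics.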
    
    The final version of this `Source` should almost certainly support 
wildcards and topicpartitions that change on the fly, since that seems to be 
one of the harder problems to solve.  As a strawman, I'd propose that we only 
support static lists of topics in this PR, and possibly even static partitions.  
I want to get users to kick the tires on structured streaming in general and 
report what's missing so we can all prioritize our engineering effort.
    
    > It's clear from the jira this shouldn't get rushed into 2.0.1, let's do 
this as right as possible given the circumstances.
    
    Agreed, if this doesn't make 2.0.1 that's fine with me.
    
    > How can we collaborate on a shared branch? You guys manually copying 
stuff from my fork doesn't make any sense.
    
    I typically open PRs against [the PR author's 
branch](https://github.com/zsxwing/spark/tree/kafka-source) when I want to 
collaborate more directly.
    
    > Can you give some specific technical direction as to how users can 
communicate the type for key and value, without having to map over the stream 
as is done in this PR?
    
    I'd like this to work the same as other calculations in SQL, using column 
expressions.
    
    ```scala
    df.withColumn("value", $"value".cast("string"))
    ```
    
    I'd also like to create expressions that work well with kafka specific 
deserializers, as well as integrations with our json parsing and maybe Avro 
too.  The nice thing about this path is it fits with other SQL APIs and should 
play nicely with Dataset `Encoder`s, UDFs, our expression library, etc.
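    As a sketch of what that could look like for a JSON-encoded value (assuming 
the source exposes binary `key`/`value` columns, and using the existing 
`get_json_object` SQL function as a stand-in for whatever kafka-specific 
expressions we build):
    
    ```scala
    import org.apache.spark.sql.functions._
    import spark.implicits._  // for the $"..." column syntax
    
    // Decode the raw binary columns and pull a field out of the JSON value,
    // all with column expressions instead of a typed map over the stream.
    val parsed = df
      .withColumn("key", $"key".cast("string"))
      .withColumn("value", $"value".cast("string"))
      .withColumn("userId", get_json_object($"value", "$.userId"))
    ```
    
    The field name `userId` is just a placeholder; the shape of the API is the 
point, not the specific functions.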
    
    Does that seem reasonable?  Is this missing important cases?
    


