[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

Tathagata Das (JIRA) Tue, 13 Sep 2016 11:59:24 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15488079#comment-15488079
 ]


Tathagata Das commented on SPARK-15406:
---------------------------------------

Here are some thoughts. 

- The key-value types should be communicated as constructor parameters of the 
Kafka Source object. This is how we do it in the FileStreamSource: we pass the 
schema through the constructor, and it is used to create the DF in getBatch - 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L39

- A custom consumer strategy implementation can be loaded by specifying the 
class name as a string and loading it by reflection, the same way Kafka loads 
serializers/deserializers, Parquet loads compression libraries, etc. This 
may not be the only solution, and we should definitely discuss this in the 
future. 

- I completely agree that asking the users to generate JSON strings themselves 
is not going to cut it. This needs a more focused discussion on this specific 
topic so that we can bounce around ideas, ranging from providing a simple 
map-to-JSON helper to adding extra methods to the DataStreamReader. 
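
As an illustration of the map-to-JSON helper idea in the last point, here is a 
minimal sketch. It is written in Python for brevity rather than in Spark's 
Scala, and it assumes the JSON in question describes per-partition offsets. 
The function name offsets_to_json and the nested 
{"topic": {"partition": offset}} layout are hypothetical, chosen only for 
this example, and are not taken from the design doc.

```python
import json


def offsets_to_json(offsets):
    """Convert {(topic, partition): offset} into a JSON string.

    Hypothetical helper: the nested {"topic": {"partition": offset}}
    layout is an assumption made for illustration only.
    """
    nested = {}
    for (topic, partition), offset in offsets.items():
        # JSON object keys must be strings, so the partition id is converted.
        nested.setdefault(topic, {})[str(partition)] = offset
    # sort_keys keeps the output deterministic regardless of input order.
    return json.dumps(nested, sort_keys=True)


example = offsets_to_json({("events", 1): 42, ("events", 0): 23, ("logs", 0): 0})
print(example)
```

With a helper along these lines the user works with an ordinary map and never 
writes the JSON string by hand.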

These are very good questions which do not have clear answers, and we should 
make specific follow-up JIRAs to discuss them with the community and 
experienced Kafka users like you. I am hoping this initial design doc puts 
the basic framework in place without pushing ourselves into a corner. And I 
think with the current design, we don't block any improvements in the future. 
Please let us know if you think we are doing so in any specific use case.
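
For reference, the idea of loading a custom consumer strategy from a class 
name given as a string can be sketched as follows. This is a Python analogue 
using importlib, not the JVM reflection that Spark or Kafka would actually 
use. The helper name load_class is hypothetical, and a standard-library class 
stands in for a user-supplied strategy.

```python
import importlib


def load_class(qualified_name):
    """Resolve a dotted "package.module.ClassName" string to a class object.

    Hypothetical helper for illustration; a JVM implementation would use
    reflection on the class name instead.
    """
    module_name, _, class_name = qualified_name.rpartition(".")
    if not module_name:
        raise ValueError("expected a fully qualified name: " + qualified_name)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# A standard-library class stands in for a user-provided strategy class.
strategy_cls = load_class("collections.OrderedDict")
strategy = strategy_cls()
```

The caller supplies only a string, for example through an option, so the 
source needs no compile-time dependency on the user's class.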


> Structured streaming support for consuming from Kafka
> -----------------------------------------------------
>
>                 Key: SPARK-15406
>                 URL: https://issues.apache.org/jira/browse/SPARK-15406
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> ================== Old description =========================
> Structured streaming doesn't have support for Kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to Kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

