[
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15488079#comment-15488079
]
Tathagata Das commented on SPARK-15406:
---------------------------------------
Here are some thoughts.
- The key-value types should be communicated as constructor parameters of the
Kafka Source object. This is how we do it in FileStreamSource: we pass the
schema through the constructor, and it is used to create the DF in getBatch -
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L39
- A custom consumer strategy implementation can be loaded by specifying the
class name as a string and loading it by reflection, the same way Kafka loads
serializers/deserializers, Parquet loads compression libraries, etc. This
may not be the only solution, and we should definitely discuss this in the
future.
- I completely agree that asking users to generate JSON strings themselves
is not going to cut it. That's why this needs more focused discussion on this
specific topic, so that we can bounce around ideas ranging from providing a
simple map-to-JSON helper to adding extra methods to the DataStreamReader.
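To make the second and third points concrete, here is a rough Java sketch of
both ideas: loading a strategy by class name via reflection, and a small
map-to-JSON helper. Everything named here (ConsumerStrategy,
FixedTopicsStrategy, StrategyLoader, JsonOptions, and the option keys) is
hypothetical and only for illustration; it is not the proposed API, just one
way it could look, assuming implementations expose a public no-arg
constructor and options are flat string-to-string maps.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical interface; the real consumer strategy API is still open.
interface ConsumerStrategy {
    String describe();
}

// Hypothetical user-supplied implementation with a public no-arg constructor.
class FixedTopicsStrategy implements ConsumerStrategy {
    public FixedTopicsStrategy() {}

    @Override
    public String describe() {
        return "fixed-topics";
    }
}

final class StrategyLoader {
    private StrategyLoader() {}

    // Instantiate a strategy from its class name via reflection.
    // Fails with IllegalArgumentException if the class is missing,
    // is not a ConsumerStrategy, or cannot be constructed.
    static ConsumerStrategy load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(ConsumerStrategy.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException(
                    "Cannot load consumer strategy: " + className, e);
        }
    }
}

final class JsonOptions {
    private JsonOptions() {}

    // Render a flat string-to-string map as a JSON object, in map order.
    static String toJson(Map<String, String> options) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : options.entrySet()) {
            json.add(quote(e.getKey()) + ":" + quote(e.getValue()));
        }
        return json.toString();
    }

    // Quote a string and escape the characters JSON requires.
    private static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }
}
```

With this, a user would only pass a class name string such as
FixedTopicsStrategy.class.getName() to StrategyLoader.load, and would build
options in a LinkedHashMap and call JsonOptions.toJson instead of
hand-writing the JSON string. Nested values (lists, per-partition offsets)
would need a richer helper or a real JSON library.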
These are very good questions that do not have clear answers, and we should
make specific follow-up JIRAs to discuss them with the community and
experienced Kafka users like you. I am hoping this initial design doc puts
the basic framework in place without pushing ourselves into a corner. And I
think with the current design, we don't block any improvements in the future.
Please let us know if you think we are doing so in any specific use case.
> Structured streaming support for consuming from Kafka
> -----------------------------------------------------
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
> Issue Type: New Feature
> Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for building a Kafka source
> for Structured Streaming. Here is the design doc for an initial version of
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> ================== Old description =========================
> Structured streaming doesn't have support for Kafka yet. I personally feel
> like time based indexing would make for a much better interface, but it's
> been pushed back to Kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)