Github user marmbrus commented on the issue:

    https://github.com/apache/spark/pull/15102
  
    I asked @koeninger to clarify the specific suggestions he is referring to 
above; here's my response:
    
    > [Comments here and on JIRA relating to concerns with the `Offset` 
implementation]
    
    I'd be happy to consider a PR (either against this branch now, or against
    master if this patch is merged) to use a different offset for `KafkaSource`,
    or even to change the offset API across all of structured streaming. A change
    that large, however, would need to be a separate PR and should not block this
    one unless there are correctness issues. If it is going to block merging
    initial Kafka support, I'd want to see test cases that show why the current
    implementation is wrong from an end-user perspective.
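
    To make the discussion concrete, here is a minimal, self-contained sketch of
    the idea behind a Kafka-specific offset: essentially a map from
    `TopicPartition` to the next offset to consume. The names `StreamOffset`,
    `KafkaSourceOffset`, and `partitionToOffsets` are illustrative assumptions,
    not the code in this PR.

    ```scala
    import org.apache.kafka.common.TopicPartition

    // Stand-in for structured streaming's Offset abstraction; illustrative only.
    trait StreamOffset extends Serializable

    // A Kafka offset is just a map from partition to the next record offset to read.
    case class KafkaSourceOffset(partitionToOffsets: Map[TopicPartition, Long])
        extends StreamOffset {
      override def toString: String =
        partitionToOffsets.toSeq
          .sortBy { case (tp, _) => (tp.topic, tp.partition) }
          .map { case (tp, off) => s"${tp.topic}-${tp.partition}: $off" }
          .mkString("[", ", ", "]")
    }
    ```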
     
    > [Comments here and on JIRA about using String -> String for configuration]
    
    I'd also be happy to consider PRs to improve the DataFrameReader/Writer
    interface. That would also need to be a separate PR. It would need to work
    well across languages and be backwards compatible (i.e., what happens when you
    are using an older data source with the new configuration system?).
    
    That said, we already have something that we've been using since Spark 1.3,
    even if it's not perfect. It works across languages, and I don't see a problem
    with supporting comma-separated strings for topic lists long term (Kafka does
    this today in its own configuration for things like `bootstrap.servers`). As a
    result, I do not think it is reasonable to suggest we block merging this patch
    on an overhaul of the DataSource API configuration system.
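
    For context, this is roughly what the existing String -> String option
    interface looks like from the user's side. The option names (`subscribe`,
    `kafka.bootstrap.servers`) are the ones proposed for this source and could
    still change; a `SparkSession` named `spark` is assumed to be in scope (e.g.
    in `spark-shell`).

    ```scala
    // Read a comma-separated list of topics through the string-based option API.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092,host2:9092")
      .option("subscribe", "topic1,topic2")   // comma-separated topic list
      .load()
    ```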
    
    I think @koeninger made a good suggestion to block accepting certain Kafka
    configurations. In particular, `auto.offset.reset` does not make sense in a
    world where we are managing the offsets ourselves (the Kafka documentation
    suggests you set this to false if you maintain offsets externally). Similarly,
    letting the user set the group id could cause the data to be split with
    another query, which could affect correctness. My only question here is
    whether we should have a whitelist or a blacklist. Scanning through the list
    of possible configuration options, I think it could go either way. I tend to
    err on the side of more power and go with the blacklist.
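
    As an illustration only (not what this PR implements), the blacklist approach
    could be as simple as rejecting the handful of consumer settings the source
    needs to control itself; the specific set and error message below are
    assumptions.

    ```scala
    // Kafka consumer settings the source would manage itself and therefore reject.
    val forbiddenKafkaParams = Set("group.id", "auto.offset.reset", "enable.auto.commit")

    def validateKafkaParams(params: Map[String, String]): Unit = {
      val bad = params.keySet.intersect(forbiddenKafkaParams)
      require(bad.isEmpty,
        s"Kafka option(s) ${bad.mkString(", ")} are not supported: " +
          "the source manages offsets and group membership itself.")
    }
    ```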
    
    I disagree that prefixing configuration that should be passed through to
    Kafka with `kafka.` is a hack. We do this in other places in the
    DataFrameReader/Writer API and have not had any complaints or confusion. I
    could be convinced otherwise if someone can point out a case where this would
    be confusing or ambiguous to a user.
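
    For reference, the prefix convention under discussion amounts to something
    like the sketch below: options starting with `kafka.` are stripped of the
    prefix and handed to the Kafka consumer, while everything else is interpreted
    by the source itself. The function name and exact split are illustrative
    assumptions.

    ```scala
    // Split user options into Kafka consumer params and source-specific options.
    def splitOptions(options: Map[String, String])
        : (Map[String, String], Map[String, String]) = {
      val (kafkaSpecific, sourceSpecific) =
        options.partition { case (key, _) => key.startsWith("kafka.") }
      val kafkaParams = kafkaSpecific.map { case (key, value) =>
        key.stripPrefix("kafka.") -> value   // "kafka.bootstrap.servers" -> "bootstrap.servers"
      }
      (kafkaParams, sourceSpecific)
    }
    ```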

