[GitHub] spark issue #11863: [SPARK-12177][Streaming][Kafka] Update KafkaDStreams to ...

koeninger Thu, 23 Jun 2016 11:50:18 -0700

Github user koeninger commented on the issue:

    https://github.com/apache/spark/pull/11863
  
    1.  When the user has no preferences, the system already does figure out
    preferred locations, and not in a random way as you claimed.
    
    2.  So let's talk concretely, not hypothetically.  If we publish an api
    where the constructor takes
    () => Consumer,
    and we provide two simple ways for users to get an instance of that type,
    e.g.
    constructorFactory(listOfTopics)
    and
    constructorFactory(fromOffsets)
    
    What is actually going to break when the Kafka project adds a new
    subscribeAccordingToThePhaseOfTheMoon(moons) method to Consumer?  The
    people using our simple factories go on about their business. The people
    who are creating a consumer themselves can use the phase of the moon if
    they want to, with a pretty minimal amount of change.
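
To make point 2 concrete, here is a minimal Scala sketch of that API shape. Everything in it is hypothetical: `Consumer` is a stand-in trait rather than the Kafka class, and `RecordingConsumer`, `DirectStream`, `fromTopics` and `fromOffsets` are illustrative names, not code from this pull request.

```scala
// Hypothetical stand-in for the Kafka consumer; NOT the real Kafka class.
trait Consumer {
  def subscribe(topics: Seq[String]): Unit
  def assign(fromOffsets: Map[(String, Int), Long]): Unit
  def subscription: String
}

// Toy implementation that only records how it was configured.
final class RecordingConsumer extends Consumer {
  private var state = "unconfigured"
  def subscribe(topics: Seq[String]): Unit = {
    state = s"subscribed:${topics.mkString(",")}"
  }
  def assign(fromOffsets: Map[(String, Int), Long]): Unit = {
    state = s"assigned:${fromOffsets.size}"
  }
  def subscription: String = state
}

// The stream depends only on a zero-argument function yielding a configured consumer.
final class DirectStream(consumerFactory: () => Consumer) {
  def start(): Consumer = consumerFactory()
}

object ConsumerFactories {
  // Simple path 1: subscribe to a fixed list of topics.
  def fromTopics(topics: Seq[String]): () => Consumer = () => {
    val c = new RecordingConsumer
    c.subscribe(topics)
    c
  }

  // Simple path 2: start from explicit (topic, partition) -> offset positions.
  def fromOffsets(offsets: Map[(String, Int), Long]): () => Consumer = () => {
    val c = new RecordingConsumer
    c.assign(offsets)
    c
  }
}
```

Under these assumptions, a new subscription method on the consumer would leave `DirectStream` and both simple factories untouched; only someone writing their own `() => Consumer` would opt in to it.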
    
    Non-hypothetically, the new Consumer already has a method for dynamic topic
    subscription, which addresses some long-standing issues with the way the
    0.8 consumer works.  Cutting people off of this because you're afraid of
    something breaking makes no sense.  If people want to use something they
    know is stable, with exactly the same features as the 0.8 connector....
    they can still use the 0.8 connector with 0.10 brokers.
    
    3.  Again, concretely not hypothetically.
    
    You're saying if only we had e.g. introduced a SparkWrappedMessage, and
    made the 0.8 consumer messageHandler be
    SparkWrappedMessage => R
    instead of using the kafka class
    MessageAndMetadata => R
    all of this api change wouldn't have been necessary.
    
    This is demonstrably false.  It would not have prevented api change.  The
    behavior of the underlying consumer _changed_.  It changed in such a way
    that we no longer have individual access to a message as it's being
    deserialized, because the consumer pre-fetches messages in blocks every
    time it finishes a poll.  No amount of wrapping and hiding changes that.
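
A toy Scala model of that point, under loud assumptions: `Record`, `SparkWrappedMessage` and `BatchingConsumer` are hypothetical stand-ins, not Kafka or Spark classes, and the batching is simulated with an in-memory vector. It only illustrates that a wrapper type is applied after a whole block has already been fetched and decoded.

```scala
object PrefetchSketch {
  // Hypothetical stand-in for an already-decoded record; not a Kafka class.
  final case class Record(offset: Long, value: String)

  // Hypothetical Spark-side wrapper of the kind discussed above.
  final case class SparkWrappedMessage(offset: Long, value: String)

  // Toy consumer: each poll hands back a block of already-decoded records.
  final class BatchingConsumer(log: Vector[Record], batchSize: Int) {
    private var position = 0
    def poll(): Vector[Record] = {
      val batch = log.slice(position, position + batchSize)
      position += batch.size
      batch
    }
  }

  // Wrapping can only happen once the block is fetched; it changes the type
  // the caller sees, not the point at which records were decoded.
  def pollWrapped(c: BatchingConsumer): Vector[SparkWrappedMessage] =
    c.poll().map(r => SparkWrappedMessage(r.offset, r.value))
}
```

In this model a per-message handler never runs while a record is being decoded, whichever class name appears in its signature.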
    
    I understand you've been burned on e.g. leaking classes from a myriad of
    3rd party dependencies in core spark.  But the very purpose of this
    standalone jar is to connect to kafka... the behavior allowed by the kafka
    classes isn't incidental leakage, it's the whole point.
    
    From my point of view, your stated goal is to minimize change.
    
    My stated goal is to make sure people can use Kafka and Spark to get their
    jobs done.
    
    I'm demonstrably willing to do the maintenance work to make this happen,
    even if things unavoidably change.  So are the other people who have worked
    on this ticket since December of last year.
    
    On Thu, Jun 23, 2016 at 12:39 PM, Tathagata Das <notificati...@github.com>
    wrote:
    
    >
    > 1. I didn't quite get what you meant by "But your description of what
    >    the code is currently doing is not accurate, and your recommendation
    >    does not meet the use cases." I just collapsed the three cases into
    >    two: when the user has NO PREFERENCES (the system SHOULD figure out
    >    how to schedule partitions on the same executors consistently), and
    >    SOME PREFERENCES (because of co-located brokers, or skew, or
    >    whatever). Why doesn't this recommendation meet the criteria?
    >
    > 2. I agree with the argument that there is a whole lot of stuff you
    >    cannot do without exposing a () => Consumer function. But that's
    >    where the question of API stability comes in. At this late stage of
    >    the 2.0 release, I would much rather provide a simpler API for
    >    simpler use cases that we know will not break, rather than an API
    >    that supports everything but is more prone to breaking if Kafka
    >    breaks its API. We can always start simple and then add more
    >    advanced interfaces in the future.
    >
    > 3. Wrapping things up with extra Spark classes and interfaces is a cost
    >    we have to pay in order to prevent API breakage in the future. It is
    >    an investment we are undertaking in every part of Spark: SparkSession
    >    (using a builder pattern instead of exposing a constructor), SQL Data
    >    sources (never expose any 3rd party classes), etc. It's a hard-learnt
    >    lesson.
    >



