Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/11863
Sorry for the delayed reply, I had travel plans that had to be canceled due
to a family emergency (everyone's mostly ok).
Regarding 1 and 2: I understand that preferred locations are not a guarantee. Caching
should be a performance issue, not a correctness issue. The cache is limited in
size, so consumers cached on the "wrong" hosts should eventually get phased out if
space becomes an issue. I can add more comments explaining the intention.
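To make the caching intent concrete, here's a minimal sketch of the kind of
size-bounded, least-recently-used consumer cache I mean. This is not the PR code;
ConsumerCacheSketch, CacheKey, the capacity, and the byte-array types are placeholders.

```scala
import java.{util => ju}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

object ConsumerCacheSketch {
  // Hypothetical cache key; the real key and capacity are implementation details.
  case class CacheKey(groupId: String, topicPartition: TopicPartition)

  private val maxCapacity = 64

  // LinkedHashMap in access order: once the cache is full, the least recently used
  // consumer is closed and evicted, so a consumer cached on the "wrong" host gets
  // phased out over time rather than causing a correctness problem.
  private val cache =
    new ju.LinkedHashMap[CacheKey, KafkaConsumer[Array[Byte], Array[Byte]]](
        maxCapacity, 0.75f, true) {
      override def removeEldestEntry(
          eldest: ju.Map.Entry[CacheKey, KafkaConsumer[Array[Byte], Array[Byte]]]): Boolean = {
        if (this.size > maxCapacity) {
          eldest.getValue.close()
          true
        } else {
          false
        }
      }
    }
}
```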
Regarding 3 and 4: I agree that there should be an easy way for people to just specify
a list of topics without knowing how the new Consumer works. I also agree that if
the convenience constructors live in KafkaUtils, the methods in the
companion objects for DirectKafkaInputDStream / KafkaRDD aren't necessary.
We can work out the specifics of the easy vs. advanced API, but as long as
there's a way for advanced users to get access to all of the Consumer behavior,
I'm on board. A few things I notice about the specifics of your suggestion:
Return type of DStream[(K, V)]: this can't just be a tuple of (key, value),
at least for advanced users, because there's additional per-message
metadata, timestamp being a big one. That's currently
ConsumerRecord[K, V]. If you need it to be a wrapped class, that's fine as
long as it has the same fields, but that's another object instantiation for
each message.
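To illustrate what a bare (key, value) tuple drops, a rough sketch against
ConsumerRecord. The WrappedRecord class below is hypothetical, just to show the
fields a wrapper would need and the extra allocation it implies.

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord

// Collapsing to a tuple drops topic, partition, offset and timestamp.
def toTuple[K, V](record: ConsumerRecord[K, V]): (K, V) =
  (record.key, record.value)

// Hypothetical wrapper shape (not a proposed API): same fields as ConsumerRecord,
// at the cost of one extra object instantiation per message.
case class WrappedRecord[K, V](
    topic: String,
    partition: Int,
    offset: Long,
    timestamp: Long,
    key: K,
    value: V)

def wrap[K, V](record: ConsumerRecord[K, V]): WrappedRecord[K, V] =
  WrappedRecord(
    record.topic, record.partition, record.offset, record.timestamp,
    record.key, record.value)
```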
TopicPartitionOffsets: I agree that reducing overloads would be good, and that
the old way we were using auto.offset.reset was a little weird, because the
simple consumer doesn't actually read that parameter. In this case, however,
the new consumer does read that parameter and does use it to determine the
initial starting point (in the absence of a seek to a specific offset).
Having more than one way to specify the same thing is probably going to be
more confusing, not less. I can probably come up with a similarly simple API,
but it may not look exactly like that.
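To show the distinction concretely, a sketch using the plain new-consumer API
outside of Spark. Broker address, group id, topics, and the offset are placeholders.

```scala
import java.{util => ju}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val params = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",  // placeholder broker
  "group.id" -> "example-group",            // placeholder group
  "key.deserializer" -> "org.apache.kafka.common.serialization.ByteArrayDeserializer",
  "value.deserializer" -> "org.apache.kafka.common.serialization.ByteArrayDeserializer",
  "auto.offset.reset" -> "earliest"         // only consulted when no position is set
)

// Easy case: subscribe to a topic list and let auto.offset.reset pick the start.
val subscribed = new KafkaConsumer[Array[Byte], Array[Byte]](params.asJava)
subscribed.subscribe(ju.Arrays.asList("topicA", "topicB"))

// Advanced case: assign specific partitions and seek to explicit offsets; for those
// partitions auto.offset.reset is never consulted. (Subscribe and assign are mutually
// exclusive on a single consumer, hence the separate instance.)
val assigned = new KafkaConsumer[Array[Byte], Array[Byte]](params.asJava)
val tp = new TopicPartition("topicA", 0)
assigned.assign(ju.Arrays.asList(tp))
assigned.seek(tp, 12345L)  // hypothetical starting offset
```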
So, to summarize:
- I'll start tweaking things
- Let me know if you think a wrapped class for ConsumerRecord is worth the
per-message overhead
- Let me know if you're 100% attached to TopicPartitionOffsets
On Fri, Jun 24, 2016 at 12:47 PM, Tathagata Das <[email protected]>
wrote:
> @koeninger <https://github.com/koeninger> Ping!
>