Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/11863
Sorry for the delayed reply, I had travel plans that had to be canceled due
to a family emergency (everyone's mostly ok).
Regarding 1 and 2: I understand that preferred locations are not a guarantee. Caching
should be a performance issue, not a correctness issue. The cache is limited in
size, so consumers cached on the "wrong" hosts should eventually get phased out if
space becomes an issue. I can add more comments explaining the intention.
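To make the caching intent concrete, here's a minimal sketch of the kind of
size-bounded, least-recently-used consumer cache I mean. This is not the PR code;
ConsumerCacheSketch, CacheKey, the capacity, and the byte-array types are placeholders.

```scala
import java.{util => ju}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

object ConsumerCacheSketch {
  // Hypothetical cache key; the real key and capacity are implementation details.
  case class CacheKey(groupId: String, topicPartition: TopicPartition)

  private val maxCapacity = 64

  // LinkedHashMap in access order: once the cache is full, the least recently used
  // consumer is closed and evicted, so a consumer cached on the "wrong" host gets
  // phased out over time rather than causing a correctness problem.
  private val cache =
    new ju.LinkedHashMap[CacheKey, KafkaConsumer[Array[Byte], Array[Byte]]](
        maxCapacity, 0.75f, true) {
      override def removeEldestEntry(
          eldest: ju.Map.Entry[CacheKey, KafkaConsumer[Array[Byte], Array[Byte]]]): Boolean = {
        if (this.size > maxCapacity) {
          eldest.getValue.close()
          true
        } else {
          false
        }
      }
    }
}
```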
Regarding 3 and 4: I agree that there should be an easy way for people to just specify
a list of topics without knowing how the new Consumer works. I also agree that if
the convenience constructors live in KafkaUtils, the methods in the
companion objects for DirectKafkaInputDStream / KafkaRDD aren't necessary.
We can work out the specifics of the easy vs. advanced API, but as long as
there's a way for advanced users to get access to all of the Consumer behavior,
I'm on board. A few things I notice about the specifics of your suggestion:
Return type of DStream[(K, V)]: this can't just be a tuple of (key, value),
at least for advanced users, because there's additional per-message
metadata, timestamp being a big one. That's currently
ConsumerRecord[K, V]. If you need it to be a wrapped class, that's fine as
long as it has the same fields, but that's another object instantiation for
each message.
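To illustrate what a bare (key, value) tuple drops, a rough sketch against
ConsumerRecord. The WrappedRecord class below is hypothetical, just to show the
fields a wrapper would need and the extra allocation it implies.

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord

// Collapsing to a tuple drops topic, partition, offset and timestamp.
def toTuple[K, V](record: ConsumerRecord[K, V]): (K, V) =
  (record.key, record.value)

// Hypothetical wrapper shape (not a proposed API): same fields as ConsumerRecord,
// at the cost of one extra object instantiation per message.
case class WrappedRecord[K, V](
    topic: String,
    partition: Int,
    offset: Long,
    timestamp: Long,
    key: K,
    value: V)

def wrap[K, V](record: ConsumerRecord[K, V]): WrappedRecord[K, V] =
  WrappedRecord(
    record.topic, record.partition, record.offset, record.timestamp,
    record.key, record.value)
```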
TopicPartitionOffsets: I agree that reducing overloads would be good, and that
the old way we were using auto.offset.reset was a little weird, because the
simple consumer doesn't actually read that parameter. In this case, however,
the new consumer does read that parameter and does use it to determine the
initial starting point (in the absence of a seek to a specific offset).
Having more than one way to specify the same thing is probably going to be
more confusing, not less. I can probably come up with a similarly simple API,
but it may not look exactly like that.
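To show the distinction concretely, a sketch using the plain new-consumer API
outside of Spark. Broker address, group id, topics, and the offset are placeholders.

```scala
import java.{util => ju}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val params = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",  // placeholder broker
  "group.id" -> "example-group",            // placeholder group
  "key.deserializer" -> "org.apache.kafka.common.serialization.ByteArrayDeserializer",
  "value.deserializer" -> "org.apache.kafka.common.serialization.ByteArrayDeserializer",
  "auto.offset.reset" -> "earliest"         // only consulted when no position is set
)

// Easy case: subscribe to a topic list and let auto.offset.reset pick the start.
val subscribed = new KafkaConsumer[Array[Byte], Array[Byte]](params.asJava)
subscribed.subscribe(ju.Arrays.asList("topicA", "topicB"))

// Advanced case: assign specific partitions and seek to explicit offsets; for those
// partitions auto.offset.reset is never consulted. (Subscribe and assign are mutually
// exclusive on a single consumer, hence the separate instance.)
val assigned = new KafkaConsumer[Array[Byte], Array[Byte]](params.asJava)
val tp = new TopicPartition("topicA", 0)
assigned.assign(ju.Arrays.asList(tp))
assigned.seek(tp, 12345L)  // hypothetical starting offset
```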
So, to summarize:
- I'll start tweaking things
- Let me know if you think a wrapped class for ConsumerRecord is worth the
per-message overhead
- Let me know if you're 100% attached to TopicPartitionOffsets
On Fri, Jun 24, 2016 at 12:47 PM, Tathagata Das <[email protected]>
wrote:
> @koeninger <https://github.com/koeninger> Ping!
>