The challenge of making these internal classes public (even with a
Developer API tag) is that it prevents us from making non-trivial changes
without breaking API compatibility for everyone who has subclassed them.
It's a tradeoff that is hard to optimize. That's why we favor exposing
more optional parameters in the stable API (KafkaUtils): that lets us
maintain binary compatibility with user code while still being free to
make non-trivial changes internally.

That said, it may be worthwhile to actually take an optional compute
function as a parameter through KafkaUtils, as Cody suggested ((Time,
current offsets, kafka metadata, etc) => Option[KafkaRDD]). It is worth
thinking through the implications for driver restarts, though, since that
function will get called again on restart, and a return value different
from the one before the restart could break the semantics.
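
For illustration, such an overload could look roughly like the sketch
below. This is purely hypothetical: the computeFunc parameter and this
signature are invented for the sketch, not an existing KafkaUtils API,
and the KafkaRDD type parameters are simplified.

    import kafka.common.TopicAndPartition
    import org.apache.spark.streaming.{StreamingContext, Time}
    import org.apache.spark.streaming.dstream.InputDStream

    // Hypothetical KafkaUtils overload (does not exist today). The hook
    // runs on the driver once per batch, before offsets are finalized;
    // returning None would mean "no RDD for this batch".
    def createDirectStream[K, V](
        ssc: StreamingContext,
        kafkaParams: Map[String, String],
        fromOffsets: Map[TopicAndPartition, Long],
        computeFunc: (Time, Map[TopicAndPartition, Long]) =>
          Option[KafkaRDD[K, V]]  // KafkaRDD type params simplified
    ): InputDStream[(K, V)] = ???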

TD

On Wed, Apr 1, 2015 at 12:28 PM, Neelesh <neele...@gmail.com> wrote:

> +1 for subclassing. It's more flexible if we can subclass the
> implementation classes.
>  On Apr 1, 2015 12:19 PM, "Cody Koeninger" <c...@koeninger.org> wrote:
>
>> As I said in the original ticket, I think the implementation classes
>> should be exposed so that people can subclass and override compute() to
>> suit their needs.
>>
>> Just adding a function from Time => Set[TopicAndPartition] wouldn't be
>> sufficient for some of my current production use cases.
>>
>> compute() isn't really a function from Time => Option[KafkaRDD]; it's a
>> function from (Time, current offsets, kafka metadata, etc) =>
>> Option[KafkaRDD]
>>
>> I think it's more straightforward to give access to that additional state
>> via subclassing than it is to add in more callbacks for every possible use
>> case.
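>>
>> To make that concrete, here is a rough sketch of the kind of subclass I
>> mean, assuming the implementation classes were public
>> (DirectKafkaInputDStream is private[streaming] today, so this will not
>> compile against stock Spark, and shouldSkip is a made-up helper):
>>
>>     // Hypothetical: assumes a public DirectKafkaInputDStream / KafkaRDD.
>>     class FilteredDirectKafkaInputDStream[K, V](/* same ctor args */)
>>         extends DirectKafkaInputDStream[K, V](/* ... */) {
>>       override def compute(validTime: Time): Option[KafkaRDD[K, V]] = {
>>         // currentOffsets and broker metadata are in scope here, so a
>>         // batch can be skipped or reshaped based on that state.
>>         if (shouldSkip(validTime, currentOffsets)) None
>>         else super.compute(validTime)
>>       }
>>     }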
>>
>> On Wed, Apr 1, 2015 at 2:01 PM, Tathagata Das <t...@databricks.com>
>> wrote:
>>
>>> We should be able to support that use case in the direct API. It may be
>>> as simple as allowing users to pass in a function that returns the set
>>> of topic+partitions to read from, i.e. a function (Time) =>
>>> Set[TopicAndPartition]. It would get called every batch interval,
>>> before the offsets are decided. This would allow users to add topics,
>>> delete topics, and modify partitions on the fly.
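>>>
>>> As a rough sketch of how such a hook might be used (the hook itself is
>>> only a proposal here, and lookupActiveTopics / fetchPartitions are
>>> made-up helpers standing in for whatever external source of truth you
>>> would consult):
>>>
>>>     import kafka.common.TopicAndPartition
>>>     import org.apache.spark.streaming.Time
>>>
>>>     // Re-evaluated once per batch, before offsets are decided.
>>>     val topicsForBatch: Time => Set[TopicAndPartition] = { time =>
>>>       lookupActiveTopics(time).flatMap { topic =>
>>>         fetchPartitions(topic).map(p => TopicAndPartition(topic, p))
>>>       }.toSet
>>>     }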
>>>
>>> What do you think Cody?
>>>
>>> On Wed, Apr 1, 2015 at 11:57 AM, Neelesh <neele...@gmail.com> wrote:
>>>
>>>> Thanks Cody!
>>>>
>>>> On Wed, Apr 1, 2015 at 11:21 AM, Cody Koeninger <c...@koeninger.org>
>>>> wrote:
>>>>
>>>>> If you want to change topics from batch to batch, you can always just
>>>>> create a KafkaRDD repeatedly.
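>>>>>
>>>>> For example, something along these lines once per batch, using the
>>>>> existing KafkaUtils.createRDD (the topic name and offsets below are
>>>>> placeholders; in practice you would track the offset ranges yourself,
>>>>> and sc / kafkaParams are assumed to already exist):
>>>>>
>>>>>     import kafka.serializer.StringDecoder
>>>>>     import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}
>>>>>
>>>>>     // One batch worth of data for whatever topics you want right now;
>>>>>     // OffsetRange is (topic, partition, fromOffset, untilOffset).
>>>>>     val offsetRanges = Array(OffsetRange("some-topic", 0, 0L, 100L))
>>>>>     val rdd = KafkaUtils.createRDD[String, String, StringDecoder,
>>>>>       StringDecoder](sc, kafkaParams, offsetRanges)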
>>>>>
>>>>> The streaming code as it stands assumes a consistent set of topics,
>>>>> though. The implementation is private, so you can't subclass it
>>>>> without building your own Spark.
>>>>>
>>>>> On Wed, Apr 1, 2015 at 1:09 PM, Neelesh <neele...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Cody, that was really helpful. I have a much better
>>>>>> understanding now. One last question: Kafka topics are initialized
>>>>>> once in the driver; is there an easy way of adding/removing topics
>>>>>> on the fly? KafkaRDD#getPartitions() seems to be computed only once,
>>>>>> with no way of refreshing it.
>>>>>>
>>>>>> Thanks again!
>>>>>>
>>>>>> On Wed, Apr 1, 2015 at 10:01 AM, Cody Koeninger <c...@koeninger.org>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>>>>>>>
>>>>>>> The kafka consumers run in the executors.
>>>>>>>
>>>>>>> On Wed, Apr 1, 2015 at 11:18 AM, Neelesh <neele...@gmail.com> wrote:
>>>>>>>
>>>>>>>> With receivers, it was pretty obvious which code ran where: each
>>>>>>>> receiver occupied a core and ran on the workers. However, with the
>>>>>>>> new Kafka direct input streams, it's hard for me to understand
>>>>>>>> where the code that reads from the Kafka brokers runs. Does it run
>>>>>>>> on the driver (I hope not), or does it run on the workers?
>>>>>>>>
>>>>>>>> Any help appreciated
>>>>>>>> thanks!
>>>>>>>> -neelesh
