Also, can you find from the Spark UI the breakdown of the stages in each batch's jobs, and identify which stage starts taking more time after a while?
On Fri, Jun 19, 2015 at 4:51 PM, Cody Koeninger <c...@koeninger.org> wrote:

> When you say your old version was
>
> k = createStream .....
>
> were you manually creating multiple receivers? Because otherwise you're only using one receiver on one executor...
>
> If that's the case, I'd try the direct stream without the repartitioning.
>
> On Fri, Jun 19, 2015 at 6:43 PM, Tim Smith <secs...@gmail.com> wrote:
>
>> Essentially, I went from:
>>
>> k = createStream .....
>> val dataout = k.map(x=>myFunc(x._2,someParams))
>> dataout.foreachRDD ( rdd => rdd.foreachPartition(rec => {
>> myOutputFunc.write(rec) }))
>>
>> to:
>>
>> kIn = createDirectStream .....
>> k = kIn.repartition(numberOfExecutors) //since #kafka partitions < #spark-executors
>> val dataout = k.map(x=>myFunc(x._2,someParams))
>> dataout.foreachRDD ( rdd => rdd.foreachPartition(rec => {
>> myOutputFunc.write(rec) }))
>>
>> With the new API, the app starts up and works fine for a while, but then seems to deteriorate. With the existing "createStream" API, the app also deteriorates, but over a much longer period: hours vs. days.
>>
>> On Fri, Jun 19, 2015 at 1:40 PM, Tathagata Das <t...@databricks.com> wrote:
>>
>>> Yes, please tell us what operation you are using.
>>>
>>> TD
>>>
>>> On Fri, Jun 19, 2015 at 11:42 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>
>>>> Is there any more info you can provide / relevant code?
>>>>
>>>> On Fri, Jun 19, 2015 at 1:23 PM, Tim Smith <secs...@gmail.com> wrote:
>>>>
>>>>> An update on performance of the new API: the new code using the createDirectStream API ran overnight, and when I checked the app state in the morning, there were massive scheduling delays :(
>>>>>
>>>>> Not sure why; I haven't investigated a whole lot. For now, I've switched back to the createStream build of my app. Yes, for the record, this is with CDH 5.4.1 and Spark 1.3.
>>>>>
>>>>> On Thu, Jun 18, 2015 at 7:05 PM, Tim Smith <secs...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for the super-fast response, TD :)
>>>>>>
>>>>>> I will now go bug my Hadoop vendor to upgrade from 1.3 to 1.4. Cloudera, are you listening? :D
>>>>>>
>>>>>> On Thu, Jun 18, 2015 at 7:02 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>>>>>
>>>>>>> Are you using Spark 1.3.x? That explains it. This issue has been fixed in Spark 1.4.0. Bonus: you get a fancy new streaming UI with more awesome stats. :)
>>>>>>>
>>>>>>> On Thu, Jun 18, 2015 at 7:01 PM, Tim Smith <secs...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I just switched from the "createStream" to the "createDirectStream" API for Kafka, and while things otherwise seem happy, the first thing I noticed is that stream/receiver stats are gone from the Spark UI :( Those stats were very handy for keeping an eye on the health of the app.
>>>>>>>>
>>>>>>>> What's the best way to re-create those in the Spark UI? Maintain accumulators? It would be really nice to get back receiver-like stats, even though I understand that "createDirectStream" is a receiver-less design.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Tim
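For reference, on Tim's original question about re-creating receiver-like stats for the receiver-less direct stream: one option is named accumulators, which the Spark 1.3-era UI displays on each stage's page. The sketch below is only a guess at how the pieces from this thread fit together; the batch interval, broker list, topic name, and the myFunc/myOutputFunc stubs are all placeholders, not the actual app.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object DirectStreamStats {
      // Placeholder stubs standing in for the real myFunc/myOutputFunc/someParams.
      def myFunc(msg: String, params: Int): String = msg
      object myOutputFunc { def write(rec: String): Unit = () }
      val someParams = 0

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("direct-stream-stats")
        val ssc = new StreamingContext(conf, Seconds(10)) // placeholder interval

        // Named accumulators show up on the Spark UI's stage pages, which
        // partially substitutes for the missing receiver stats. Caveat:
        // updates made inside transformations (the map below) can be
        // double-counted if a task is retried.
        val recordsIn = ssc.sparkContext.accumulator(0L, "Kafka records received")
        val recordsOut = ssc.sparkContext.accumulator(0L, "Records written")

        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder
        val kIn = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("mytopic")) // placeholder topic

        val dataout = kIn.map { x =>
          recordsIn += 1L // counted on the executors, summed on the driver
          myFunc(x._2, someParams)
        }

        dataout.foreachRDD { rdd =>
          rdd.foreachPartition { recs =>
            recs.foreach { rec =>
              myOutputFunc.write(rec)
              recordsOut += 1L
            }
          }
          // Driver-side per-batch log line; crude, but enough to watch trends.
          println(s"records in=${recordsIn.value} out=${recordsOut.value}")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }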
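Cody's question about "manually creating multiple receivers" refers to the standard workaround for the receiver-based API: one createStream call gives exactly one receiver on one executor, so ingest is parallelized by creating several streams and unioning them. A fragment, assuming a StreamingContext ssc as above, with the ZooKeeper quorum, group id, and topic as placeholders:

    // One createStream call = one receiver on one executor; create several
    // and union them to spread ingest across the cluster.
    val numReceivers = 4
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181,zk2:2181", "my-consumer-group",
        Map("mytopic" -> 1)) // placeholder quorum/group/topic
    }
    val k = ssc.union(kafkaStreams)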
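And Cody's suggestion to try the direct stream without the repartitioning amounts to dropping the repartition(numberOfExecutors) step: createDirectStream produces one Spark partition per Kafka partition, and repartition shuffles every batch, which is itself a candidate cause of growing scheduling delay. A fragment under the same placeholder assumptions (ssc, kafkaParams, myFunc, myOutputFunc as before):

    // Direct stream: one Spark partition per Kafka partition, no receiver.
    // Without repartition there is no per-batch shuffle; parallelism is then
    // capped by the Kafka partition count, so adding Kafka partitions may be
    // preferable to repartitioning inside Spark.
    val k = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic")) // placeholder topic
    val dataout = k.map(x => myFunc(x._2, someParams))
    dataout.foreachRDD { rdd =>
      rdd.foreachPartition(recs => recs.foreach(myOutputFunc.write))
    }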