A related issue: when I put multiple topics in a single stream, the
processing delay of each batch is as bad as its slowest task. Even though
the topics are unrelated to each other, the RDD at time "t1" has to wait
until the RDD at "t0" is fully executed, even if most cores are idling
because just one task is still running and the rest have completed.
Effectively, a lightly loaded topic gets the worst deal because of a
heavily loaded topic.

Is my understanding correct?

On Thu, Dec 17, 2015 at 9:53 AM, Cody Koeninger <c...@koeninger.org> wrote:

> You could stick them all in a single stream, do mapPartitions, and then
> switch on the topic for that partition.  It's probably cleaner to do
> separate jobs; it just depends on how you want to organize your code.
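>
> Roughly like this (an untested sketch against the Spark 1.x Kafka direct
> API; the broker address, topic names, and per-topic transforms are all
> placeholders):
>
>   import kafka.serializer.StringDecoder
>   import org.apache.spark.SparkConf
>   import org.apache.spark.streaming.{Seconds, StreamingContext}
>   import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}
>
>   val ssc = new StreamingContext(
>     new SparkConf().setAppName("multi-topic"), Seconds(10))
>   val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
>
>   // One direct stream over all the topics; each RDD partition maps 1:1
>   // to a Kafka topic-partition, so the topic is recoverable per partition.
>   val stream = KafkaUtils.createDirectStream[
>     String, String, StringDecoder, StringDecoder](
>     ssc, kafkaParams, Set("topicA", "topicB", "topicC"))
>
>   stream.foreachRDD { rdd =>
>     // Capture the offset ranges on the driver, before any shuffle.
>     val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>     rdd.mapPartitionsWithIndex { (i, iter) =>
>       // Switch on the topic this partition came from.
>       offsetRanges(i).topic match {
>         case "topicA" => iter.map { case (_, v) => v.toUpperCase } // placeholder
>         case "topicB" => iter.map { case (_, v) => v.reverse }     // placeholder
>         case _        => iter.map { case (_, v) => v }
>       }
>     }.count() // any action, just to trigger the job
>   }
>
>   ssc.start()
>   ssc.awaitTermination()
>
> The offsetRanges lookup by partition index is what ties each partition
> back to its topic, so it has to be captured before any shuffle, as above.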
>
> On Thu, Dec 17, 2015 at 11:11 AM, Jean-Pierre OCALAN <jpoca...@gmail.com>
> wrote:
>
>> Hi Cody,
>>
>> First of all, thanks for the note about spark.streaming.concurrentJobs; I
>> guess that's why it's not mentioned in the official Spark Streaming docs.
>> Since those 3 topics contain completely different data, on which I need to
>> apply different kinds of transformations, I am not sure joining them would
>> be really efficient, unless you know something that I don't.
>>
>> As I really don't need any interaction between those streams, I think I
>> might end up running 3 different streaming apps instead of one.
>>
>> Thanks again!
>>
>> On Thu, Dec 17, 2015 at 11:43 AM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> Using spark.streaming.concurrentJobs for this probably isn't a good
>>> idea, as it allows the next batch to start processing before the current
>>> one is finished, which may have unintended consequences.
>>>
>>> Why can't you use a single stream with all the topics you care about, or
>>> multiple streams if you're e.g. joining them?
>>>
>>>
>>>
>>> On Wed, Dec 16, 2015 at 3:00 PM, jpocalan <jpoca...@gmail.com> wrote:
>>>
>>>> Never mind, I found the answer to my questions.
>>>> The following Spark configuration property will allow you to process
>>>> multiple KafkaDirectStreams in parallel:
>>>> --conf spark.streaming.concurrentJobs=<something greater than 1>
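>>>>
>>>> For example, at submit time (the class and jar names here are just
>>>> placeholders):
>>>>
>>>>   spark-submit --conf spark.streaming.concurrentJobs=4 \
>>>>     --class com.example.MyStreamingApp my-streaming-app.jar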
>>>>
>>>>
>>>
>>
>>
>> --
>> jean-pierre ocalan
>> jpoca...@gmail.com
>>
>
>
