Re: Kafka - streaming from multiple topics

Cody Koeninger Mon, 21 Dec 2015 09:39:53 -0800

Different spark-submit per topic.

On Mon, Dec 21, 2015 at 11:36 AM, Neelesh <neele...@gmail.com> wrote:


> Thanks Cody.  My case is #2. Just wanted to confirm when you say different
> spark jobs, do you mean one  spark-submit per topic, or just use different
> threads in the driver to submit the job?
>
> Thanks!
>
> On Mon, Dec 21, 2015 at 8:05 AM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> Spark streaming by default wont start the next batch until the current
>> batch is completely done, even if only a few cores are still working.  This
>> is generally a good thing, otherwise you'd have weird ordering issues.
>>
>> Each topicpartition is separate.
>>
>> Unbalanced partitions can happen either because 1. your hash while
>> producing into kafka is bad (e.g. hash on customer id and 90% of your
>> traffic is from one customer);  or 2. because you're doing a map over
>> several topics with wildly different message rates.  If you're having the
>> first problem, you just need to fix it.  If you're having the second
>> problem, use different spark jobs for different topics.
>>
>> On Sun, Dec 20, 2015 at 2:28 PM, Neelesh <neele...@gmail.com> wrote:
>>
>>> @Chris,
>>>     There is a 1-1 mapping b/w spark partitions & kafka partitions out
>>> of the box . One can break it by repartitioning of course and add more
>>> parallelism, but that has its own issues around consumer offset management-
>>>  when do I commit the offsets, for example. While its trivial to increase
>>> the number of Kafka partitions to achieve higher parallelism, it is
>>> inherently static and cannot solve the problem of traffic bursts.  If it
>>> were possible to repartition data from a single kafka partition and still
>>> be able to handle consumer offset management when all child partitions that
>>> make up the kafka partition succeeded/failed, it would solve the problem.
>>>
>>> @Cody, would love to hear your thoughts around this.
>>>
>>> On Sun, Dec 20, 2015 at 8:37 AM, Chris Fregly <ch...@fregly.com> wrote:
>>>
>>>> separating out your code into separate streaming jobs - especially when
>>>> there are no dependencies between the jobs - is almost always the best
>>>> route.  it's easier to combine atoms (fusion), then split them (fission).
>>>>
>>>> I recommend splitting out jobs along batch window, stream window, and
>>>> state-tracking characteristics.
>>>>
>>>> for example, imagine 3 separate jobs for the following:
>>>>
>>>> 1) light storing of raw data into Cassandra (500ms batch interval)
>>>> 2) medium aggregations/window roll ups (2000ms batch interval)
>>>> 3) heavy training a ML model (10000ms batch interval).
>>>>
>>>> and reminder that you can control (isolate or combine) the spark
>>>> resources used by these separate, single purpose streaming jobs using
>>>> scheduler pools just like your batch spark jobs.
>>>>
>>>> @cody:  curious about neelesh's question, as well.  does the Kafka
>>>> Direct Stream API treat each Kafka Topic Partition separately in terms of
>>>> parallel retrieval?
>>>>
>>>> more context:  within a Kafka Topic partition, Kafka guarantees order,
>>>> but not total ordering across partitions.  this is normal and expected.
>>>>
>>>> so I assume the the Kafka Direct Streaming connector can retrieve (and
>>>> recover/retry) from separate partitions in parallel and still maintain the
>>>> ordering guarantees offered by Kafka.
>>>>
>>>> if this is true, then I'd suggest @neelesh create more partitions
>>>> within the Kafka Topic to improve parallelism - same as any distributed,
>>>> partitioned data processing engine including spark.
>>>>
>>>> if this is not true, is there a technical limitation to prevent this
>>>> parallelism within the connector?
>>>>
>>>> On Dec 19, 2015, at 5:51 PM, Neelesh <neele...@gmail.com> wrote:
>>>>
>>>> A related issue -  When I put multiple topics in a single stream, the
>>>> processing delay is as bad as the slowest task in the number of tasks
>>>> created. Even though the topics are unrelated to each other, RDD at time
>>>> "t1" has to wait for the RDD at "t0"  is fully executed,  even if most
>>>> cores are idling, and  just one task is still running and the rest of them
>>>> have completed. Effectively, a lightly loaded topic gets the worst deal
>>>> because of a heavily loaded topic
>>>>
>>>> Is my understanding correct?
>>>>
>>>>
>>>>
>>>> On Thu, Dec 17, 2015 at 9:53 AM, Cody Koeninger <c...@koeninger.org>
>>>> wrote:
>>>>
>>>>> You could stick them all in a single stream, and do mapPartitions,
>>>>> then switch on the topic for that partition.  It's probably cleaner to do
>>>>> separate jobs, just depends on how you want to organize your code.
>>>>>
>>>>> On Thu, Dec 17, 2015 at 11:11 AM, Jean-Pierre OCALAN <
>>>>> jpoca...@gmail.com> wrote:
>>>>>
>>>>>> Hi Cody,
>>>>>>
>>>>>> First of all thanks for the note about
>>>>>> spark.streaming.concurrentJobs. I guess this is why it's not mentioned in
>>>>>> the actual spark streaming doc.
>>>>>> Since those 3 topics contain completely different data on which I
>>>>>> need to apply different kind of transformations, I am not sure joining 
>>>>>> them
>>>>>> would be really efficient, unless you know something that I don't.
>>>>>>
>>>>>> As I really don't need any interaction between those streams, I think
>>>>>> I might end up running 3 different streaming apps instead of one.
>>>>>>
>>>>>> Thanks again!
>>>>>>
>>>>>> On Thu, Dec 17, 2015 at 11:43 AM, Cody Koeninger <c...@koeninger.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Using spark.streaming.concurrentJobs for this probably isn't a good
>>>>>>> idea, as it allows the next batch to start processing before current 
>>>>>>> one is
>>>>>>> finished, which may have unintended consequences.
>>>>>>>
>>>>>>> Why can't you use a single stream with all the topics you care
>>>>>>> about, or multiple streams if you're e.g. joining them?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Dec 16, 2015 at 3:00 PM, jpocalan <jpoca...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Nevermind, I found the answer to my questions.
>>>>>>>> The following spark configuration property will allow you to process
>>>>>>>> multiple KafkaDirectStream in parallel:
>>>>>>>> --conf spark.streaming.concurrentJobs=<something greater than 1>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> View this message in context:
>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-streaming-from-multiple-topics-tp8678p25723.html
>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>> Nabble.com <http://nabble.com>.
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> jean-pierre ocalan
>>>>>> jpoca...@gmail.com
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Kafka - streaming from multiple topics

Reply via email to