Different spark-submit per topic. On Mon, Dec 21, 2015 at 11:36 AM, Neelesh <neele...@gmail.com> wrote:
> Thanks Cody. My case is #2. Just wanted to confirm when you say different > spark jobs, do you mean one spark-submit per topic, or just use different > threads in the driver to submit the job? > > Thanks! > > On Mon, Dec 21, 2015 at 8:05 AM, Cody Koeninger <c...@koeninger.org> > wrote: > >> Spark streaming by default wont start the next batch until the current >> batch is completely done, even if only a few cores are still working. This >> is generally a good thing, otherwise you'd have weird ordering issues. >> >> Each topicpartition is separate. >> >> Unbalanced partitions can happen either because 1. your hash while >> producing into kafka is bad (e.g. hash on customer id and 90% of your >> traffic is from one customer); or 2. because you're doing a map over >> several topics with wildly different message rates. If you're having the >> first problem, you just need to fix it. If you're having the second >> problem, use different spark jobs for different topics. >> >> On Sun, Dec 20, 2015 at 2:28 PM, Neelesh <neele...@gmail.com> wrote: >> >>> @Chris, >>> There is a 1-1 mapping b/w spark partitions & kafka partitions out >>> of the box . One can break it by repartitioning of course and add more >>> parallelism, but that has its own issues around consumer offset management- >>> when do I commit the offsets, for example. While its trivial to increase >>> the number of Kafka partitions to achieve higher parallelism, it is >>> inherently static and cannot solve the problem of traffic bursts. If it >>> were possible to repartition data from a single kafka partition and still >>> be able to handle consumer offset management when all child partitions that >>> make up the kafka partition succeeded/failed, it would solve the problem. >>> >>> @Cody, would love to hear your thoughts around this. >>> >>> On Sun, Dec 20, 2015 at 8:37 AM, Chris Fregly <ch...@fregly.com> wrote: >>> >>>> separating out your code into separate streaming jobs - especially when >>>> there are no dependencies between the jobs - is almost always the best >>>> route. it's easier to combine atoms (fusion), then split them (fission). >>>> >>>> I recommend splitting out jobs along batch window, stream window, and >>>> state-tracking characteristics. >>>> >>>> for example, imagine 3 separate jobs for the following: >>>> >>>> 1) light storing of raw data into Cassandra (500ms batch interval) >>>> 2) medium aggregations/window roll ups (2000ms batch interval) >>>> 3) heavy training a ML model (10000ms batch interval). >>>> >>>> and reminder that you can control (isolate or combine) the spark >>>> resources used by these separate, single purpose streaming jobs using >>>> scheduler pools just like your batch spark jobs. >>>> >>>> @cody: curious about neelesh's question, as well. does the Kafka >>>> Direct Stream API treat each Kafka Topic Partition separately in terms of >>>> parallel retrieval? >>>> >>>> more context: within a Kafka Topic partition, Kafka guarantees order, >>>> but not total ordering across partitions. this is normal and expected. >>>> >>>> so I assume the the Kafka Direct Streaming connector can retrieve (and >>>> recover/retry) from separate partitions in parallel and still maintain the >>>> ordering guarantees offered by Kafka. >>>> >>>> if this is true, then I'd suggest @neelesh create more partitions >>>> within the Kafka Topic to improve parallelism - same as any distributed, >>>> partitioned data processing engine including spark. >>>> >>>> if this is not true, is there a technical limitation to prevent this >>>> parallelism within the connector? >>>> >>>> On Dec 19, 2015, at 5:51 PM, Neelesh <neele...@gmail.com> wrote: >>>> >>>> A related issue - When I put multiple topics in a single stream, the >>>> processing delay is as bad as the slowest task in the number of tasks >>>> created. Even though the topics are unrelated to each other, RDD at time >>>> "t1" has to wait for the RDD at "t0" is fully executed, even if most >>>> cores are idling, and just one task is still running and the rest of them >>>> have completed. Effectively, a lightly loaded topic gets the worst deal >>>> because of a heavily loaded topic >>>> >>>> Is my understanding correct? >>>> >>>> >>>> >>>> On Thu, Dec 17, 2015 at 9:53 AM, Cody Koeninger <c...@koeninger.org> >>>> wrote: >>>> >>>>> You could stick them all in a single stream, and do mapPartitions, >>>>> then switch on the topic for that partition. It's probably cleaner to do >>>>> separate jobs, just depends on how you want to organize your code. >>>>> >>>>> On Thu, Dec 17, 2015 at 11:11 AM, Jean-Pierre OCALAN < >>>>> jpoca...@gmail.com> wrote: >>>>> >>>>>> Hi Cody, >>>>>> >>>>>> First of all thanks for the note about >>>>>> spark.streaming.concurrentJobs. I guess this is why it's not mentioned in >>>>>> the actual spark streaming doc. >>>>>> Since those 3 topics contain completely different data on which I >>>>>> need to apply different kind of transformations, I am not sure joining >>>>>> them >>>>>> would be really efficient, unless you know something that I don't. >>>>>> >>>>>> As I really don't need any interaction between those streams, I think >>>>>> I might end up running 3 different streaming apps instead of one. >>>>>> >>>>>> Thanks again! >>>>>> >>>>>> On Thu, Dec 17, 2015 at 11:43 AM, Cody Koeninger <c...@koeninger.org> >>>>>> wrote: >>>>>> >>>>>>> Using spark.streaming.concurrentJobs for this probably isn't a good >>>>>>> idea, as it allows the next batch to start processing before current >>>>>>> one is >>>>>>> finished, which may have unintended consequences. >>>>>>> >>>>>>> Why can't you use a single stream with all the topics you care >>>>>>> about, or multiple streams if you're e.g. joining them? >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Dec 16, 2015 at 3:00 PM, jpocalan <jpoca...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Nevermind, I found the answer to my questions. >>>>>>>> The following spark configuration property will allow you to process >>>>>>>> multiple KafkaDirectStream in parallel: >>>>>>>> --conf spark.streaming.concurrentJobs=<something greater than 1> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> View this message in context: >>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-streaming-from-multiple-topics-tp8678p25723.html >>>>>>>> Sent from the Apache Spark User List mailing list archive at >>>>>>>> Nabble.com <http://nabble.com>. >>>>>>>> >>>>>>>> >>>>>>>> --------------------------------------------------------------------- >>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> jean-pierre ocalan >>>>>> jpoca...@gmail.com >>>>>> >>>>> >>>>> >>>> >>> >> >