So you have come across spark.streaming.concurrentJobs already :)
Yeah, that is an undocumented feature that does allow multiple output
operations to be submitted in parallel. However, it is not made public for
the exact reason that you realized - the semantics in the case of stateful
operations are not clear. It is semantically safe to enable; however, it
may cause redundant computation, as the next batch's jobs may compute some
RDDs a second time rather than using the values cached by other jobs.
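
In case a concrete setting helps, here is a minimal sketch (the value of
2 and the app name are just illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("ConcurrentJobsSketch")  // illustrative name
      // Undocumented: allow up to 2 batch jobs to run in parallel.
      .set("spark.streaming.concurrentJobs", "2")
    val ssc = new StreamingContext(conf, Seconds(5))  // 5-second batches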

In general, it is useful to increase this concurrency in only a very few
cases. If the batch processing time > batch interval, then you need to
use more resources, and parallelize the ingestion and processing enough to
utilize those resources efficiently.
The spikes that you see despite low average hardware utilization probably
indicate that the parallelization of the Spark Streaming jobs is
insufficient. There are a bunch of optimizations that can be done:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#reducing-the-processing-time-of-each-batch
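
For example, two common knobs from that page, sketched here (the source,
host, port, and partition counts are illustrative):

    // 1. Repartition received data so the heavy stages use all cores.
    val lines = ssc.socketTextStream("host", 9999)  // stand-in source
    val repartitioned = lines.repartition(16)

    // 2. Ingest through several receivers and union them, so receiving
    //    itself is parallelized across the cluster.
    val streams = (1 to 4).map(_ => ssc.socketTextStream("host", 9999))
    val unioned = ssc.union(streams)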

If you have already done this, can you tell me more about what sort of
utilization and spikes you see, and what sort of parallelization you have
already done?

TD

On Thu, Feb 12, 2015 at 12:09 PM, Matus Faro <matus.f...@kik.com> wrote:

> I've been experimenting with my configuration for a couple of days and
> gained quite a bit of performance through small optimizations, but it may
> very well be something crazy I'm doing that is causing this problem.
>
> To give a little bit of background, I am in the early stages of a
> project that consumes a stream of data on the order of 100,000 events per
> second and requires processing over a sliding window of one day (ideally a
> week). Spark Streaming is a good candidate, but I want to make sure I
> squash any performance issues ahead of time before I commit.
>
> With a 5-second batch size, within 40 minutes the processing time also
> reaches 5 seconds. I see CPU spikes lasting about two seconds out of every
> five. I assume the sliding window operation is very expensive in this case
> and is the root cause of this effect.
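>
> For concreteness, this is the shape of the windowed computation I mean (a
> sketch with made-up names; the inverse-reduce variant requires
> checkpointing, but should keep each batch's work incremental):
>
>     import org.apache.spark.streaming.{Minutes, Seconds}
>     import org.apache.spark.streaming.StreamingContext._
>
>     ssc.checkpoint("hdfs:///checkpoints")  // illustrative path
>     val events = ssc.socketTextStream("host", 9999)  // stand-in source
>     // Count per key over a one-day window sliding every 5 seconds.
>     // The inverse function subtracts data leaving the window, so each
>     // batch does incremental work instead of rescanning the whole day.
>     val counts = events.map(e => (e, 1L)).reduceByKeyAndWindow(
>       (a: Long, b: Long) => a + b,  // new data entering the window
>       (a: Long, b: Long) => a - b,  // old data leaving the window
>       Minutes(1440),                // window length: one day
>       Seconds(5)                    // slide: every batch
>     )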
>
> I should've done a little bit more research before I posted, I just came
> across a post about an undocumented property spark.streaming.concurrentJobs
> that I am about to try. I'm still confused about how exactly this works
> with a sliding window, where the result of one batch depends on the
> previous one. I assume
> the concurrency can only be achieved up until the window action is
> executed. Either way, I am going to give this a try and post back here if
> that doesn't work.
>
> Thanks!
>
>
>
> On Thu, Feb 12, 2015 at 2:55 PM, Arush Kharbanda <
> ar...@sigmoidanalytics.com> wrote:
>
>> It could depend on the nature of your application, but Spark Streaming
>> uses Spark internally, so some concurrency should already be there. What
>> is your use case?
>>
>> Are you sure that your configuration is good?
>>
>>
>> On Fri, Feb 13, 2015 at 1:17 AM, Matus Faro <matus.f...@kik.com> wrote:
>>
>>> Hi,
>>>
>>> Please correct me if I'm wrong: in Spark Streaming, the next batch will
>>> not start processing until the previous batch has completed. Is there
>>> any way to start processing the next batch if the previous batch is
>>> taking longer to process than the batch interval?
>>>
>>> The problem I am facing is that I don't see a hardware bottleneck in
>>> my Spark cluster, but Spark is not able to handle the amount of data I
>>> am pumping through (the batch processing time is longer than the batch
>>> interval). What I'm seeing is spikes of CPU, network, and disk IO usage,
>>> which I assume are due to different stages of a job, but on average
>>> the hardware is underutilized. Concurrency in batch processing would
>>> allow the average batch processing time to be greater than the batch
>>> interval while fully utilizing the hardware.
>>>
>>> Any ideas on what can be done? One option I can think of is to split
>>> the application into multiple applications running concurrently and
>>> divide the initial stream of data between them. However, I would
>>> lose the benefits of having a single application.
>>>
>>> Thank you,
>>> Matus
>>>
>>
>>
>> --
>>
>> *Arush Kharbanda* || Technical Teamlead
>>
>> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>>
>
>
