Are you specifying the number of reducers in all the DStream.*ByKey
operations? If the number of reducers is not set, the number used in those
stages can keep changing across batches.
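
For example, here is a rough sketch of fixing the reducer count; `pairs` is a
placeholder for the Kafka-derived (key, value) stream in your job:

  import org.apache.spark.streaming.dstream.DStream

  // `pairs` stands in for the (key, value) DStream built from Kafka.
  val pairs: DStream[(String, String)] = ???

  // Passing an explicit partition count to the *ByKey operation keeps the
  // number of reducers fixed (here 300) for every batch:
  val grouped = pairs.groupByKey(300)

  // The same applies to reduceByKey and the other *ByKey variants:
  val counts = pairs.mapValues(_ => 1L).reduceByKey(_ + _, 300)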

TD


On Wed, Jul 9, 2014 at 4:05 PM, Bill Jay <bill.jaypeter...@gmail.com> wrote:

> Hi all,
>
> I have a Spark streaming job running on YARN. It consumes data from Kafka
> and groups the data by a certain field. The data size is 480k lines per
> minute and the batch size is 1 minute.
>
> For some batches, the program takes more than 3 minutes to finish
> the groupBy operation, which seems slow to me. I allocated 300 workers and
> specified 300 as the partition number for groupBy. When I checked the slow
> stage "combineByKey at ShuffledDStream.scala:42", there were sometimes only
> 2 executors allocated for this stage. However, during other batches, there
> can be several hundred executors for the same stage, which means the number
> of executors for the same operation changes.
>
> Does anyone know how Spark allocates the number of executors for different
> stages and how to make this job more efficient? Thanks!
>
> Bill
>
