Hi Akhil,

I do have 25 partitions being created. I have set
the spark.default.parallelism property to 25. The batch interval is 30 seconds
and the block interval is 1200 ms, which also gives us roughly 25 partitions
from the input stream. I can also see 25 partitions being created and used in
the Spark UI. It's just that those tasks wait for cores on N1 to become
free before being scheduled, while N2 sits idle.
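
As a rough check of the partition count above, the arithmetic can be sketched in Python. This is only an illustration: the function name is hypothetical, and it assumes a single receiver that cuts one block per block interval, with each block becoming one partition of the batch.

```python
def blocks_per_batch(batch_interval_ms: int, block_interval_ms: int) -> int:
    """Approximate number of blocks (and so partitions) per batch.

    Assumes one receiver producing one block per block interval.
    """
    return batch_interval_ms // block_interval_ms


# Values from this message: 30 second batches, 1200 ms block interval.
partitions = blocks_per_batch(30_000, 1_200)
print(partitions)
```

With the values quoted here the result matches the 25 partitions seen in the Spark UI.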

The cluster configuration is:

N1: 2 workers, each with 6 cores and 16 GB memory => 12 cores on the first node.
N2: 2 workers, each with 8 cores and 16 GB memory => 16 cores on the second node.

for a grand total of 28 cores. But it still does most of the processing on
N1 (divided between the 2 workers running there), almost completely
disregarding N2 until it's the final stage, where data is being written out to
Elasticsearch. I am not sure I understand why it does not distribute more
partitions to N2 from the start and use it effectively.
Since there are only 12 cores on N1 and 25 total partitions, shouldn't it
send some of those partitions to N2 as well?
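
To make the numbers behind that question concrete, here is a small Python sketch of the core counts. It only counts task slots from the configuration listed above; it does not model how the Spark scheduler actually places tasks, and the variable names are just for illustration.

```python
# (workers, cores per worker) for each node, as listed in this message
nodes = {"N1": (2, 6), "N2": (2, 8)}

# Cores available on each node and in the whole cluster
cores = {name: workers * per_worker for name, (workers, per_worker) in nodes.items()}
total_cores = sum(cores.values())

partitions = 25

# Tasks that cannot run concurrently on N1 alone in a single wave
overflow_beyond_n1 = max(0, partitions - cores["N1"])

print(cores, total_cores, overflow_beyond_n1)
```

So 13 of the 25 tasks either have to wait for a free core on N1 or run on N2, even though the cluster as a whole has enough cores for all 25 at once.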


On Fri, Sep 25, 2015 at 5:28 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> The number of parallel tasks depends entirely on the # of partitions you have.
> If you are not receiving sufficient partitions (partitions > total # cores),
> then try doing a .repartition.
> Thanks
> Best Regards
> On Fri, Sep 25, 2015 at 1:44 PM, N B <nb.nos...@gmail.com> wrote:
>> Hello all,
>> I have a Spark streaming application that reads from a Flume Stream, does
>> quite a few maps/filters in addition to a few reduceByKeyAndWindow and join
>> operations before writing the analyzed output to ElasticSearch inside a
>> foreachRDD()...
>> I recently started to run this on a 2 node cluster (Standalone) with the
>> driver program directly submitting to Spark master on the same host. The
>> way I have divided the resources is as follows:
>> N1: spark Master + driver + flume + 2 spark workers (16gb + 6 cores each
>> worker)
>> N2: 2 spark workers (16 gb + 8 cores each worker).
>> The application works just fine but it is underusing N2 completely. It
>> seems to use N1 (note that both executors on N1 get used) for all the
>> analytics but when it comes to writing to Elasticsearch, it does divide the
>> data around into all 4 executors which then write to ES on a separate host.
>> I am puzzled as to why the data is not being distributed evenly from the
>> get go into all 4 executors and why would it only do so in the final step
>> of the pipeline which seems counterproductive as well?
>> CPU usage on N1 is near the peak, while on N2 it is < 10% of overall capacity.
>> Any help in getting the resources more evenly utilized on N1 and N2 is
>> welcome.
>> Thanks in advance,
>> Nikunj
