Hi all,
I'm running Spark 1.2.0 in standalone mode, on clusters and servers of
different sizes. All of my data is cached in memory.
Basically I have a large dataset, about 8 GB, with about 37k columns, and
I'm running different configurations of a BinaryLogisticRegressionBFGS.
When I ran Spark on 9 servers (1 master and 8 slaves) with 32 cores
each, I noticed that the CPU usage varied between 20% and 50% (measured
across all 9 servers in the cluster).
First I tried to repartition the RDDs to the total number of worker
cores (256), but that didn't help. After that I tried setting the
property *spark.default.parallelism* to the same number (256), but that
didn't increase the CPU usage either.
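
To make the 256 figure explicit, here is a minimal Python sketch of the arithmetic. The variable names are only for illustration; the one real Spark name in it is the spark.default.parallelism property, written the way it could appear in conf/spark-defaults.conf.

```python
# Cluster layout described above: 1 master plus 8 worker ("slave") servers.
workers = 8
cores_per_server = 32

# Total worker cores; this is the number I used both for the repartition
# and for the spark.default.parallelism setting.
total_worker_cores = workers * cores_per_server

# Property line as it could appear in conf/spark-defaults.conf
# (key and value separated by whitespace).
conf_line = "spark.default.parallelism {}".format(total_worker_cores)
print(conf_line)
```
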
Looking at the Spark monitoring tool, I saw that some stages took 52s to
complete.
My last attempt was to run some tasks in parallel, but when I started
running tasks in parallel (4 tasks), the total CPU time needed to complete
them increased by about 10%, so task parallelism didn't help either.
Looking at the monitoring tool, I noticed that when running tasks in
parallel the stages complete together. If I have 4 stages running in
parallel (A, B, C and D) and A, B and C finish first, they wait for D
before all 4 stages are marked as completed. Is that right?
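
To illustrate what I mean, here is a toy Python model of the behaviour I think I am seeing. It is not Spark code and says nothing about how Spark's scheduler really works; the names (run_stage, durations) and the sleep times are made up for the example. Four pieces of work run in parallel threads, and the batch as a whole only counts as done once the slowest one (D) finishes.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def run_stage(name, seconds):
    """Stand-in for a stage: sleep, then report when it finished."""
    time.sleep(seconds)
    return name, time.monotonic()

# Hypothetical durations in seconds: A, B and C are fast, D is slow.
durations = {"A": 0.1, "B": 0.1, "C": 0.1, "D": 0.4}

start = time.monotonic()
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_stage, n, s) for n, s in durations.items()]
    wait(futures)  # returns only after every stage, including D, is done
batch_done = time.monotonic() - start

# Per-stage finish times relative to the start of the batch.
finish = {}
for f in futures:
    name, finished_at = f.result()
    finish[name] = finished_at - start

print("A finished after %.2fs, batch done after %.2fs" % (finish["A"], batch_done))
```

In this model A, B and C are finished long before the batch is reported as done, which matches what the monitoring tool seems to show me.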
Is there any way to improve the CPU usage when running on large servers?
Is it expected behaviour for tasks to take more time when run in parallel?

Kind Regards,
Dirceu
