Yes, that makes sense, but it doesn't make the jobs CPU-bound. What is the bottleneck -- the model building or other stages? I would think you could get the model building to be CPU-bound, unless you have chopped it up into really small partitions. I think it's best to look further into which stages are slow, and what they seem to be spending time on -- GC? I/O?
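(Not from the thread -- a minimal sketch of one way to check the GC question. It assumes the job is built with a SparkConf in Scala; the app name is hypothetical, but spark.executor.extraJavaOptions and the JVM GC-logging flags are standard. The stage detail page in the Spark UI also shows a per-task "GC Time" column.)

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: enable GC logging on the executors so their stdout (and the
    // "GC Time" column on the stage page) shows whether slow stages are
    // dominated by garbage collection rather than computation.
    val conf = new SparkConf()
      .setAppName("lr-profiling") // hypothetical app name
      .set("spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    val sc = new SparkContext(conf)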
On Fri, Feb 20, 2015 at 12:18 PM, Dirceu Semighini Filho
<dirceu.semigh...@gmail.com> wrote:
> Hi Sean,
> I'm trying to increase the CPU usage by running logistic regression on
> different datasets in parallel. They shouldn't depend on each other.
> I train several logistic regression models from different column
> combinations of a main dataset. I processed the combinations in a ParArray
> in an attempt to increase CPU usage, but it did not help.
>
>
> 2015-02-20 8:17 GMT-02:00 Sean Owen <so...@cloudera.com>:
>
>> It sounds like your computation just isn't CPU bound, right? Or maybe
>> only some stages are. It's not clear what work you are doing
>> beyond the core LR.
>>
>> Stages don't wait on each other unless one depends on the other. You'd
>> have to clarify what you mean by running stages in parallel, like what
>> the interdependencies are.
>>
>> On Fri, Feb 20, 2015 at 10:01 AM, Dirceu Semighini Filho
>> <dirceu.semigh...@gmail.com> wrote:
>> > Hi all,
>> > I'm running Spark 1.2.0, in standalone mode, on different cluster and
>> > server sizes. All of my data is cached in memory.
>> > Basically I have a mass of data, about 8 GB with about 37k columns, and
>> > I'm running different configurations of a BinaryLogisticRegressionBFGS.
>> > When I run Spark on 9 servers (1 master and 8 slaves), with 32 cores
>> > each, I notice that the CPU usage varies from 20% to 50% (counting the
>> > CPU usage of the 9 servers in the cluster).
>> > First I tried to repartition the RDDs to the same number as the total
>> > client cores (256), but that didn't help. Then I tried to set the
>> > property *spark.default.parallelism* to the same number (256), but that
>> > didn't help to increase the CPU usage either.
>> > Looking at the Spark monitoring tool, I saw that some stages took 52s
>> > to complete.
>> > My last shot was trying to run some tasks in parallel, but when I start
>> > running tasks in parallel (4 tasks) the total CPU time spent to complete
>> > them increases by about 10%, so task parallelism didn't help.
>> > Looking at the monitoring tool I noticed that when running tasks in
>> > parallel, the stages complete together: if I have 4 stages running in
>> > parallel (A, B, C and D), and A, B and C finish, they will wait for D to
>> > mark all 4 stages as completed. Is that right?
>> > Is there any way to improve the CPU usage when running on large servers?
>> > Is spending more time when running tasks in parallel expected behaviour?
>> >
>> > Kind Regards,
>> > Dirceu
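(For illustration only, not code from the thread: a minimal Scala sketch of the pattern Dirceu describes -- one LogisticRegressionWithLBFGS model per column combination, with .par so the independent jobs are submitted from concurrent driver threads. The project helper and the dense column-subset representation are assumptions; the MLlib calls themselves exist in Spark 1.2.)

    import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Hypothetical projection: keep only the requested feature columns.
    // Assumes the cached data is an RDD[LabeledPoint] with dense features.
    def project(data: RDD[LabeledPoint], cols: Seq[Int]): RDD[LabeledPoint] =
      data.map(p => LabeledPoint(p.label, Vectors.dense(cols.map(p.features(_)).toArray)))

    // Train one binary LR model per column combination. Calling .par turns
    // the array into a ParArray, so each run() is submitted from its own
    // driver thread and the resulting Spark jobs can overlap on the cluster.
    def trainAll(data: RDD[LabeledPoint],
                 combinations: Array[Seq[Int]]): Array[(Seq[Int], LogisticRegressionModel)] =
      combinations.par.map { cols =>
        val subset = project(data, cols).cache()
        cols -> new LogisticRegressionWithLBFGS().setNumClasses(2).run(subset)
      }.toArray

Whether those concurrent jobs actually overlap on the executors depends on free cores and the scheduler; the fair scheduler (spark.scheduler.mode=FAIR) is commonly suggested when submitting jobs from multiple threads like this.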