Re: k-means can only run on one executor with one thread?

2015-03-30 Thread Xiangrui Meng
Hey Xi, Have you tried Spark 1.3.0? The initialization happens on the driver node, and we fixed an issue with the initialization in 1.3.0. Again, please start with a smaller k and increase it gradually. Let us know at what k the problem happens. Best, Xiangrui On Sat, Mar 28, 2015 at 3:11 AM,

Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Xi Shen
I have put more details of my problem at http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed I would really appreciate it if you could help me take a look at this problem. I have tried various settings and ways to load/partition my data, but I just cannot get rid

Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Reza Zadeh
How many dimensions does your data have? The size of the k-means model is k * d, where d is the dimension of the data. Since you're using k=1000, if your data has dimension higher than, say, 10,000, you will have trouble, because k*d doubles have to fit in the driver. Reza On Sat, Mar 28, 2015
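The k * d estimate above can be checked with a quick back-of-the-envelope calculation. The sketch below is only an illustration: the function name is hypothetical, it assumes each center coordinate is stored as an 8-byte double, and it counts raw center storage only, ignoring JVM object overhead and any temporary copies made during training.

```python
# Assumption: each coordinate of a cluster center is a 64-bit (8-byte) double.
BYTES_PER_DOUBLE = 8


def kmeans_model_bytes(k, d):
    """Raw bytes needed to hold k cluster centers of dimension d.

    Hypothetical helper for illustration; it is not part of Spark.
    """
    return k * d * BYTES_PER_DOUBLE


def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / (1024 * 1024)


if __name__ == "__main__":
    # The k=1000, d=10,000 case mentioned in the message above.
    print(to_mib(kmeans_model_bytes(1000, 10_000)))
    # Figures reported elsewhere in this thread: 5000 clusters, about 360 dimensions.
    print(to_mib(kmeans_model_bytes(5000, 360)))
```

Under these assumptions, k=1000 with d=10,000 comes to 80,000,000 bytes of raw doubles, while 5000 clusters at roughly 360 dimensions comes to 14,400,000 bytes.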

Re: k-means can only run on one executor with one thread?

2015-03-28 Thread Xi Shen
My vector dimension is about 360. The data count is about 270k. My driver has 2.9G memory. I attached a screenshot of the current executor status. I submitted this job with --master yarn-cluster. I have a total of 7 worker nodes, one of which acts as the driver. In the screenshot, you can see all

Re: k-means can only run on one executor with one thread?

2015-03-27 Thread Joseph Bradley
Can you try specifying the number of partitions when you load the data to equal the number of executors? If your ETL changes the number of partitions, you can also repartition before calling KMeans. On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen davidshe...@gmail.com wrote: Hi, I have a large

Re: k-means can only run on one executor with one thread?

2015-03-27 Thread Xi Shen
Yes, I have done the repartition. I tried repartitioning to the number of cores in my cluster; that did not help. I tried repartitioning to the number of centroids (the k value); that did not help either. On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley jos...@databricks.com wrote: Can you try specifying the number

k-means can only run on one executor with one thread?

2015-03-26 Thread Xi Shen
Hi, I have a large data set, and I expect to get 5000 clusters. I load the raw data and convert it into DenseVector; then I repartition and cache; finally I pass the RDD[Vector] to KMeans.train(). Now the job is running, and the data are loaded. But according to the Spark UI, all data are
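The load, convert, repartition, cache, and train steps described above might look roughly like the following. This is a PySpark analogue, not the poster's actual code (the message mentions DenseVector and RDD[Vector], which suggests Scala). The input format (whitespace-separated numbers, one vector per line), the function names, and the parameter values are all assumptions made for illustration, and the sketch is not claimed to resolve the reported problem.

```python
def parse_vector(line):
    """Parse one line of whitespace-separated numbers into a list of floats.

    Assumption: the raw data holds one vector per line in this format.
    """
    return [float(tok) for tok in line.split()]


def train_kmeans(sc, path, k, num_partitions, max_iterations=20):
    """Load, vectorize, repartition, cache, then train k-means.

    `sc` is an existing SparkContext; `path`, `num_partitions`, and
    `max_iterations` are hypothetical parameters chosen for this sketch.
    """
    # Imported lazily so parse_vector can be used without a Spark install.
    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.linalg import Vectors

    vectors = (
        sc.textFile(path)
        .map(parse_vector)
        .map(lambda xs: Vectors.dense(xs))
        .repartition(num_partitions)
        .cache()
    )

    # Run an action so the cached RDD is materialized before training,
    # and report how many partitions the training input actually has.
    count = vectors.count()
    print("rows:", count, "partitions:", vectors.getNumPartitions())

    return KMeans.train(vectors, k, maxIterations=max_iterations)
```

Printing the partition count just before training is one way to confirm that the repartition step took effect on the RDD that is actually handed to KMeans.train().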