[
https://issues.apache.org/jira/browse/SPARK-6706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395639#comment-14395639
]
Sean Owen commented on SPARK-6706:
----------------------------------
I tried your code locally against master with k=1000 (you say >100, but it works
at 500, so I tried 1000), which you can do by building Spark and running the
shell. I don't see it stuck in any {{collect()}} stage; those complete quickly.
However, the driver does bog down for a long time in {{LocalKMeans}}:
{code}
at com.github.fommil.netlib.F2jBLAS.ddot(F2jBLAS.java:71)
at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:121)
at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:104)
at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:311)
at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:522)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:496)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:490)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.GenSeqViewLike$Sliced$class.foreach(GenSeqViewLike.scala:42)
at scala.collection.mutable.IndexedSeqView$$anon$2.foreach(IndexedSeqView.scala:80)
at org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:490)
at org.apache.spark.mllib.clustering.KMeans$.pointCost(KMeans.scala:513)
at org.apache.spark.mllib.clustering.LocalKMeans$$anonfun$kMeansPlusPlus$1$$anonfun$3.apply(LocalKMeans.scala:53)
at org.apache.spark.mllib.clustering.LocalKMeans$$anonfun$kMeansPlusPlus$1$$anonfun$3.apply(LocalKMeans.scala:52)
at scala.collection.GenTraversableViewLike$Mapped$$anonfun$foreach$2.apply(GenTraversableViewLike.scala:81)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
at scala.collection.GenTraversableViewLike$Mapped$class.foreach(GenTraversableViewLike.scala:80)
at scala.collection.SeqViewLike$$anon$3.foreach(SeqViewLike.scala:78)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
at scala.collection.SeqViewLike$AbstractTransformed.foldLeft(SeqViewLike.scala:43)
at scala.collection.TraversableOnce$class.sum(TraversableOnce.scala:203)
at scala.collection.SeqViewLike$AbstractTransformed.sum(SeqViewLike.scala:43)
at org.apache.spark.mllib.clustering.LocalKMeans$$anonfun$kMeansPlusPlus$1.apply$mcVI$sp(LocalKMeans.scala:54)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:49)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$22.apply(KMeans.scala:396)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$22.apply(KMeans.scala:393)
{code}
I think this is what Derrick was getting at in SPARK-3220, that this bit
doesn't scale.
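To illustrate why a loop shaped like the one in the trace may not scale, here
is a minimal plain-Python sketch. It is not the MLlib code: the names
seed_naive, seed_cached, sq_dist and pick_index are hypothetical, weights are
omitted, and the cached variant is only one possible alternative, not a
proposed patch. It assumes, as the trace suggests, that each round of
k-means++ seeding recomputes every point's cost against all centers chosen so
far, and it counts distance evaluations for that approach versus caching each
point's best cost and comparing only against the newest center.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pick_index(costs, rng):
    """Sample an index with probability proportional to its cost."""
    r = rng.random() * sum(costs)
    cumulative = 0.0
    for j, c in enumerate(costs):
        cumulative += c
        if cumulative >= r:
            return j
    return len(costs) - 1


def seed_naive(points, k, rng):
    """Recompute each point's cost against ALL current centers every round."""
    centers = [points[rng.randrange(len(points))]]
    evals = 0  # number of distance evaluations performed
    for _ in range(1, k):
        costs = []
        for p in points:
            costs.append(min(sq_dist(p, c) for c in centers))
            evals += len(centers)
        centers.append(points[pick_index(costs, rng)])
    return centers, evals


def seed_cached(points, k, rng):
    """Keep each point's best cost; compare only against the newest center."""
    centers = [points[rng.randrange(len(points))]]
    costs = [float("inf")] * len(points)
    evals = 0  # number of distance evaluations performed
    for _ in range(1, k):
        newest = centers[-1]
        for j, p in enumerate(points):
            costs[j] = min(costs[j], sq_dist(p, newest))
            evals += 1
        centers.append(points[pick_index(costs, rng)])
    return centers, evals
```

With n points, the naive variant performs n * k * (k - 1) / 2 distance
evaluations and the cached one n * (k - 1), each costing time proportional to
the vector dimension. Given the same seed, both pick the same centers, so the
difference is purely in the amount of work, which grows quickly when both k
and the dimension are large.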
> kmeans|| hangs for a long time if both k and vector dimension are large
> -----------------------------------------------------------------------
>
> Key: SPARK-6706
> URL: https://issues.apache.org/jira/browse/SPARK-6706
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.2.1, 1.3.0
> Environment: Windows 64bit, Linux 64bit
> Reporter: Xi Shen
> Assignee: Xiangrui Meng
> Labels: performance
> Attachments: kmeans-debug.7z
>
>
> When doing k-means clustering with the "kmeans||" algorithm, which is the
> default, the algorithm hangs at some "collect" step for a long time.
> Settings:
> - k above 100
> - feature dimension about 360
> - total data size is about 100 MB
> The issue was first noticed with Spark 1.2.1. I tested with both local and
> cluster mode. On Spark 1.3.0, I can also reproduce this issue in local
> mode. **However, I do not have a 1.3.0 cluster environment to test with.**