[ https://issues.apache.org/jira/browse/SPARK-16473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193675#comment-16193675 ]

Ilya Matiach commented on SPARK-16473:
--------------------------------------

[~podongfeng] Interesting - it looks like the dataset representation is somehow 
changing when it is cached. My guess is that either the row order or the 
numeric values change. The test failure itself is OK if the number of clusters 
is equal to k (which is perfectly fine for the algorithm); it just means the 
dataset was not generated correctly to hit the very specific edge case I was 
looking for, where one cluster is empty after a split in bisecting k-means. I 
can't see the test failure error message in your PR - could you run another 
build and post it here? We may need to add some debugging/print statements to 
determine how the data changes when you cache it. This doesn't mean there is a 
bug in the algorithm; it just means the test needs to be changed so that the 
test data, even after caching, is the same as the original.
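
A quick way to check is to compare what the algorithm sees before and after 
caching. A minimal sketch, assuming a Spark DataFrame named df (names here are 
hypothetical, not from the test):

{code}
import org.apache.spark.sql.DataFrame

// Collect once before caching and once after, then compare both
// contents and order; any difference would explain why the cached
// run diverges from the uncached one.
def sameAfterCaching(df: DataFrame): Boolean = {
  val before = df.collect().toSeq
  val after  = df.cache().collect().toSeq
  before == after
}
{code}

If this returns false, sorting the collected rows before comparing would 
distinguish a reordering from an actual change in the values.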

> BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key not found
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-16473
>                 URL: https://issues.apache.org/jira/browse/SPARK-16473
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 1.6.1, 2.0.0
>         Environment: AWS EC2 linux instance. 
>            Reporter: Alok Bhandari
>            Assignee: Ilya Matiach
>             Fix For: 2.1.1, 2.2.0
>
>
> Hello,
> I am using Apache Spark 1.6.1.
> I am executing the bisecting k-means algorithm on a specific dataset.
> Dataset details:
> K = 100,
> input vectors = 100K * 100K,
> memory assigned = 16 GB per node,
> number of nodes = 2.
> Up to K=75 it works fine, but when I set K=100, it fails with 
> java.util.NoSuchElementException: key not found.
> *I suspect it is failing because of a lack of some resource, but the 
> exception does not convey why this Spark job failed.*
> Can someone please point me to the root cause of this exception and why it 
> is failing?
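> For reference, a minimal sketch of the kind of invocation that hits this 
> path, using the Spark 1.6 MLlib API (the input RDD is a stand-in for the 
> dataset described above):
> {code}
> import org.apache.spark.mllib.clustering.BisectingKMeans
> import org.apache.spark.mllib.linalg.Vector
> import org.apache.spark.rdd.RDD
>
> // data stands in for the 100K * 100K input described above
> def cluster(data: RDD[Vector]) = {
>   new BisectingKMeans()
>     .setK(100)   // K <= 75 completes; K = 100 throws in updateAssignments
>     .run(data)   // returns a BisectingKMeansModel
> }
> {code}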
> This is the exception stack trace:
> {code}
> java.util.NoSuchElementException: key not found: 166
>         at scala.collection.MapLike$class.default(MapLike.scala:228)
>         at scala.collection.AbstractMap.default(Map.scala:58)
>         at scala.collection.MapLike$class.apply(MapLike.scala:141)
>         at scala.collection.AbstractMap.apply(Map.scala:58)
>         at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
>         at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
>         at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
>         at scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
>         at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>         at scala.collection.immutable.List.foldLeft(List.scala:84)
>         at scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
>         at scala.collection.immutable.List.reduceLeft(List.scala:84)
>         at scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231)
>         at scala.collection.AbstractTraversable.minBy(Traversable.scala:105)
>         at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
>         at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>         at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
> {code}
> The issue is that it fails without giving any explicit message as to why.
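> For reference, the "key not found" message itself is standard Scala Map 
> behavior: Map.apply throws java.util.NoSuchElementException when the key is 
> absent (here the key is a cluster index with no entry in the map, consistent 
> with the empty-cluster case described in the comment above). A tiny sketch 
> of the pattern, with hypothetical values:
> {code}
> val centers = Map(164L -> "c164", 165L -> "c165")  // no entry for 166
> // centers(166L)    // throws java.util.NoSuchElementException: key not found: 166
> centers.get(166L)   // None -- the Option-based lookup does not throw
> {code}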


