Thanks for your answers. The dataset is only 400 MB, so I shouldn't be running out of memory. I have since restructured my code, because I had forgotten to cache my dataset, and reduced the number of iterations to 2, but I still get kicked out of Spark. Did I cache the data incorrectly (sorry, not an expert)?
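For readability, here is the same pipeline written as one chained expression (each setter returns the same KMeans instance, so this should be equivalent to the line-by-line REPL entry below; it assumes a running spark-shell providing `sc`):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load the data, parse each space-separated line into a dense vector, and cache it
val data = sc.textFile("data/outkmeanssm.txt")
val train = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Configure and run k-means: 2 clusters, 2 iterations, k-means|| initialization
val model = new KMeans()
  .setInitializationMode("k-means||")
  .setK(2)
  .setMaxIterations(2)
  .setEpsilon(1e-4)
  .setRuns(1)
  .run(train)
```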
scala> import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.clustering.KMeans

scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors

scala> // Load and parse the data
scala> val data = sc.textFile("data/outkmeanssm.txt")
14/08/07 19:59:10 INFO MemoryStore: ensureFreeSpace(35456) called with curMem=0, maxMem=318111744
14/08/07 19:59:10 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 34.6 KB, free 303.3 MB)
data: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:14

scala> val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
parsedData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at <console>:16

scala> val train = parsedData.cache()
train: parsedData.type = MappedRDD[2] at map at <console>:16

scala> // Set model
scala> val model = new KMeans()
model: org.apache.spark.mllib.clustering.KMeans = org.apache.spark.mllib.clustering.KMeans@4c5fa12d

scala> .setInitializationMode("k-means||")
res0: org.apache.spark.mllib.clustering.KMeans = org.apache.spark.mllib.clustering.KMeans@4c5fa12d

scala> .setK(2)
res1: org.apache.spark.mllib.clustering.KMeans = org.apache.spark.mllib.clustering.KMeans@4c5fa12d

scala> .setMaxIterations(2)
res2: org.apache.spark.mllib.clustering.KMeans = org.apache.spark.mllib.clustering.KMeans@4c5fa12d

scala> .setEpsilon(1e-4)
res3: org.apache.spark.mllib.clustering.KMeans = org.apache.spark.mllib.clustering.KMeans@4c5fa12d

scala> .setRuns(1)
res4: org.apache.spark.mllib.clustering.KMeans = org.apache.spark.mllib.clustering.KMeans@4c5fa12d

scala> .run(train)
14/08/07 19:59:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/08/07 19:59:22 WARN LoadSnappy: Snappy native library not loaded
14/08/07 19:59:22 INFO FileInputFormat: Total input paths to process : 1
14/08/07 19:59:22 INFO SparkContext: Starting job: takeSample at KMeans.scala:260
14/08/07 19:59:22 INFO DAGScheduler: Got job 0 (takeSample at KMeans.scala:260) with 7 output partitions (allowLocal=false)
14/08/07 19:59:22 INFO DAGScheduler: Final stage: Stage 0(takeSample at KMeans.scala:260)
14/08/07 19:59:22 INFO DAGScheduler: Parents of final stage: List()
14/08/07 19:59:22 INFO DAGScheduler: Missing parents: List()
14/08/07 19:59:22 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[6] at map at KMeans.scala:123), which has no missing parents
14/08/07 19:59:22 INFO DAGScheduler: Submitting 7 missing tasks from Stage 0 (MappedRDD[6] at map at KMeans.scala:123)
14/08/07 19:59:22 INFO TaskSchedulerImpl: Adding task set 0.0 with 7 tasks
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:0 as 2224 bytes in 1 ms
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:1 as 2224 bytes in 0 ms
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:2 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:2 as 2224 bytes in 0 ms
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:3 as TID 3 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:3 as 2224 bytes in 1 ms
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:4 as TID 4 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:4 as 2224 bytes in 0 ms
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:5 as TID 5 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:5 as 2224 bytes in 1 ms
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:6 as TID 6 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:6 as 2224 bytes in 0 ms
14/08/07 19:59:22 INFO Executor: Running task ID 1
14/08/07 19:59:22 INFO Executor: Running task ID 4
14/08/07 19:59:22 INFO Executor: Running task ID 3
14/08/07 19:59:22 INFO Executor: Running task ID 2
14/08/07 19:59:22 INFO Executor: Running task ID 0
14/08/07 19:59:22 INFO Executor: Running task ID 5
14/08/07 19:59:22 INFO Executor: Running task ID 6
14/08/07 19:59:22 INFO BlockManager: Found block broadcast_0 locally
14/08/07 19:59:22 INFO CacheManager: Partition rdd_2_1 not found, computing it
14/08/07 19:59:22 INFO CacheManager: Partition rdd_2_6 not found, computing it
14/08/07 19:59:22 INFO CacheManager: Partition rdd_2_5 not found, computing it
14/08/07 19:59:22 INFO CacheManager: Partition rdd_2_0 not found, computing it
14/08/07 19:59:22 INFO CacheManager: Partition rdd_2_2 not found, computing it
14/08/07 19:59:22 INFO CacheManager: Partition rdd_2_3 not found, computing it
14/08/07 19:59:22 INFO CacheManager: Partition rdd_2_4 not found, computing it
14/08/07 19:59:22 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:201326592+24305610
14/08/07 19:59:22 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:33554432+33554432
14/08/07 19:59:22 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:67108864+33554432
14/08/07 19:59:22 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:134217728+33554432
14/08/07 19:59:22 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:0+33554432
14/08/07 19:59:22 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:100663296+33554432
14/08/07 19:59:22 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:167772160+33554432
14/08/07 20:00:40 ERROR Executor: Exception in task ID 0
java.lang.OutOfMemoryError: Java heap space
    at scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:99)
    at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:47)
    at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:83)
    at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
    at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.ZippedRDD.compute(ZippedRDD.scala:66)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:695)
14/08/07 20:00:40 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.OutOfMemoryError: Java heap space
14/08/07 20:00:40 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/08/07 20:00:41 WARN TaskSetManager: Loss was due to java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
14/08/07 20:00:41 ERROR TaskSetManager: Task 0.0:0 failed 1 times; aborting job
Chairs-MacBook-Pro:spark-1.0.0 admin$

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-Input-Format-tp11654p11698.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.