Thanks for the response. So if I understand correctly, the design is such that all the user and item IDs must fit in memory?
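
To get a feel for whether that is actually the bottleneck on my side, I put together the back-of-envelope estimate below. It assumes the dictionaries behave roughly like a Guava HashBiMap from string ID to integer index, held once per node as you describe; the per-entry byte counts and the example ID counts are my own guesses for illustration, not measured numbers. (The exact command I plan to retry with a larger executor memory is at the bottom, below the quoted thread.)

// Rough, hypothetical estimate of the broadcast ID-dictionary footprint per node.
// Assumes one HashBiMap[String, Integer] per dictionary (users and items), plus an
// approximate per-entry overhead for the String key, the boxed Integer value, and
// the two hash tables a BiMap maintains. All constants are guesses, not measurements.
object IdDictionaryEstimate {
  def approxBytes(numIds: Long, avgIdLength: Int): Long = {
    val stringBytes = 40L + 2L * avgIdLength // String object header + char[] payload
    val boxedIntBytes = 16L                  // boxed Integer value
    val perEntryOverhead = 2L * 48L          // map entries in both directions of the BiMap
    numIds * (stringBytes + boxedIntBytes + perEntryOverhead)
  }

  def main(args: Array[String]): Unit = {
    val users = approxBytes(numIds = 10000000L, avgIdLength = 20) // e.g. 10M user IDs
    val items = approxBytes(numIds = 100000L, avgIdLength = 20)   // e.g. 100K item IDs
    println(f"users ~ ${users / 1e6}%.0f MB, items ~ ${items / 1e6}%.0f MB per node")
  }
}

With those (admittedly rough) constants, 10M user IDs of ~20 characters come out to roughly 2 GB per node, which makes your "ask how much memory it takes to store all user and item IDs as strings" point very concrete.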
On Mon, Apr 6, 2015 at 3:15 PM, Pat Ferrel <[email protected]> wrote:

> Try allocating more memory on executors with -sem 8g, more or less as
> needed. The only large non-RDD objects (RDDs can be partially disk based)
> are Guava HashBiMaps. These are broadcast to all workers, so one copy is
> kept on every node. They grow with the memory needed to store all your
> user and item IDs as string values. This means the number of items or
> users is not unlimited, but the limit is seldom reached on the type of
> cluster machines you get from AWS or another vendor. Ask yourself how much
> memory it takes to store all user and item IDs as strings.
>
> The error below may not be where all the memory is used.
>
>
> On Apr 6, 2015, at 11:04 AM, Andres Perez <[email protected]> wrote:
>
> Hi. I was trying to run the spark-itemsimilarity job on a dataset of mine,
> and am running into Java heap space OOM errors (stack trace below). I
> noticed that the job fails on a reduce step that refers to
> SparkEngine.scala:79, which reduces on a += operation. Presumably this is
> building a DenseVector to contain all the similarity scores against all
> items for each item. Is that list meant to be stored in memory? If so, it
> would seem that the job will fail once we have items numbering in the tens
> of thousands (my guess). Is this a correct assumption? I'm including my
> stack trace:
>
> java.lang.OutOfMemoryError: Java heap space
>     at java.util.Arrays.copyOf(Arrays.java:2271)
>     at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>     at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>     at java.io.DataOutputStream.writeLong(DataOutputStream.java:224)
>     at java.io.DataOutputStream.writeDouble(DataOutputStream.java:259)
>     at org.apache.mahout.math.VectorWritable.writeVectorContents(VectorWritable.java:213)
>     at org.apache.mahout.math.MatrixWritable.writeMatrix(MatrixWritable.java:194)
>     at org.apache.mahout.math.MatrixWritable.write(MatrixWritable.java:56)
>     at org.apache.mahout.sparkbindings.io.WritableKryoSerializer.write(WritableKryoSerializer.scala:29)
>     at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
>     at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37)
>     at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
>     at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
>     at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:128)
>     at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195)
>     at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:751)
>     at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:750)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>     at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:750)
>     at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:746)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>     at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:746)
>     at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:56)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> Thanks,
>
> Andy
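
P.S. For completeness, this is the shape of the command I'm retrying with more executor memory, per the -sem suggestion above. The paths and master URL are placeholders for my setup, and the long option names are just how I read the driver's help output, so please correct me if any of them are off:

mahout spark-itemsimilarity \
  --input /path/to/interactions.tsv \
  --output /path/to/similarity-output \
  --master spark://<master-host>:7077 \
  -sem 8g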
