[ https://issues.apache.org/jira/browse/SYSTEMML-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15510944#comment-15510944 ]
Mike Dusenberry commented on SYSTEMML-946:
------------------------------------------

There was an additional GC overhead limit error:

{code}
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.apache.sysml.runtime.matrix.data.SparseRow.<init>(SparseRow.java:49)
	at org.apache.sysml.runtime.matrix.data.SparseBlockMCSR.allocate(SparseBlockMCSR.java:122)
	at org.apache.sysml.runtime.util.FastBufferedDataInputStream.readSparseRows(FastBufferedDataInputStream.java:219)
	at org.apache.sysml.runtime.matrix.data.MatrixBlock.readSparseBlock(MatrixBlock.java:1934)
	at org.apache.sysml.runtime.matrix.data.MatrixBlock.readFields(MatrixBlock.java:1861)
	at org.apache.sysml.runtime.matrix.data.MatrixBlock.readExternal(MatrixBlock.java:2384)
	at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1842)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1799)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
	at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
	at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.readNextItem(ExternalAppendOnlyMap.scala:478)
	at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.hasNext(ExternalAppendOnlyMap.scala:498)
	at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
	at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.org$apache$spark$util$collection$ExternalAppendOnlyMap$ExternalIterator$$readNextHashCode(ExternalAppendOnlyMap.scala:299)
	at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$5.apply(ExternalAppendOnlyMap.scala:279)
	at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$5.apply(ExternalAppendOnlyMap.scala:277)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.<init>(ExternalAppendOnlyMap.scala:277)
	at org.apache.spark.util.collection.ExternalAppendOnlyMap.iterator(ExternalAppendOnlyMap.scala:253)
	at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:60)
	at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
	at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
{code}
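For context, the trace bottoms out in per-row {{SparseRow}} allocation inside the MCSR sparse block while deserializing shuffled matrix blocks. Below is a minimal, self-contained sketch (hypothetical classes, not the actual SystemML ones) of why this allocation pattern stresses the GC: one small object plus two small arrays per row, multiplied across many blocks, produces millions of tiny allocations.

{code}
import java.util.ArrayList;
import java.util.List;

public class McsrAllocationSketch {

    // Hypothetical stand-in for SparseRow: one small heap object per matrix row.
    static final class Row {
        int[] indexes;
        double[] values;
        Row(int estimatedNnz) {
            indexes = new int[estimatedNnz];
            values = new double[estimatedNnz];
        }
    }

    public static void main(String[] args) {
        final int numBlocks = 1000;     // sparse blocks deserialized on one executor
        final int rowsPerBlock = 1000;  // SystemML's default block size
        List<Row[]> live = new ArrayList<>();
        for (int b = 0; b < numBlocks; b++) {
            Row[] block = new Row[rowsPerBlock];
            for (int r = 0; r < rowsPerBlock; r++) {
                // One object plus two arrays per row: 3M allocations in total,
                // most of them tiny -- the pattern that triggers
                // "GC overhead limit exceeded" once the heap is nearly full.
                block[r] = new Row(8);
            }
            live.add(block); // keep blocks reachable, as a shuffle read would
        }
        System.out.println("Allocated " + (long) numBlocks * rowsPerBlock + " row objects");
    }
}
{code}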
> OOM on spark dataframe-matrix / csv-matrix conversion
> -----------------------------------------------------
>
>                 Key: SYSTEMML-946
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-946
>             Project: SystemML
>          Issue Type: Bug
>          Components: Runtime
>            Reporter: Matthias Boehm
>
> The decision on dense/sparse block allocation in our dataframeToBinaryBlock
> and csvToBinaryBlock data converters is purely based on the sparsity. This
> works very well for the common case of tall & skinny matrices.
> However, for scenarios with dense data but a huge number of columns, a single
> partition will rarely have 1000 rows to fill an entire row of blocks. This
> leads to unnecessary allocation and dense-sparse conversion as well as
> potential out-of-memory errors because the temporary memory requirement can
> be up to 1000x larger than the input partition.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
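To make the quoted blow-up concrete, here is a back-of-the-envelope sketch (hypothetical helper methods, not SystemML's converter API) of the worst case under the default 1000-row block size: a partition holding a single row of a wide, dense matrix still forces allocation of a full 1000-row strip of dense blocks.

{code}
public class BlockAllocationSketch {

    static final int BLOCK_SIZE = 1000; // SystemML's default block size

    // Bytes to hold the partition's input rows, assuming dense double values.
    static long inputPartitionBytes(long rowsInPartition, long cols) {
        return rowsInPartition * cols * 8L;
    }

    // Bytes for one full row of dense blocks spanning all columns, which a
    // purely sparsity-based decision allocates up front for dense data.
    static long denseBlockRowBytes(long cols) {
        return (long) BLOCK_SIZE * cols * 8L;
    }

    public static void main(String[] args) {
        long cols = 100_000;      // wide, fully dense matrix
        long rowsInPartition = 1; // worst case: a single row per partition
        long in = inputPartitionBytes(rowsInPartition, cols);
        long tmp = denseBlockRowBytes(cols);
        // prints: input: 0.8 MB, temp blocks: 800.0 MB, blow-up: 1000x
        System.out.printf("input: %.1f MB, temp blocks: %.1f MB, blow-up: %dx%n",
                in / 1e6, tmp / 1e6, tmp / in);
    }
}
{code}

One direction consistent with the description (not a confirmed fix) would be to make the dense/sparse allocation decision per partition, e.g. falling back to sparse blocks when a partition contributes far fewer than 1000 rows, rather than relying on global sparsity alone.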