[
https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706706#comment-16706706
]
Chen Lin commented on SPARK-26228:
----------------------------------
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size
exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at
org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
at
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
at
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2124)
at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1092)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.fold(RDD.scala:1086)
at
org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1155)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1131)
at
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeGramianMatrix(RowMatrix.scala:123)
at
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:345)
at
org.apache.spark.mllib.stat.correlation.PearsonCorrelation$.computeCorrelationMatrix(PearsonCorrelation.scala:49)
at
org.apache.spark.mllib.stat.correlation.Correlations$.corrMatrix(Correlation.scala:66)
at org.apache.spark.mllib.stat.Statistics$.corr(Statistics.scala:57)
> OOM issue encountered when computing Gramian matrix
> ----------------------------------------------------
>
> Key: SPARK-26228
> URL: https://issues.apache.org/jira/browse/SPARK-26228
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.3.0
> Reporter: Chen Lin
> Priority: Major
> Attachments: 1.jpeg
>
>
> {quote}/**
> * Computes the Gramian matrix `A^T A`.
> *
> * @note This cannot be computed on matrices with more than 65535 columns.
> */
> {quote}
> As the above annotation of computeGramianMatrix in RowMatrix.scala said, it
> supports computing on matrices with no more than 65535 columns.
> However, we find that it will throw OOM(Request Array Size Exceeds VM Limit)
> when computing on matrices with 16000 columns.
> The root casue seems that the TreeAggregate writes a very long buffer array
> (16000*16000*8) which exceeds jvm limit(2^31 - 1).
> Does RowMatrix really supports computing on matrices with no more than 65535
> columns?
> I doubt that computeGramianMatrix has a very serious performance issue.
> Do anyone has done some performance expriments before?
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]