[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248262#comment-14248262 ]
Joseph Tang commented on SPARK-4846:
------------------------------------
Hi Sean and Joseph,
You are right. I ran another test with the 'lazy' modifier and found that I had
misunderstood object serialization in the JVM: 'lazy' does not help here, so
I'll remove it later. It was probably the automatic increase of the partition
number in my code that helped. A small demo of what I mean follows below.
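To be concrete, here is a minimal, self-contained sketch of my misunderstanding (not Spark's actual code; the class and field names are made up): once a lazy val has been initialized, Java serialization writes out its full value like any other field, so marking the big arrays lazy does not keep them out of the serialized closure.

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

class Model extends Serializable {
  // Stand-in for a huge word-vector array; 'lazy' only delays initialization.
  lazy val vectors: Array[Float] = new Array[Float](1 << 20)
}

object LazySerDemo {
  def main(args: Array[String]): Unit = {
    val m = new Model
    m.vectors.length // force initialization, as fit() does before serializing
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    out.writeObject(m) // the entire initialized array is serialized anyway
    out.close()
  }
}
{code}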
Joseph K. Bradley's suggestion of automatically choosing minCount should work,
and it costs less memory. One potential issue, however, is that in the worst
case we may have to try many times to find the best minCount while adjusting it
step by step (sketched below).
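A hedged sketch of the step-by-step search I have in mind (hypothetical names, not an existing Spark API); my worry is exactly the while loop, which in the worst case runs many times over the counts.

{code}
object MinCountSearch {
  // Raise minCount until the surviving vocabulary is small enough to fit
  // in a serializable array. Worst case: many passes over the word counts.
  def pickMinCount(wordCounts: Map[String, Long], maxVocabSize: Int): Int = {
    var minCount = 1
    while (wordCounts.count { case (_, c) => c >= minCount } > maxVocabSize) {
      minCount += 1
    }
    minCount
  }
}
{code}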
I also think a quick fix could be to automatically increase the number of
partitions, though that lowers accuracy, since each partition trains on its own
copy of the vectors and the partial results are then merged.
Alternatively, if a Spark user sets only one of minCount and numPartitions, we
could automatically adjust the other; if the user sets neither, we could
auto-adjust numPartitions by default. What do you think? A sketch of this rule
follows below.
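A hedged sketch of the defaulting rule I am proposing (hypothetical names and placeholder heuristics, not Spark's API):

{code}
object AutoTune {
  // Placeholder heuristics; the real formulas would need tuning.
  private def autoNumPartitions(minCount: Int, corpusWords: Long): Int =
    math.max(1, (corpusWords / 100000000L).toInt) // e.g. ~100M words/partition
  private def autoMinCount(numPartitions: Int): Int =
    5 // placeholder; a real version would derive this from the vocabulary

  // None means "the user did not set this knob explicitly".
  def resolve(userMinCount: Option[Int], userNumPartitions: Option[Int],
              corpusWords: Long): (Int, Int) =
    (userMinCount, userNumPartitions) match {
      case (Some(mc), Some(np)) => (mc, np) // respect both user settings
      case (Some(mc), None)     => (mc, autoNumPartitions(mc, corpusWords))
      case (None, Some(np))     => (autoMinCount(np), np)
      case (None, None)         => (5, autoNumPartitions(5, corpusWords))
    }
}
{code}

(5 is Word2Vec's current default minCount.)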
> When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError: Requested array size exceeds VM limit"
> ---------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-4846
> URL: https://issues.apache.org/jira/browse/SPARK-4846
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.1.0
> Environment: Use Word2Vec to process a corpus (sized 3.5 GB) with one partition.
> The corpus contains about 300 million words and its vocabulary size is about 10 million.
> Reporter: Joseph Tang
> Priority: Critical
>
> Exception in thread "Driver" java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
> Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
> at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
> at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
> at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
> at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
> at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)