[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248802#comment-14248802 ]
Joseph K. Bradley commented on SPARK-4846:
------------------------------------------
Changing vectorSize sounds too aggressive to me. I'd vote for either the
simple solution (throw a nice error) or an efficient method that chooses
minCount automatically. For the latter, this might work (rough sketch after the list):
1. Try the current method. Catch errors during the collect() for the vocab and
during the array allocations. If there is no error, skip step 2.
2. If there is an error, do one pass over the data to collect stats (e.g., a
frequency histogram). Use those stats to choose a reasonable minCount, then choose
the vocab again, etc.
3. After the big array allocations, the algorithm can continue as before.
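Here is a minimal sketch of steps 1-3, not the actual Word2Vec internals: learnVocab stands in for the existing (private) vocabulary-building code, and maxVocabSize is an illustrative budget, not a Spark config. It assumes we are willing to catch the OutOfMemoryError from the first attempt and pay one extra pass only in that case.
{code:scala}
import org.apache.spark.rdd.RDD

object MinCountFallback {

  // Step 2: one pass over the data to build a frequency histogram, then pick
  // the smallest minCount whose vocabulary still fits under maxVocabSize.
  def chooseMinCount(words: RDD[String], maxVocabSize: Long): Long = {
    val histogram = words
      .map(w => (w, 1L))
      .reduceByKey(_ + _)                      // (word, frequency)
      .map { case (_, freq) => (freq, 1L) }
      .reduceByKey(_ + _)                      // (frequency, numWordsWithThatFrequency)
      .collect()
      .sortBy { case (freq, _) => -freq }      // most frequent buckets first

    // Cumulative vocab size if we keep every word with frequency >= freq.
    var kept = 0L
    val cumulative = histogram.map { case (freq, numWords) =>
      kept += numWords
      (freq, kept)
    }
    cumulative
      .takeWhile { case (_, total) => total <= maxVocabSize }
      .lastOption.map { case (freq, _) => freq }
      .getOrElse(Long.MaxValue)
  }

  // Steps 1 and 3: try the existing path first; only if the vocab collect or
  // the array allocations blow up, pick a larger minCount and rebuild the vocab.
  def fitWithFallback(words: RDD[String],
                      initialMinCount: Int,
                      learnVocab: Int => Unit): Unit = {
    try {
      learnVocab(initialMinCount)
    } catch {
      case _: OutOfMemoryError =>
        val minCount = chooseMinCount(words, maxVocabSize = 10000000L)
        learnVocab(minCount.toInt)
    }
  }
}
{code}
The extra pass is only paid on failure, so the common case keeps the current cost; the open question is whether the allocation failure is always catchable before the JVM is in a bad state.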
> When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError: Requested array size exceeds VM limit"
> ---------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-4846
> URL: https://issues.apache.org/jira/browse/SPARK-4846
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.1.0
> Environment: Use Word2Vec to process a corpus (sized 3.5 GB) with one partition.
> The corpus contains about 300 million words, and its vocabulary size is about 10 million.
> Reporter: Joseph Tang
> Priority: Critical
>
> Exception in thread "Driver" java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
> Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
> at
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
> at
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
> at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
> at
> org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)