[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248802#comment-14248802 ]

Joseph K. Bradley commented on SPARK-4846:
------------------------------------------

Changing vectorSize sounds too aggressive to me.  I'd vote for either the 
simple solution (throw a nice error) or an efficient method that chooses 
minCount automatically.  For the latter, this might work:

1. Try the current method.  Catch errors during the collect() for vocab and 
during the array allocations.  If there is no error, skip step 2.
2. If there is an error, do one pass over the data to collect stats (e.g., a 
histogram of word frequencies).  Use those stats to choose a reasonable 
minCount, then choose the vocab again, etc.  (See the sketch after this list.)
3. After the big array allocations, the algorithm can continue as before.
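
For illustration, here is a minimal Scala sketch of step 2's minCount 
selection, assuming the words are available as an RDD[String].  The names 
chooseMinCount and maxVocabSize are made up for this sketch (maxVocabSize 
stands for whatever vocabulary budget we decide fits in memory); none of 
this is existing Word2Vec API.  The idea: reduce the corpus to a small 
histogram mapping each count c to the number of distinct words that occur 
exactly c times, then walk the buckets from most frequent down until the 
budget is hit.

import org.apache.spark.SparkContext._  // pair-RDD implicits (Spark 1.x)
import org.apache.spark.rdd.RDD

// Hypothetical helper for step 2; not part of the current Word2Vec API.
def chooseMinCount(words: RDD[String], maxVocabSize: Long): Int = {
  // One pass over the raw data: per-word counts, reduced to a tiny
  // histogram where entry (c, n) means n distinct words occur exactly c times.
  val histogram = words
    .map(w => (w, 1L))
    .reduceByKey(_ + _)             // (word, count)
    .map { case (_, c) => (c, 1L) }
    .reduceByKey(_ + _)             // (count, numWordsWithThatCount)
    .collect()                      // small: one entry per distinct count
    .sortBy(-_._1)                  // most frequent buckets first

  // Include whole buckets from the top down, stopping before the
  // vocabulary would exceed the budget.
  var vocabSize = 0L
  var i = 0
  while (i < histogram.length && vocabSize + histogram(i)._2 <= maxVocabSize) {
    vocabSize += histogram(i)._2
    i += 1
  }
  // Keep words strictly more frequent than the first excluded bucket;
  // if every bucket fits, minCount = 1 keeps everything.
  if (i < histogram.length) (histogram(i)._1 + 1).toInt else 1
}

Step 1 would wrap the existing vocab collect() and array allocations in a 
try/catch (including OutOfMemoryError, which should be safe to catch in this 
particular case because "Requested array size exceeds VM limit" is thrown 
before the allocation happens), falling back to this extra pass only when 
the catch fires; step 3 then continues unchanged with the trimmed vocabulary.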


> When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError: Requested array size exceeds VM limit"
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4846
>                 URL: https://issues.apache.org/jira/browse/SPARK-4846
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.1.0
>         Environment: Use Word2Vec to process a corpus (sized 3.5 GB) with 
> one partition.
> The corpus contains about 300 million words, and its vocabulary size is 
> about 10 million.
>            Reporter: Joseph Tang
>            Priority: Critical
>
> Exception in thread "Driver" java.lang.reflect.InvocationTargetException
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
> Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>     at java.util.Arrays.copyOf(Arrays.java:2271)
>     at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>     at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>     at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
>     at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
>     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
>     at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>     at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
>     at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
>     at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>     at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
>     at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
>     at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
>     at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>     at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)


