[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248802#comment-14248802 ]
Joseph K. Bradley commented on SPARK-4846:
------------------------------------------

Changing vectorSize sounds too aggressive to me. I'd vote for either the simple solution (throw a clear error) or an efficient method that chooses minCount automatically. For the latter, this might work:

1. Try the current method, catching errors during the collect() for the vocabulary and during the large array allocations. If there is no error, skip step 2.
2. If there is an error, do one pass over the data to collect statistics (e.g., a frequency histogram). Use those statistics to choose a reasonable minCount, then build the vocabulary again, etc.
3. Once the big array allocations succeed, the algorithm can continue as before.

> When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError:
> Requested array size exceeds VM limit"
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4846
>                 URL: https://issues.apache.org/jira/browse/SPARK-4846
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.1.0
>         Environment: Use Word2Vec to process a corpus (sized 3.5 GB) with one
> partition. The corpus contains about 300 million words, and its vocabulary
> size is about 10 million.
>            Reporter: Joseph Tang
>            Priority: Critical
>
> Exception in thread "Driver" java.lang.reflect.InvocationTargetException
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
> Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
>         at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
>         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
>         at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>         at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
>         at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
>         at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
>         at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>         at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
>         at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
>         at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
>         at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>         at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
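The automatic minCount selection sketched in steps 1–3 of the comment above could look like the following. This is a minimal, self-contained Python sketch, not the actual MLlib Scala implementation: the function name `choose_min_count`, the `max_floats` default approximating the JVM's maximum array length, and the use of a plain dict of word counts are all assumptions for illustration.

```python
from collections import Counter

def choose_min_count(word_counts, vector_size, max_floats=2**31 - 16):
    """Pick the smallest minCount whose vocabulary fits the array budget.

    word_counts: dict mapping each word to its corpus frequency.
    vector_size: dimensionality of each word vector.
    max_floats:  assumed cap on total floats in one JVM array (the
                 "Requested array size exceeds VM limit" threshold).
    """
    # Largest vocabulary whose (vocab * vector_size) float array still
    # fits under the assumed JVM array-size limit.
    max_vocab = max_floats // vector_size

    # Histogram of frequencies: freq_hist[c] = number of distinct words
    # that occur exactly c times. This is the one-pass statistic from
    # step 2 of the proposal.
    freq_hist = Counter(word_counts.values())

    # Raise minCount one step at a time; each step drops the words whose
    # frequency equals the previous threshold, shrinking the vocabulary.
    vocab = len(word_counts)
    min_count = 1
    while vocab > max_vocab:
        vocab -= freq_hist.get(min_count, 0)
        min_count += 1
    return min_count
```

In a real fix this pass would run over the RDD of words (e.g., via a distributed count) only after the initial attempt fails, so the common case pays no extra cost.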
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org