[
https://issues.apache.org/jira/browse/SPARK-27069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
TAESUK KIM updated SPARK-27069:
-------------------------------
Summary: Spark(2.3.1) LDA transformation memory
error(java.lang.OutOfMemoryError at
java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:1232
(was: Spark(2.3.1) LDA transformation memory error(java.lang.OutOfMemoryError at
java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123))
> Spark(2.3.1) LDA transformation memory error(java.lang.OutOfMemoryError at
> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:1232
> ----------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-27069
> URL: https://issues.apache.org/jira/browse/SPARK-27069
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.3.2
> Environment: My environment is as follows.
> DataSet
> # Documents : about 100,000,000 --> 10,000,000 --> 1,000,000 (all fail)
> # Words : about 3553918 (cannot be changed)
> Spark environment
> # executor-memory, driver-memory : 18G --> 32G --> 64G --> 128G (all fail)
> # executor-cores, driver-cores : 3
> # spark.serializer : default and
> org.apache.spark.serializer.KryoSerializer (both fail)
> # spark.executor.memoryOverhead : 18G --> 36G (both fail)
> Java version : 1.8.0_191 (Oracle Corporation)
>
> Reporter: TAESUK KIM
> Priority: Major
>
> I trained an LDA model (feature dimension: 100; iterations: 100 or 50;
> distributed version, ml package) using Spark 2.3.2 (emr-5.18.0).
> I then want to transform a new Dataset with that model, but whenever I
> transform new data I get a memory error.
> I reduced the data size to 0.1x and then to 0.01x, but I always get the same
> error (java.lang.OutOfMemoryError at
> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)).
>
> The hugeCapacity (overflow) error occurs when the size of the array exceeds
> Integer.MAX_VALUE - 8. But I reduced the data to a small size, so I cannot
> work out why this error occurs.
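
The capacity check described above can be sketched in plain Scala. This is a simplified illustration of the logic in JDK 8's `java.io.ByteArrayOutputStream`, not a copy of it: the object and method names (`HugeCapacitySketch`, `requestedCapacity`) are hypothetical, and the sketch returns a `Left` where the JDK throws `OutOfMemoryError`. The point it tries to show is that the error depends on the byte count of one buffer overflowing an `Int`, not on how much heap is configured.

```scala
object HugeCapacitySketch {
  // The limit quoted in the report: Integer.MAX_VALUE - 8.
  val MaxArraySize: Int = Int.MaxValue - 8

  // Capacity requested when `len` bytes are appended to a buffer that already
  // holds `count` bytes. Int addition wraps around silently on overflow.
  def requestedCapacity(count: Int, len: Int): Int = count + len

  // Simplified decision made once the buffer would grow past MaxArraySize.
  // A negative request means the Int addition overflowed; the JDK throws
  // OutOfMemoryError (with no message) at that point.
  def hugeCapacity(minCapacity: Int): Either[String, Int] =
    if (minCapacity < 0) Left("OutOfMemoryError")
    else if (minCapacity > MaxArraySize) Right(Int.MaxValue)
    else Right(MaxArraySize)

  def main(args: Array[String]): Unit = {
    // One byte past the limit still fits in an Int.
    println(hugeCapacity(requestedCapacity(MaxArraySize, 1)))    // Right(2147483647)
    // A larger write wraps the Int to a negative value.
    println(hugeCapacity(requestedCapacity(MaxArraySize, 4096))) // Left(OutOfMemoryError)
  }
}
```

Under this reading, raising executor or driver memory would not be expected to change the outcome, because a single Java byte array cannot hold more than about 2 GiB regardless of heap size.
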
> I also tried changing the serializer to KryoSerializer, but I found that
> org.apache.spark.util.ClosureCleaner$.ensureSerializable always calls
> org.apache.spark.serializer.JavaSerializationStream, even though I registered
> the Kryo classes.
>
> Is there anything I can do?
>
> Below is the code:
>
> {{import org.apache.spark.ml.clustering.DistributedLDAModel}}
> {{import org.apache.spark.ml.feature.CountVectorizerModel}}
> {{val countvModel = CountVectorizerModel.load("s3://~/")}}
> {{val ldaModel = DistributedLDAModel.load("s3://~/")}}
> {{val transformeddata = countvModel.transform(inputData)}}
> {{  .select("productid", "itemid", "ptkString", "features")}}
> {{var featureldaDF = ldaModel.transform(transformeddata)}}
> {{  .select("productid", "itemid", "topicDistribution", "ptkString")}}
> {{  .toDF("productid", "itemid", "features", "ptkString")}}
> {{featureldaDF = featureldaDF.persist // this is line 328}}
>
> Other tests
> # Java options : UseParallelGC, UseG1GC (all fail)
> Below is the log:
> {{19/03/05 20:59:03 ERROR ApplicationMaster: User class threw exception: java.lang.OutOfMemoryError
> java.lang.OutOfMemoryError
>   at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>   at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>   at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
>   at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>   at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
>   at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>   at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
>   at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:850)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:849)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>   at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:608)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.columnar.InMemoryRelation.buildBuffers(InMemoryRelation.scala:107)
>   at org.apache.spark.sql.execution.columnar.InMemoryRelation.<init>(InMemoryRelation.scala:102)
>   at org.apache.spark.sql.execution.columnar.InMemoryRelation$.apply(InMemoryRelation.scala:43)
>   at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:97)
>   at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:67)
>   at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:91)
>   at org.apache.spark.sql.Dataset.persist(Dataset.scala:2907)
>   at coupang.cs.predictforxgboost.App$.main(App.scala:328)
>   at coupang.cs.predictforxgboost.App.main(App.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)
> }}
>
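
The trace above fails inside `JavaSerializerInstance.serialize`, called from `ClosureCleaner.ensureSerializable`, which writes the closure into a single in-memory byte buffer. A rough size estimate may help explain why shrinking the document count made no difference. The sketch below rests on two assumptions that the report does not confirm: that the serialized object contains a dense vocabulary-by-topic matrix of doubles, and that "feature dimension : 100" is the number of topics. The names (`LdaPayloadEstimate`, `fitsInOneByteArray`) are hypothetical.

```scala
object LdaPayloadEstimate {
  // "Word" count from the Environment section of the report.
  val vocabSize: Long = 3553918L
  // ASSUMPTION: the report's "feature dimension : 100" is the topic count k.
  val numTopics: Long = 100L
  // A JVM double occupies 8 bytes.
  val bytesPerDouble: Long = 8L

  // ASSUMPTION: the closure carries one dense vocabSize x numTopics matrix.
  val denseMatrixBytes: Long = vocabSize * numTopics * bytesPerDouble

  // Largest byte array the JDK buffer aims for: Integer.MAX_VALUE - 8.
  val maxArrayBytes: Long = Int.MaxValue.toLong - 8L

  def fitsInOneByteArray(bytes: Long): Boolean = bytes <= maxArrayBytes

  def main(args: Array[String]): Unit = {
    println(s"dense matrix: $denseMatrixBytes bytes") // 2843134400
    println(s"array limit:  $maxArrayBytes bytes")    // 2147483639
    println(s"fits in one array: ${fitsInOneByteArray(denseMatrixBytes)}") // false
  }
}
```

If these assumptions hold, the payload size depends on vocabulary size and topic count only, which would be consistent with the error persisting at every document count and memory setting tried.
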