[ https://issues.apache.org/jira/browse/SPARK-35848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384353#comment-17384353 ]

Sean R. Owen commented on SPARK-35848:
--------------------------------------

I don't know that the serializer would help - this isn't hitting a 2GB limit 
(although it nearly does); that would be a different error. This is going to 
need a lot of memory no matter what, at this scale - almost 2GB per copy of 
the filter. How much memory do you have?
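
Back-of-the-envelope on that "almost 2GB", using the usual optimal-bits
sizing m = -n * ln(p) / (ln 2)^2 (the same formula Spark's util.sketch
filter is based on):

{code:scala}
// Optimal Bloom filter sizing: m = -n * ln(p) / (ln 2)^2
val n = 2000000000L  // expectedNumItems: 2 billion
val p = 0.03         // fpp
val bits  = (-n * math.log(p) / (math.log(2) * math.log(2))).toLong
val bytes = bits / 8
println(s"$bits bits ~= ${bytes / 1e9} GB")
// ~14.6 billion bits ~= 1.82 GB for the bit array alone
{code}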

Yeah, I think a similar change could be applied here (something along the 
lines of the sketch below) that would save some memory, but in the end it 
wouldn't keep this from running out of memory with a big enough filter and 
little enough memory. Let me try a pull request.
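
A minimal sketch of the idea (not the actual patch): create the filter
lazily on the executors and merge the per-partition results, instead of
shipping a ~2GB zero value from the driver inside the task closure:

{code:scala}
import org.apache.spark.util.sketch.BloomFilter

// Sketch only, not Spark's actual implementation: each task allocates its
// own filter, so nothing huge is serialized with the closure; only the
// per-partition results are shipped back and merged.
val merged = df.select("Id").rdd
  .mapPartitions { iter =>
    val bf = BloomFilter.create(expectedNumItems, fpp)
    iter.foreach(row => bf.putLong(row.getInt(0).toLong))
    Iterator.single(bf)
  }
  .reduce((a, b) => a.mergeInPlace(b))
{code}

Each task still materializes the full ~1.8GB bit array, and merging still
needs two copies in memory at once, so this only removes the serialization
copy; it doesn't change how much memory the filter fundamentally needs.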

> Spark Bloom Filter throws OutOfMemoryError
> ------------------------------------------
>
>                 Key: SPARK-35848
>                 URL: https://issues.apache.org/jira/browse/SPARK-35848
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.0, 3.0.0
>            Reporter: Sai Polisetty
>            Priority: Minor
>
> When the Bloom filter stat function is invoked on a large DataFrame that 
> requires a BitArray of size >2GB, it results in a 
> java.lang.OutOfMemoryError. As mentioned in a similar bug, this is due to 
> the zero value passed to treeAggregate: irrespective of the 
> spark.serializer setting, the zero value is serialized with JavaSerializer, 
> which has a hard limit of 2GB. A solution similar to SPARK-26228, combined 
> with setting spark.serializer to KryoSerializer, can avoid this error.
>  
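> For illustration, the Kryo setting referred to above (on its own it does 
> not help today, since the zero value still goes through JavaSerializer; it 
> would only take effect with a SPARK-26228-style fix):
> {code:scala}
> // Suggested workaround config, assuming a fix routes the treeAggregate
> // zero value through the serializer configured here:
> val spark = org.apache.spark.sql.SparkSession.builder()
>   .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   .getOrCreate()
> {code}
> 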
> Steps to reproduce:
> {code:scala}
> val df = List.range(0, 10).toDF("Id")
> val expectedNumItems = 2000000000L // 2 billion
> val fpp = 0.03
> val bf = df.stat.bloomFilter("Id", expectedNumItems, fpp)
> {code}
> Stack trace:
> {code}
> java.lang.OutOfMemoryError
>   at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>   at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>   at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
>   at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>   at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>   at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>   at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:413)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2604)
>   at org.apache.spark.rdd.PairRDDFunctions.$anonfun$combineByKeyWithClassTag$1(PairRDDFunctions.scala:86)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:395)
>   at org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:75)
>   at org.apache.spark.rdd.PairRDDFunctions.$anonfun$foldByKey$1(PairRDDFunctions.scala:218)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:395)
>   at org.apache.spark.rdd.PairRDDFunctions.foldByKey(PairRDDFunctions.scala:207)
>   at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$1(RDD.scala:1224)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:395)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1203)
>   at org.apache.spark.sql.DataFrameStatFunctions.buildBloomFilter(DataFrameStatFunctions.scala:602)
>   at org.apache.spark.sql.DataFrameStatFunctions.bloomFilter(DataFrameStatFunctions.scala:541)
> {code}


