[ https://issues.apache.org/jira/browse/SPARK-35848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384540#comment-17384540 ]

Sean R. Owen commented on SPARK-35848:
--------------------------------------

I think you'd get "Requested array size exceeds VM limit" if you were actually 
exceeding the 2GB limit; it should be close in this case (~1.8GB serialized, 
going by the size of the bit array it would allocate for 2B elements at 0.03 
FPP), and I suppose it could somehow end up bigger.
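
For reference, a quick back-of-the-envelope check of that figure (assuming the 
standard optimal-bits formula m = -n * ln(p) / (ln 2)^2, which I believe is what 
the sketch package uses):

    // Rough check of the ~1.8GB estimate for n = 2e9 items at fpp = 0.03
    val n = 2e9
    val p = 0.03
    val bits  = -n * math.log(p) / (math.log(2) * math.log(2)) // ~1.46e10 bits
    val bytes = bits / 8                                       // ~1.82e9 bytes, i.e. ~1.8GB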

I don't think there is a general fix for the OOM, as you can make this allocate 
100GB of bloom filter with enough elements, for example. The change I'm copying 
over just avoids one instance of the copying, not all of them. I haven't thought 
it through, but if the copy that's avoided (copying 'zero' from the driver to 
the workers) is the only thing that goes through the JavaSerializer, it might at 
least avoid the 2GB limit; I'm not sure.

It's an improvement in any event, one that's at least easy to make.
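
To make the idea concrete, here is a minimal sketch (not the actual patch; 
buildFilterLazily is just an illustration): allocate the filter lazily in seqOp 
on the executors instead of passing a fully built BloomFilter as treeAggregate's 
zero value, so the multi-GB zero never has to be serialized into the task 
closure.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.util.sketch.BloomFilter

    // Sketch only: the zero value shipped with the closure is null; the real
    // (possibly multi-GB) filter is created on the executors inside seqOp.
    def buildFilterLazily(ids: RDD[Long], expectedNumItems: Long, fpp: Double): BloomFilter =
      ids.treeAggregate(null.asInstanceOf[BloomFilter])(
        // seqOp: lazily allocate the filter, then add each element
        (filter, id) => {
          val f = if (filter == null) BloomFilter.create(expectedNumItems, fpp) else filter
          f.putLong(id)
          f
        },
        // combOp: merge partial filters, tolerating empty (null) partitions
        (f1, f2) =>
          if (f1 == null) f2
          else if (f2 == null) f1
          else { f1.mergeInPlace(f2); f1 }
      )

Whether this alone sidesteps the 2GB limit still depends on what the result and 
shuffle paths use to serialize the merged filters.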

> Spark Bloom Filter, others using treeAggregate can throw OutOfMemoryError
> -------------------------------------------------------------------------
>
>                 Key: SPARK-35848
>                 URL: https://issues.apache.org/jira/browse/SPARK-35848
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, Spark Core
>    Affects Versions: 3.0.3, 3.1.2, 3.2.0
>            Reporter: Sai Polisetty
>            Assignee: Sean R. Owen
>            Priority: Minor
>
> When the Bloom filter stat function is invoked on a large dataframe that 
> requires a BitArray of size >2GB, it fails with a java.lang.OutOfMemoryError. 
> As mentioned in a similar bug, this is due to the zero value passed to 
> treeAggregate: irrespective of the spark.serializer setting, it is serialized 
> with JavaSerializer, which has a hard limit of 2GB. Using a solution similar 
> to SPARK-26228 together with setting spark.serializer to KryoSerializer can 
> avoid this error.
>  
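> A hedged workaround sketch along the same lines (manualBloomFilter and the 
> cast to long are illustrative assumptions, not the change discussed above): 
> build one filter per partition on the executors and merge them with 
> treeReduce, so no multi-GB zero value has to pass through the closure 
> serializer.
> {{import org.apache.spark.util.sketch.BloomFilter}}
> {{def manualBloomFilter(df: org.apache.spark.sql.DataFrame, col: String, n: Long, fpp: Double): BloomFilter =}}
> {{  df.select(df(col).cast("long")).rdd.mapPartitions { rows =>}}
> {{    val filter = BloomFilter.create(n, fpp) // allocated on the executor, not the driver}}
> {{    rows.foreach(r => filter.putLong(r.getLong(0)))}}
> {{    Iterator(filter)}}
> {{  }.treeReduce((a, b) => { a.mergeInPlace(b); a })}}
> 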
> Steps to reproduce:
> {{val df = List.range(0, 10).toDF("Id")}}
> {{val expectedNumItems = 2000000000L // 2 billion}}
> {{val fpp = 0.03}}
> {{val bf = df.stat.bloomFilter("Id", expectedNumItems, fpp)}}
> Stack trace:
> java.lang.OutOfMemoryError
>   at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>   at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>   at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
>   at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>   at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>   at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>   at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:413)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2604)
>   at org.apache.spark.rdd.PairRDDFunctions.$anonfun$combineByKeyWithClassTag$1(PairRDDFunctions.scala:86)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:395)
>   at org.apache.spark.rdd.PairRDDFunctions.combineByKeyWithClassTag(PairRDDFunctions.scala:75)
>   at org.apache.spark.rdd.PairRDDFunctions.$anonfun$foldByKey$1(PairRDDFunctions.scala:218)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:395)
>   at org.apache.spark.rdd.PairRDDFunctions.foldByKey(PairRDDFunctions.scala:207)
>   at org.apache.spark.rdd.RDD.$anonfun$treeAggregate$1(RDD.scala:1224)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:395)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1203)
>   at org.apache.spark.sql.DataFrameStatFunctions.buildBloomFilter(DataFrameStatFunctions.scala:602)
>   at org.apache.spark.sql.DataFrameStatFunctions.bloomFilter(DataFrameStatFunctions.scala:541)



