[ https://issues.apache.org/jira/browse/SPARK-21039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045650#comment-16045650 ]
Apache Spark commented on SPARK-21039:
--------------------------------------

User 'rishabhbhardwaj' has created a pull request for this issue:
https://github.com/apache/spark/pull/18263

> Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter
> --------------------------------------------------------------------
>
>                 Key: SPARK-21039
>                 URL: https://issues.apache.org/jira/browse/SPARK-21039
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.1.1
>            Reporter: Lovasoa
>
> Currently, DataFrame.stat.bloomFilter uses RDD.aggregate, which means that
> the bloom filters received for each partition of data are merged in the
> driver. The cost of this operation can be very high if the bloom filters are
> large. It would be nice if it used RDD.treeAggregate instead, in order to
> parallelize the operation of merging the bloom filters.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
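For readers unfamiliar with the distinction the issue draws, the following is a rough, Spark-free sketch of why tree-style merging reduces work on the driver. Everything in it is an assumption made for illustration: the toy integer-bitmask "filter", the helper names (build_filter, flat_aggregate, tree_aggregate), and the fan_in parameter. It is not Spark's implementation; the real RDD.treeAggregate takes a depth argument and decides the grouping itself.

```python
from functools import reduce


def build_filter(partition, num_bits=64):
    """Toy stand-in for a Bloom filter: an int bitmask, one hash per item."""
    bits = 0
    for item in partition:
        bits |= 1 << (hash(item) % num_bits)
    return bits


def merge(a, b):
    """Merging two filters of equal size is a bitwise OR."""
    return a | b


def flat_aggregate(partitions):
    """Models the RDD.aggregate pattern: every per-partition filter
    is shipped to the driver and merged there, one by one."""
    partials = [build_filter(p) for p in partitions]
    result = 0
    driver_merges = 0
    for partial in partials:
        result = merge(result, partial)
        driver_merges += 1
    return result, driver_merges


def tree_aggregate(partitions, fan_in=2):
    """Models the idea behind RDD.treeAggregate: partial results are
    merged in groups over several rounds (on executors in Spark), so
    only a handful of filters reach the driver. fan_in is hypothetical."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = [build_filter(p) for p in partitions]
    while len(level) > fan_in:
        # One intermediate round: merge neighbouring groups of fan_in filters.
        level = [
            reduce(merge, level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    driver_merges = len(level)  # what is left for the driver to combine
    return reduce(merge, level, 0), driver_merges


if __name__ == "__main__":
    data = [list(range(i * 10, i * 10 + 10)) for i in range(8)]
    print(flat_aggregate(data))
    print(tree_aggregate(data, fan_in=2))
```

Both functions produce the same merged filter; the difference is only in how many filters the final step has to combine. With large filters and many partitions, that final step is the cost the issue is concerned about.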