[ https://issues.apache.org/jira/browse/SPARK-21039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-21039:
------------------------------------

    Assignee: Apache Spark

> Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter
> --------------------------------------------------------------------
>
>                 Key: SPARK-21039
>                 URL: https://issues.apache.org/jira/browse/SPARK-21039
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.1.1
>            Reporter: Lovasoa
>            Assignee: Apache Spark
>
> Currently, DataFrame.stat.bloomFilter uses RDD.aggregate, which means that
> the bloom filters received for each partition of data are merged in the
> driver. The cost of this operation can be very high if the bloom filters are
> large. It would be nice if it used RDD.treeAggregate instead, in order to
> parallelize the operation of merging the bloom filters.
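
The difference the issue describes can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: the bloom filter is a toy integer bitset, and NUM_BITS, NUM_HASHES, fan_in, and every function name are hypothetical rather than Spark's API. It does not reproduce Spark's scheduling or the real RDD.treeAggregate code; it only counts how many per-partition filters the final "driver" step has to merge in each scheme.

```python
from functools import reduce

NUM_BITS = 1 << 16   # hypothetical filter size in bits
NUM_HASHES = 3       # hypothetical number of hash functions


def bit_positions(item):
    # Derive NUM_HASHES bit positions from Python's built-in hash (illustrative only).
    return [hash((seed, item)) % NUM_BITS for seed in range(NUM_HASHES)]


def build_filter(partition):
    # Per-partition step: set the bits for every item in the partition.
    bits = 0
    for item in partition:
        for pos in bit_positions(item):
            bits |= 1 << pos
    return bits


def merge(a, b):
    # Merging two bloom filters of the same shape is a bitwise OR.
    return a | b


def might_contain(bits, item):
    return all((bits >> pos) & 1 for pos in bit_positions(item))


def flat_aggregate(filters):
    # aggregate-style: every partition's filter is merged in one place.
    # Returns (merged filter, number of filters merged at the final step).
    return reduce(merge, filters, 0), len(filters)


def tree_aggregate(filters, fan_in=4):
    # treeAggregate-style: merge in groups of fan_in, level by level, so the
    # final step only sees a handful of partially merged filters.
    level = list(filters)
    while len(level) > fan_in:
        level = [reduce(merge, level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return reduce(merge, level, 0), len(level)


if __name__ == "__main__":
    partitions = [range(p * 100, (p + 1) * 100) for p in range(64)]
    filters = [build_filter(p) for p in partitions]
    flat, flat_inputs = flat_aggregate(filters)
    tree, tree_inputs = tree_aggregate(filters)
    print("same result:", flat == tree)
    print("filters merged at final step:", flat_inputs, "vs", tree_inputs)
```

Both schemes produce the same filter, since OR is associative and commutative; what changes is where the merging work happens. In the intermediate levels of the tree version, each group merge is independent of the others, which is the part that a cluster could run in parallel.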