[ https://issues.apache.org/jira/browse/SPARK-32268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166127#comment-17166127 ]

Yuming Wang commented on SPARK-32268:
-------------------------------------

A simple benchmark for the Bloom filter.
{code:scala}
import org.apache.spark.benchmark.Benchmark
import org.apache.spark.util.sketch.BloomFilter

val N = 100000000
val items = 1 to 200000
val path = "/tmp/spark/bloomfilter"
// Write N sequential ids to Parquet; every case below scans this data.
spark.range(N).write.mode("overwrite").parquet(path)

val benchmark = new Benchmark(
  s"Benchmark bloom filter with ${items.size} items", valuesPerIteration = N, minNumIters = 5)

// Case 1: row-level filter through the Bloom filter only.
benchmark.addCase("bloom filter") { _ =>
  val br = BloomFilter.create(items.size)
  items.foreach(br.putLong(_))
  spark.read.parquet(path)
    .filter(r => br.mightContainLong(r.getLong(0)))
    .write.format("noop").mode("overwrite").save()
}

// Case 2: SQL IN predicate listing every item as a literal.
benchmark.addCase("in") { _ =>
  spark.read.parquet(path)
    .filter(s"id in (${items.mkString(", ")})")
    .write.format("noop").mode("overwrite").save()
}

// Case 3: range check only, using the bounds of the item set.
benchmark.addCase("binary comparison") { _ =>
  spark.read.parquet(path)
    .filter(r => r.getLong(0) <= items.last && r.getLong(0) >= 1)
    .write.format("noop").mode("overwrite").save()
}

// Case 4: range check first, then the Bloom filter.
benchmark.addCase("binary comparison and bloom filter") { _ =>
  val br = BloomFilter.create(items.size)
  items.foreach(br.putLong(_))
  spark.read.parquet(path)
    .filter(r => r.getLong(0) <= items.last && r.getLong(0) >= 1 && br.mightContainLong(r.getLong(0)))
    .write.format("noop").mode("overwrite").save()
}

benchmark.run()
{code}
{noformat}
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.5
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Benchmark bloom filter with 200000 items:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
bloom filter                                        4057           4139          75         24.6          40.6       1.0X
in                                                 47276          51586         953          2.1         472.8       0.1X
binary comparison                                   1911           2086         231         52.3          19.1       2.1X
binary comparison and bloom filter                  1959           2096         160         51.0          19.6       2.1X
{noformat}

> Bloom Filter Join
> -----------------
>
>                 Key: SPARK-32268
>                 URL: https://issues.apache.org/jira/browse/SPARK-32268
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>         Attachments: q16-bloom-filter.jpg, q16-default.jpg
>
>
> We can improve the performance of some joins by pre-filtering one side of a
> join using a Bloom filter and an IN predicate generated from the values on the
> other side of the join.
> For example: [tpcds/q16.sql|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q16.sql].
> [Before this optimization|https://issues.apache.org/jira/secure/attachment/13007418/q16-default.jpg].
> [After this optimization|https://issues.apache.org/jira/secure/attachment/13007416/q16-bloom-filter.jpg].
> *Query Performance Benchmarks: TPC-DS Performance Evaluation*
> Our setup for running the TPC-DS benchmark was as follows: TPC-DS 5T and
> partitioned Parquet tables
>
> |Query|Default (Seconds)|Bloom Filter Join Enabled (Seconds)|
> |tpcds q16|84|46|
> |tpcds q36|29|21|
> |tpcds q57|39|28|
> |tpcds q94|42|34|
> |tpcds q95|306|288|
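
The pre-filtering idea from the issue description can be tried by hand with the public DataFrame API. The snippet below is only a rough sketch under assumptions, not the optimizer rule proposed in SPARK-32268: it assumes a spark-shell session (so `spark` already exists), the `large` and `small` DataFrames and their sizes are made up for illustration, and the 3% false-positive rate is an arbitrary choice. It builds a Bloom filter over the join keys of the small side, broadcasts it, and drops non-matching rows from the large side before the join.

```scala
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical inputs: a large side and a small side that share the join key "id".
val large = spark.range(0L, 1000000L).toDF("id")
val small = spark.range(0L, 1000L).toDF("id")

// Build a Bloom filter over the small side's join keys.
// 1000 expected items and a 3% false-positive rate are illustrative values.
val bf = small.stat.bloomFilter("id", 1000L, 0.03)
val bfBroadcast = spark.sparkContext.broadcast(bf)

// Keep only rows of the large side whose key might be present on the small side.
val mightContain = udf((id: Long) => bfBroadcast.value.mightContainLong(id))
val preFiltered = large.filter(mightContain(col("id")))

// A Bloom filter has no false negatives, so the join result is unchanged;
// false positives that pass the pre-filter are removed by the join itself.
val joined = preFiltered.join(small, "id")
```

Whether this pays off depends on how selective the filter is and on what it costs to build. In the benchmark above, a plain range check was already about twice as fast as the Bloom filter alone.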