[ https://issues.apache.org/jira/browse/SPARK-47547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17952695#comment-17952695 ]
Ish Nagy commented on SPARK-47547:
----------------------------------

Hi all, I'm working on [PR#50933|https://github.com/apache/spark/pull/50933], which should address the 32-bit index truncation issue without completely replacing the currently used MurMur3 implementation.

> observed false positive rate in bloom filter is greater than expected for large n
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-47547
>                 URL: https://issues.apache.org/jira/browse/SPARK-47547
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.0
>            Reporter: Nathan Conroy
>            Priority: Minor
>              Labels: pull-request-available
>
> When creating a bloom filter out of a large number of elements (>400 million or so) with an fpp (false positive rate) of 1% in Spark, the observed false positive rate appears to be much higher, as much as 20%.
> This is demonstrated below in this spark shell:
> {noformat}
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.5.0-amzn-0
>       /_/
>
> Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 17.0.10)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> import java.security.MessageDigest
> import java.security.MessageDigest
>
> scala> import scala.util.Random
> import scala.util.Random
>
> scala> import org.apache.spark.util.sketch.BloomFilter
> import org.apache.spark.util.sketch.BloomFilter
>
> scala>
> scala> // Function to generate a random SHA1 hash
> scala> def generateRandomSha1(): String = {
>      |   val randomString = Random.alphanumeric.take(20).mkString
>      |   val sha1 = MessageDigest.getInstance("SHA-1")
>      |   sha1.update(randomString.getBytes("UTF-8"))
>      |   val digest = sha1.digest
>      |   digest.map("%02x".format(_)).mkString
>      | }
> generateRandomSha1: ()String
>
> scala>
> scala> // Generate a DataFrame with 500 million rows of random SHA1 hashes
> scala> val df = spark.range(500000000).map(_ => generateRandomSha1()).toDF("Hash")
> df: org.apache.spark.sql.DataFrame = [Hash: string]
>
> scala> // Create a bloom filter out of this collection of strings.
> scala> val bloom_filter = df.stat.bloomFilter("Hash", 500000000, 0.01)
> bloom_filter: org.apache.spark.util.sketch.BloomFilter = org.apache.spark.util.sketch.BloomFilterImpl@a14c0ba9
>
> scala> // Generate another 10,000 random hashes
> scala> val random_sha1s = List.fill(10000)(generateRandomSha1())
> random_sha1s: List[String] = List(f3cbfd9bd836ea917ebc0dfc5330135cfde322a3,
>   4bff8d58799e517a1ba78236db9b52353dd39b56, 775bdd9d138a79eeae7308617f5c0d1d0e1c1697,
>   abbd761b7768f3cbadbffc0c7947185856c4943d, 343692fe61c552f73ad6bc2d2d3072cc456da1db,
>   faf4430055c528c9a00a46e9fae7dc25047ffaf3, 255b5d56c39bfba861647fff67704e6bc758d683,
>   dae8e0910a368f034958ae232aa5f5285486a8ac, 3680dbd34437ca661592a7e4d39782c9c77fb4ba,
>   f5b43f7a77c9d9ea28101a1848d8b1a1c0a65b82, 5bda825102026bc0da731dc84d56a499ccff0fe1,
>   158d7b3ce949422de421d5e110e3f6903af4f8e1, 2efcae5cb10273a0f5e89ae34fa3156238ab0555,
>   8d241012d42097f80f30e8ead227d75ab77086d2, 307495c98ae5f25026b91e60cf51d4f9f1ad7f4b,
>   8fc2f55563ab67d4ec87ff7b04a4a01e821814a3, b413572d14ee16c6c575ca3472adff62a8cbfa3d,
>   9219233b0e8afe57d7d5cb6...
> scala> // Check how many of these random hashes return a positive result when passed into mightContain
> scala> random_sha1s.map(c => bloom_filter.mightContain(c)).count(_ == true)
> res0: Int = 2153
> {noformat}
> I believe this is the result of the bloom filter implementation using 32-bit hashes. Since the maximum value that can be returned by the k hash functions is ~2.14 billion (the maximum int value in Java), bloom filters with m > ~2.14 billion bits have degraded performance because no bits at indices greater than ~2.14 billion are ever set or tested.
> This was a known bug in Guava that was fixed several years ago, but it looks like the fix was never ported to Spark. See [https://github.com/google/guava/issues/1119]
> Of course, using a different hash function strategy would break existing uses of this code, so we should tread with caution here.
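
For reference, standard bloom filter sizing arithmetic predicts the degradation seen above. The Scala sketch below is standalone math, not Spark's BloomFilterImpl code: it estimates the optimal bit count for n = 500 million and fpp = 0.01 using the textbook formulas, then the effective false positive rate if bit indices can never exceed Int.MaxValue. The ~22% it predicts is close to the 2153/10000 (~21.5%) observed in the repro.

{noformat}
// Standalone sketch (not Spark's implementation): standard bloom filter math
// showing why a 32-bit index space degrades the observed false positive rate.
object BloomSizingSketch {
  def main(args: Array[String]): Unit = {
    val n = 500000000L   // number of inserted items
    val p = 0.01         // requested false positive probability

    // Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    val m = math.ceil(-n * math.log(p) / (math.log(2) * math.log(2))).toLong
    // Optimal number of hash functions: k = round(m / n * ln 2)
    val k = math.max(1L, math.round(m.toDouble / n * math.log(2)))

    println(s"optimal bits m       = $m")                           // ~4.79 billion
    println(s"Int.MaxValue         = ${Int.MaxValue}")               // ~2.15 billion
    println(s"addressable fraction = ${Int.MaxValue.toDouble / m}")  // ~0.45

    // If every bit index is produced as a non-negative Int, only the first
    // Int.MaxValue bits can ever be set, so the filter behaves as if it had
    // mEff bits instead of m.
    val mEff = math.min(m, Int.MaxValue.toLong).toDouble
    val effectiveFpp = math.pow(1.0 - math.exp(-k * n / mEff), k.toDouble)
    println(s"k = $k, effective fpp ~ $effectiveFpp")                 // ~0.22 vs. requested 0.01
  }
}
{noformat}

In other words, a filter sized at roughly 4.8 billion bits effectively shrinks to about 2.1 billion usable bits, which is why a requested fpp of 1% comes out at roughly 20% in practice.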
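For context on the Guava fix referenced above: issue #1119 was resolved with a strategy that derives each bit index from the two 64-bit halves of a 128-bit hash and keeps the combined value as a long, so indices can reach the whole bit array. The sketch below only illustrates that index-derivation idea; hash1 and hash2 are placeholder stand-ins for the two halves of a 128-bit MurMur3 hash of the item, and this is neither Spark's current code nor necessarily the approach taken in PR#50933.

{noformat}
// Illustration only: double hashing with a Long combined hash, so computed
// indices can exceed Int.MaxValue. hash1 and hash2 are placeholders for the
// two 64-bit halves of a 128-bit hash of the inserted item.
def bitIndices(hash1: Long, hash2: Long, numHashFunctions: Int, bitSize: Long): Seq[Long] = {
  var combined = hash1
  (0 until numHashFunctions).map { _ =>
    // Clear the sign bit so the modulo yields a valid non-negative index.
    val index = (combined & Long.MaxValue) % bitSize
    combined += hash2   // next index uses h1 + i * h2
    index
  }
}

// Example: with a bit size around 4.8 billion, indices above Int.MaxValue are reachable.
// bitIndices(0x0123456789abcdefL, 0x0fedcba987654321L, 7, 4800000000L)
{noformat}

The key point is that the index arithmetic stays in Long end to end. Any change of this kind alters which bits get set, so filters built with the current 32-bit behavior would no longer match, which is the compatibility concern the description raises.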