Nathan Conroy created SPARK-47547:
-------------------------------------

             Summary: observed false positive rate in bloom filter is greater 
than expected for large n
                 Key: SPARK-47547
                 URL: https://issues.apache.org/jira/browse/SPARK-47547
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.5.0
            Reporter: Nathan Conroy


When creating a bloom filter out of a large number of elements (>400 million or 
so) with an fpp (false positive rate) of 1% in Spark, the observed false 
positive rate appears to be much higher, as much as 20%.

This is demonstrated below in this spark shell:


{noformat}
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.0-amzn-0
      /_/
         
Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 17.0.10)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import java.security.MessageDigest
import java.security.MessageDigest

scala> import scala.util.Random
import scala.util.Random

scala> import org.apache.spark.util.sketch.BloomFilter
import org.apache.spark.util.sketch.BloomFilter

scala> 

scala> // Function to generate a random SHA1 hash

scala> def generateRandomSha1(): String = {
     |   val randomString = Random.alphanumeric.take(20).mkString
     |   val sha1 = MessageDigest.getInstance("SHA-1")
     |   sha1.update(randomString.getBytes("UTF-8"))
     |   val digest = sha1.digest
     |   digest.map("%02x".format(_)).mkString
     | }
generateRandomSha1: ()String

scala> 

scala> // Generate a DataFrame with 500 million rows of random SHA1 hashes

scala> val df = spark.range(500000000).map(_ => 
generateRandomSha1()).toDF("Hash")
df: org.apache.spark.sql.DataFrame = [Hash: string]

scala> // Create a bloom filter out of this collection of strings.

scala> val bloom_filter = df.stat.bloomFilter("Hash", 500000000, 0.01)
bloom_filter: org.apache.spark.util.sketch.BloomFilter = 
org.apache.spark.util.sketch.BloomFilterImpl@a14c0ba9

scala> // Generate another 10,000 random hashes

scala> val random_sha1s = List.fill(10000)(generateRandomSha1())
random_sha1s: List[String] = List(f3cbfd9bd836ea917ebc0dfc5330135cfde322a3, 
4bff8d58799e517a1ba78236db9b52353dd39b56, 
775bdd9d138a79eeae7308617f5c0d1d0e1c1697, 
abbd761b7768f3cbadbffc0c7947185856c4943d, 
343692fe61c552f73ad6bc2d2d3072cc456da1db, 
faf4430055c528c9a00a46e9fae7dc25047ffaf3, 
255b5d56c39bfba861647fff67704e6bc758d683, 
dae8e0910a368f034958ae232aa5f5285486a8ac, 
3680dbd34437ca661592a7e4d39782c9c77fb4ba, 
f5b43f7a77c9d9ea28101a1848d8b1a1c0a65b82, 
5bda825102026bc0da731dc84d56a499ccff0fe1, 
158d7b3ce949422de421d5e110e3f6903af4f8e1, 
2efcae5cb10273a0f5e89ae34fa3156238ab0555, 
8d241012d42097f80f30e8ead227d75ab77086d2, 
307495c98ae5f25026b91e60cf51d4f9f1ad7f4b, 
8fc2f55563ab67d4ec87ff7b04a4a01e821814a3, 
b413572d14ee16c6c575ca3472adff62a8cbfa3d, 9219233b0e8afe57d7d5cb6...

scala> // Check how many of these random hashes return a positive result when 
passed into mightContain

scala> random_sha1s.map(c => bloom_filter.mightContain(c)).count(_ == true)
res0: Int = 2153 {noformat}
I believe this is the result of the bloom filter implementation using 32bit 
hashes. Since the maximum value that can be returned by the k hash functions is 
~2.14 billion (max integer value in Java), bloom filters with m > ~2.14 billion 
have degraded performance resulting from not using any bits at indices greater 
than ~2.14 billion. 

This was a known bug in Guava that was fixed several years ago, but it looks 
like the fix was never ported to Spark. See 
[https://github.com/google/guava/issues/1119]

Of course, using a different hash function strategy would break existing uses 
of this code, so we should tread with caution here. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to