Nathan Conroy created SPARK-47547:
-------------------------------------
Summary: observed false positive rate in bloom filter is greater
than expected for large n
Key: SPARK-47547
URL: https://issues.apache.org/jira/browse/SPARK-47547
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.5.0
Reporter: Nathan Conroy
When creating a bloom filter out of a large number of elements (>400 million or
so) with an fpp (false positive rate) of 1% in Spark, the observed false
positive rate appears to be much higher, as much as 20%.
This is demonstrated below in this spark shell:
{noformat}
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.5.0-amzn-0
/_/
Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 17.0.10)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import java.security.MessageDigest
import java.security.MessageDigest
scala> import scala.util.Random
import scala.util.Random
scala> import org.apache.spark.util.sketch.BloomFilter
import org.apache.spark.util.sketch.BloomFilter
scala>
scala> // Function to generate a random SHA1 hash
scala> def generateRandomSha1(): String = {
| val randomString = Random.alphanumeric.take(20).mkString
| val sha1 = MessageDigest.getInstance("SHA-1")
| sha1.update(randomString.getBytes("UTF-8"))
| val digest = sha1.digest
| digest.map("%02x".format(_)).mkString
| }
generateRandomSha1: ()String
scala>
scala> // Generate a DataFrame with 500 million rows of random SHA1 hashes
scala> val df = spark.range(500000000).map(_ =>
generateRandomSha1()).toDF("Hash")
df: org.apache.spark.sql.DataFrame = [Hash: string]
scala> // Create a bloom filter out of this collection of strings.
scala> val bloom_filter = df.stat.bloomFilter("Hash", 500000000, 0.01)
bloom_filter: org.apache.spark.util.sketch.BloomFilter =
org.apache.spark.util.sketch.BloomFilterImpl@a14c0ba9
scala> // Generate another 10,000 random hashes
scala> val random_sha1s = List.fill(10000)(generateRandomSha1())
random_sha1s: List[String] = List(f3cbfd9bd836ea917ebc0dfc5330135cfde322a3,
4bff8d58799e517a1ba78236db9b52353dd39b56,
775bdd9d138a79eeae7308617f5c0d1d0e1c1697,
abbd761b7768f3cbadbffc0c7947185856c4943d,
343692fe61c552f73ad6bc2d2d3072cc456da1db,
faf4430055c528c9a00a46e9fae7dc25047ffaf3,
255b5d56c39bfba861647fff67704e6bc758d683,
dae8e0910a368f034958ae232aa5f5285486a8ac,
3680dbd34437ca661592a7e4d39782c9c77fb4ba,
f5b43f7a77c9d9ea28101a1848d8b1a1c0a65b82,
5bda825102026bc0da731dc84d56a499ccff0fe1,
158d7b3ce949422de421d5e110e3f6903af4f8e1,
2efcae5cb10273a0f5e89ae34fa3156238ab0555,
8d241012d42097f80f30e8ead227d75ab77086d2,
307495c98ae5f25026b91e60cf51d4f9f1ad7f4b,
8fc2f55563ab67d4ec87ff7b04a4a01e821814a3,
b413572d14ee16c6c575ca3472adff62a8cbfa3d, 9219233b0e8afe57d7d5cb6...
scala> // Check how many of these random hashes return a positive result when
passed into mightContain
scala> random_sha1s.map(c => bloom_filter.mightContain(c)).count(_ == true)
res0: Int = 2153 {noformat}
I believe this is the result of the bloom filter implementation using 32bit
hashes. Since the maximum value that can be returned by the k hash functions is
~2.14 billion (max integer value in Java), bloom filters with m > ~2.14 billion
have degraded performance resulting from not using any bits at indices greater
than ~2.14 billion.
This was a known bug in Guava that was fixed several years ago, but it looks
like the fix was never ported to Spark. See
[https://github.com/google/guava/issues/1119]
Of course, using a different hash function strategy would break existing uses
of this code, so we should tread with caution here.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]