yaooqinn opened a new pull request, #56114: URL: https://github.com/apache/spark/pull/56114
### What changes were proposed in this pull request?

Add `@note` Scaladoc to all four `DataFrameStatFunctions.bloomFilter` overloads and a Migration Guide entry stating that bloom filters built over string columns use raw UTF-8 byte equality and are collation-blind for non-binary collations (UTF8_LCASE, ICU). For collation-consistent membership, the docs recommend using a UTF8_BINARY-collated column, or normalizing values manually (e.g. `lower(col)` for ASCII data under UTF8_LCASE).

### Why are the changes needed?

Since Spark 4.0, columns may carry non-binary collations. The bloom filter path does not collation-normalize values, so users get silently inconsistent `mightContain` results without any documented warning. See [SPARK-57055](https://issues.apache.org/jira/browse/SPARK-57055).

### Does this PR introduce _any_ user-facing change?

Documentation only.

### How was this patch tested?

Scaladoc renders cleanly via `build/sbt "sql-api/doc"` (49s, no new warnings).

Behavior reproduced on master via `spark-shell`:

```scala
// Sample 1: UTF8_BINARY (default), collation-consistent
val df1 = Seq("Alice", "Bob", "Carol").toDF("name")
val bf1 = df1.stat.bloomFilter("name", 100L, 0.01)
bf1.mightContain("Alice") // true
bf1.mightContain("alice") // false

// Sample 2: UTF8_LCASE raw, the silent gap documented by this PR
val df2 = spark.sql(
  "SELECT 'Alice' COLLATE UTF8_LCASE AS name UNION ALL " +
  "SELECT 'Bob' COLLATE UTF8_LCASE UNION ALL " +
  "SELECT 'Carol' COLLATE UTF8_LCASE")
val bf2 = df2.stat.bloomFilter("name", 100L, 0.01)
bf2.mightContain("Alice") // true
bf2.mightContain("alice") // false <-- silent gap

// Sample 3: UTF8_LCASE + lower(), the ASCII work-around recommended by this PR
import org.apache.spark.sql.functions.lower
val bf3 = df2.stat.bloomFilter(lower(df2("name")), 100L, 0.01)
bf3.mightContain("alice") // true (user must wrap probe side too)
bf3.mightContain("Alice") // false
```

No new unit tests: this is a documentation-only change with no behavior to assert.
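
The `lower()` work-around only gives consistent answers if the probe side is normalized the same way as the build side. The following is a minimal sketch of that idea, not something this PR adds. It uses the `org.apache.spark.util.sketch.BloomFilter` class directly so that it runs in `spark-shell` without a DataFrame. `normalize` and `mightContainNormalized` are hypothetical helper names, and `toLowerCase(Locale.ROOT)` is assumed to match `lower(col)` only for ASCII data.

```scala
import java.util.Locale
import org.apache.spark.util.sketch.BloomFilter

// Hypothetical helper (not part of Spark): one normalization shared by
// the build side and the probe side. Assumed adequate for ASCII data only.
def normalize(s: String): String = s.toLowerCase(Locale.ROOT)

// Build side: insert normalized values, so the filter hashes the UTF-8
// bytes of "alice", "bob", "carol".
val bf = BloomFilter.create(100L, 0.01)
Seq("Alice", "Bob", "Carol").foreach(v => bf.putString(normalize(v)))

// Probe side: hypothetical wrapper that applies the same normalization
// before asking the filter.
def mightContainNormalized(filter: BloomFilter, probe: String): Boolean =
  filter.mightContainString(normalize(probe))

// Bloom filters have no false negatives, so both probes report true.
mightContainNormalized(bf, "ALICE") // true
mightContainNormalized(bf, "alice") // true
```

Keeping the normalization in one function avoids the case shown in Sample 3, where an unwrapped probe such as `"Alice"` misses a value that is equal under the column's collation.
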
### Scope note

PySpark / R / Connect docstring synchronization is intentionally deferred to a follow-up PR to keep this change docs-only and minimal.
