[PR] [SPARK-57055][SQL][DOCS] Document non-binary collation gap in DataFrameStatFunctions.bloomFilter [spark]

via GitHub Tue, 26 May 2026 01:09:30 -0700


yaooqinn opened a new pull request, #56114:
URL: https://github.com/apache/spark/pull/56114


   ### What changes were proposed in this pull request?
   
   Add an `@note` to the Scaladoc of all four
   `DataFrameStatFunctions.bloomFilter` overloads, plus a Migration Guide
   entry, stating that bloom filters built over string columns compare
   raw UTF-8 bytes and are therefore collation-blind for non-binary
   collations (UTF8_LCASE and ICU collations). For collation-consistent
   membership, the docs recommend using a UTF8_BINARY-collated column or
   normalizing values manually (e.g. `lower(col)` for ASCII data under
   UTF8_LCASE).
   
   ### Why are the changes needed?
   
   Since Spark 4.0, columns may carry non-binary collations. The bloom
   filter path does not collation-normalize values, so `mightContain`
   can silently disagree with the column's collation, and nothing in the
   docs currently warns users about it.
   See [SPARK-57055](https://issues.apache.org/jira/browse/SPARK-57055).
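
   As a rough illustration of why byte-level membership cannot honor a
   case-insensitive collation, the sketch below compares UTF-8 encodings
   in plain Scala, with no Spark dependency. The `asciiLcaseKey` helper is
   hypothetical and only approximates a UTF8_LCASE comparison key for
   ASCII input; it is not how Spark implements collations.

   ```scala
   import java.nio.charset.StandardCharsets.UTF_8
   import java.util.Arrays
   import java.util.Locale

   // Hypothetical helper: ASCII-only stand-in for a UTF8_LCASE comparison key.
   def asciiLcaseKey(s: String): String = s.toLowerCase(Locale.ROOT)

   val upperBytes = "Alice".getBytes(UTF_8)
   val lowerBytes = "alice".getBytes(UTF_8)

   // The raw encodings differ in the first byte ('A' vs 'a'), so anything
   // keyed on the bytes sees two distinct values.
   val sameBytes = Arrays.equals(upperBytes, lowerBytes)

   // After normalization, both strings encode to identical bytes.
   val sameKey = Arrays.equals(
     asciiLcaseKey("Alice").getBytes(UTF_8),
     asciiLcaseKey("alice").getBytes(UTF_8))
   ```

   Here `sameBytes` is `false` and `sameKey` is `true`, which mirrors the
   difference between the raw and the `lower()`-normalized filters.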
   
   ### Does this PR introduce _any_ user-facing change?
   
   Documentation only.
   
   ### How was this patch tested?
   
   Scaladoc renders cleanly via `build/sbt "sql-api/doc"` (49s, no new
   warnings).
   
   Behavior reproduced on master via `spark-shell`:
   
   ```scala
   // Sample 1: UTF8_BINARY (default) — collation-consistent
   val df1 = Seq("Alice", "Bob", "Carol").toDF("name")
   val bf1 = df1.stat.bloomFilter("name", 100L, 0.01)
   bf1.mightContain("Alice")  // true
   bf1.mightContain("alice")  // false
   
   // Sample 2: UTF8_LCASE raw — silent gap documented by this PR
   val df2 = spark.sql(
     "SELECT 'Alice' COLLATE UTF8_LCASE AS name UNION ALL " +
     "SELECT 'Bob' COLLATE UTF8_LCASE UNION ALL " +
     "SELECT 'Carol' COLLATE UTF8_LCASE")
   val bf2 = df2.stat.bloomFilter("name", 100L, 0.01)
   bf2.mightContain("Alice")  // true
   bf2.mightContain("alice")  // false  <-- silent gap
   
   // Sample 3: UTF8_LCASE + lower() ASCII work-around recommended by this PR
   import org.apache.spark.sql.functions.lower
   val bf3 = df2.stat.bloomFilter(lower(df2("name")), 100L, 0.01)
   bf3.mightContain("alice")  // true   (the probe value must be lowercased too)
   bf3.mightContain("Alice")  // false
   ```
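
   Because the Sample 3 work-around only holds when the probe value is
   lowercased as well, one option is to bundle both sides into a single
   function. The following is a minimal sketch, assuming ASCII data and
   the same `spark-shell` session as above; `asciiLowerMembership` is a
   hypothetical helper, not part of Spark or of this PR.

   ```scala
   import java.util.Locale
   import org.apache.spark.sql.DataFrame
   import org.apache.spark.sql.functions.lower

   // Hypothetical helper (ASCII data assumed): builds the filter over
   // lower(col) and returns a probe function that lowercases its argument
   // before calling mightContain, so the two sides cannot drift apart.
   def asciiLowerMembership(
       df: DataFrame,
       colName: String,
       expectedNumItems: Long,
       fpp: Double): String => Boolean = {
     val bf = df.stat.bloomFilter(lower(df(colName)), expectedNumItems, fpp)
     probe => bf.mightContain(probe.toLowerCase(Locale.ROOT))
   }

   // Usage with df2 from Sample 2: both probes are lowercased to "alice",
   // the same probe that returns true against bf3 in Sample 3.
   val contains = asciiLowerMembership(df2, "name", 100L, 0.01)
   contains("Alice")
   contains("alice")
   ```

   Returning a closure keeps the raw filter out of reach, so a caller
   cannot accidentally probe it with an unnormalized value.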
   
   No new unit tests — this is a documentation-only change with no
   behavior to assert.
   
   ### Scope note
   
   PySpark / R / Connect docstring synchronization is intentionally
   deferred to a follow-up PR to keep this change docs-only and minimal.
   


