[jira] [Commented] (SPARK-57055) DataFrameStatFunctions.bloomFilter is collation-blind on non-binary collation columns

Kent Yao (Jira) Tue, 26 May 2026 01:10:29 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-57055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083433#comment-18083433
 ]


Kent Yao commented on SPARK-57055:
----------------------------------

Opened a docs-only PR for the path-2 silent gap: 
https://github.com/apache/spark/pull/56114

The PR adds an @note to the Scaladoc of all four
DataFrameStatFunctions.bloomFilter overloads, plus a Migration Guide entry
under 'Upgrading from Spark SQL 3.5 to 4.0'. It documents the
binary-semantics behaviour and recommends either a UTF8_BINARY-collated
column or, for ASCII data, a lower(col) work-around. ICU-collation users are
pointed back here for the discussion of a public canonical-form helper.
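
To make the byte-wise behaviour and the lower(col) work-around concrete, here
is a minimal sketch. It is a toy Bloom filter built on the Java standard
library only: the class ByteWiseBloom and its seeded FNV-1a hash are
illustrative assumptions, not Spark's org.apache.spark.util.sketch.BloomFilter,
and lowercasing with Locale.ROOT merely stands in for lower(col) on ASCII data.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.Locale;

/** Toy byte-wise Bloom filter for illustration only; not Spark's BloomFilter. */
final class ByteWiseBloom {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    ByteWiseBloom(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Seeded FNV-1a over the raw bytes: a case difference changes the input.
    private static int hash(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return h;
    }

    void put(String value) {
        byte[] raw = value.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(Math.floorMod(hash(raw, i), numBits));
        }
    }

    boolean mightContain(String value) {
        byte[] raw = value.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(Math.floorMod(hash(raw, i), numBits))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Built over the raw value: probing with a different case hashes
        // different bytes, so a match would only be a chance false positive.
        ByteWiseBloom raw = new ByteWiseBloom(1 << 16, 3);
        raw.put("Alice");
        System.out.println("raw  'Alice': " + raw.mightContain("Alice"));
        System.out.println("raw  'alice': " + raw.mightContain("alice"));

        // Work-around: lowercase on BOTH the build side and the probe side.
        ByteWiseBloom lowered = new ByteWiseBloom(1 << 16, 3);
        lowered.put("Alice".toLowerCase(Locale.ROOT));
        System.out.println("lower 'alice': "
            + lowered.mightContain("alice".toLowerCase(Locale.ROOT)));
        System.out.println("lower 'ALICE': "
            + lowered.mightContain("ALICE".toLowerCase(Locale.ROOT)));
    }
}
```

The work-around only helps if the same normalisation is applied when the
filter is built and every time it is probed; normalising one side alone
reproduces the gap.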

> DataFrameStatFunctions.bloomFilter is collation-blind on non-binary collation 
> columns
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-57055
>                 URL: https://issues.apache.org/jira/browse/SPARK-57055
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.1.0, 4.0.0, 5.0.0
>            Reporter: Kent Yao
>            Priority: Minor
>              Labels: pull-request-available
>
> h3. Retraction of original report (2026-05-26)
> This ticket was originally filed (2026-05-26) as a Bug claiming
> {{InjectRuntimeFilter}} produces silent false negatives on non-binary
> collation StringType join keys. *That claim was wrong* and is hereby
> retracted.
> Empirical disproof: a fresh end-to-end test
> ({{InjectRuntimeFilterCollationSuite}}, 3 cases: UTF8_LCASE hit /
> UNICODE_CI hit / UTF8_BINARY no-op) PASSES on master @ {{29fdcefb8f2}}
> with no fix applied. Verbatim optimized plan dump:
> {noformat}
> Project [v#7]
> +- Join Inner, (collationkey(k#6) = collationkey(k#8))
>    :- Filter might_contain(scalar-subquery#14 [],
>                             xxhash64(collationkey(k#6), 42))
>    :  :  +- Aggregate [bloom_filter_agg(
>    :  :                  xxhash64(collationkey(k#8), 42), ...) AS bloomFilter#13]
>    :  :     +- Project [k#8]
>    :  :        +- Filter (isnotnull(tag#9) AND (tag#9 = 1))
>    :  :           +- Relation spark_catalog.default.bf_l[k#8,tag#9] parquet
> {noformat}
> Root cause: the {{RewriteCollationJoin}} analyzer rule
> ({{sql/catalyst/.../analysis/RewriteCollationJoin.scala}}, introduced
> by SPARK-48000 / {{e6236af3d08}}) already wraps non-binary collated
> join keys with {{CollationKey(expr)}} *before* the optimizer runs.
> {{InjectRuntimeFilter}} therefore sees an already-wrapped key and
> emits {{xxhash64(collationkey(k), 42)}} on both build and probe sides
> symmetrically. The BloomFilter contract holds by construction.
> The original reproducer measured a hand-written SQL shape
> ({{xxhash64(name)}} composed by the user) that {{InjectRuntimeFilter}}
> never produces. Failing test != failing production path; I conflated
> the two. Apologies for the noise.
> ----
> h3. Reshape: this ticket now tracks a different, narrower question
> Reclassified to Improvement / Minor.
> h4. Topic — DataFrameStatFunctions.bloomFilter is collation-blind
> {{DataFrameStatFunctions.bloomFilter(col, expectedNumItems, numBits)}}
> (public API since 2.0) is collation-blind. When invoked over a column
> whose schema collation is non-binary (e.g. UTF8_LCASE), the resulting
> {{org.apache.spark.util.sketch.BloomFilter}} treats values byte-wise:
> {{mightContainString("alice")}} returns {{false}} after building over
> {{"Alice"}} under UTF8_LCASE, even though the column's own {{=}}
> semantics would consider the two equal.
> Empirical evidence (test {{BloomFilterPath2ReproSuite}} on master @
> {{29fdcefb8f2}}):
> {noformat}
> [PATH2-SCHEMA]   name: string collate UTF8_LCASE
> [PATH2-OUTCOME]  BUILT-OK | mightContain('Alice')=true
>                           | mightContain('alice')=false
>                           | mightContain('Bob')=true
>                           | mightContain('bob')=false
> [PATH2-BASELINE] binary col: 'Alice'=true 'alice'=false
> {noformat}
> The UTF8_LCASE column behaves byte-for-byte identically to the
> UTF8_BINARY baseline -- the API ignores collation.
> h4. Strict contract analysis
> The returned object is {{org.apache.spark.util.sketch.BloomFilter}}, a
> plain Java sketch. Its {{mightContainString(String)}} method has never
> promised collation-aware semantics; it operates on raw UTF-8 bytes.
> Strictly speaking, no contract is violated.
> However, from a Spark SQL user's perspective there is a real semantic
> gap: {{WHERE name = 'alice'}} on a UTF8_LCASE column matches {{Alice}},
> but the BloomFilter built from the same column via this API does not.
> The mismatch is silent (no exception, no warning) and easy to miss.
> h4. Possible directions (not committed)
> Listed for discussion; ticket scope is "decide a direction", not
> "prescribe one".
> * {*}Reject at API boundary{*}: throw
>   {{IllegalArgumentException}} when called on a non-UTF8_BINARY
>   collation column. Smallest behavioural change; would break
>   existing users, if any, who silently rely on the byte-wise semantics.
> * {*}Wrap with CollationKey internally{*}: at the API call site,
>   rewrite the column to {{CollationKey(col)}} before invoking
>   {{bloom_filter_agg}}. Build side becomes collation-aware. Probe
>   side ({{BloomFilter.mightContainString}}) would also need a
>   wrapper, ideally a {{CollationAwareBloomFilter}} subclass that
>   applies {{getCollationKey}} on each probe. Larger surface change;
>   serialization compatibility and PySpark parity need separate
>   thought.
> * {*}Documentation only{*}: clarify in scaladoc that
>   {{DataFrameStatFunctions.bloomFilter}} is byte-wise regardless of
>   column collation, and recommend explicit
>   {{collation_key(col)}} composition. Smallest scope, leaves the
>   silent gap in place.
> h4. Reproducer
> Test file
> {{sql/core/src/test/scala/org/apache/spark/sql/BloomFilterPath2ReproSuite.scala}}
> (local to the reporter's worktree; can be contributed if there is
> appetite for the "reject" or "wrap" direction above).
> ----
> h3. Out of scope for this ticket
> * {{InjectRuntimeFilter}} / runtime bloom filter on join keys --
>   already correct via {{RewriteCollationJoin}} (see retraction
>   above).
> * User-written {{bloom_filter_agg(xxhash64(string_col))}} +
>   {{might_contain(bf, xxhash64('literal'))}} -- a user is hashing
>   manually; the result depends on what the user wrote.
> * {{BloomFilterAggregate.BinaryUpdater}} -- reachable only via the
>   same Java-API path above; covered if that path is fixed.
> ----
> Reporter: Kent Yao
> Affects Versions: 4.0.0, 4.1.0, 5.0.0
> Component: SQL
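
As a discussion aid for the "reject" and "wrap" directions listed in the
description, the following is a rough sketch of what each could look like. It
is written against the Java standard library only and every name in it is
hypothetical: CanonicalizingBloom, requireBinaryCollation and lcaseKey do not
exist in Spark, and lowercasing with Locale.ROOT is only an assumed stand-in
for a UTF8_LCASE collation key, which Spark may derive differently.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: applies one key function on both build and probe. */
final class CanonicalizingBloom {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;
    private final UnaryOperator<String> keyFn;

    CanonicalizingBloom(int numBits, int numHashes, UnaryOperator<String> keyFn) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.keyFn = keyFn;
    }

    /** "Reject" direction: refuse any collation other than UTF8_BINARY. */
    static void requireBinaryCollation(String collationName) {
        if (!"UTF8_BINARY".equals(collationName)) {
            throw new IllegalArgumentException(
                "Bloom filter hashes raw bytes; collation " + collationName
                    + " is not UTF8_BINARY");
        }
    }

    /** Assumed stand-in for a UTF8_LCASE collation key (not Spark's code). */
    static String lcaseKey(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Canonicalize first, then hash the key's bytes with seeded FNV-1a.
    private int index(String value, int seed) {
        byte[] key = keyFn.apply(value).getBytes(StandardCharsets.UTF_8);
        int h = 0x811C9DC5 ^ seed;
        for (byte b : key) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return Math.floorMod(h, numBits);
    }

    /** "Wrap" direction, build side. */
    void put(String value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(value, i));
        }
    }

    /** "Wrap" direction, probe side: same key function as put. */
    boolean mightContain(String value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        CanonicalizingBloom bf =
            new CanonicalizingBloom(1 << 16, 3, CanonicalizingBloom::lcaseKey);
        bf.put("Alice");
        System.out.println(bf.mightContain("alice")); // true: same key as "Alice"
        try {
            requireBinaryCollation("UTF8_LCASE");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Keeping the key function inside the filter object is what makes the probe
side safe in this sketch; it is also why the description flags serialization
compatibility as an open question, since a serialized filter would need to
carry or imply that function.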


