[
https://issues.apache.org/jira/browse/SPARK-57055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083433#comment-18083433
]
Kent Yao commented on SPARK-57055:
----------------------------------
Opened a docs-only PR for the path-2 silent gap:
https://github.com/apache/spark/pull/56114
The PR adds @note Scaladoc to all four DataFrameStatFunctions.bloomFilter
overloads + a Migration Guide entry in 'Upgrading from Spark SQL 3.5 to 4.0'.
It documents the binary-semantics behaviour and recommends UTF8_BINARY
collation column or lower(col) ASCII work-around. ICU-collation users are
pointed back here for the public canonical-form helper discussion.
> DataFrameStatFunctions.bloomFilter is collation-blind on non-binary collation
> columns
> -------------------------------------------------------------------------------------
>
> Key: SPARK-57055
> URL: https://issues.apache.org/jira/browse/SPARK-57055
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.1.0, 4.0.0, 5.0.0
> Reporter: Kent Yao
> Priority: Minor
> Labels: pull-request-available
>
> h3. Retraction of original report (2026-05-26)
> This ticket was originally filed (2026-05-26) as a Bug claiming
> {{InjectRuntimeFilter}} produces silent false negatives on non-binary
> collation StringType join keys. *That claim was wrong* and is hereby
> retracted.
> Empirical disproof: a fresh end-to-end test
> ({{InjectRuntimeFilterCollationSuite}}, 3 cases: UTF8_LCASE hit /
> UNICODE_CI hit / UTF8_BINARY no-op) PASSES on master @ {{29fdcefb8f2}}
> with no fix applied. Verbatim optimized plan dump:
> {noformat}
> Project [v#7]
> +- Join Inner, (collationkey(k#6) = collationkey(k#8))
> :- Filter might_contain(scalar-subquery#14 [],
> xxhash64(collationkey(k#6), 42))
> : : +- Aggregate [bloom_filter_agg(
> : : xxhash64(collationkey(k#8), 42), ...) AS
> bloomFilter#13]
> : : +- Project [k#8]
> : : +- Filter (isnotnull(tag#9) AND (tag#9 = 1))
> : : +- Relation spark_catalog.default.bf_l[k#8,tag#9] parquet
> {noformat}
> Root cause: {{RewriteCollationJoin}} analyzer rule
> ({{sql/catalyst/.../analysis/RewriteCollationJoin.scala}}, introduced
> by SPARK-48000 / {{e6236af3d08}}) already wraps non-binary collated
> join keys with {{CollationKey(expr)}} *before* the optimizer runs.
> {{InjectRuntimeFilter}} therefore sees an already-wrapped key and
> emits {{xxhash64(collationkey(k), 42)}} on both build and probe sides
> symmetrically. The BloomFilter contract holds by construction.
> The original reproducer measured a hand-written SQL shape
> ({{xxhash64(name)}} composed by the user) that {{InjectRuntimeFilter}}
> never produces. Failing test != failing production path; I conflated
> the two. Apologies for the noise.
> ----
> h3. Reshape: this ticket now tracks a different, narrower question
> Reclassified to Improvement / Minor.
> h4. Topic — DataFrameStatFunctions.bloomFilter is collation-blind
> {{DataFrameStatFunctions.bloomFilter(col, expectedNumItems, numBits)}}
> (public API since 2.0) is collation-blind. When invoked over a column
> whose schema collation is non-binary (e.g. UTF8_LCASE), the resulting
> {{org.apache.spark.util.sketch.BloomFilter}} treats values byte-wise:
> {{mightContainString("alice")}} returns {{false}} after building over
> {{"Alice"}} under UTF8_LCASE, even though the column's own {{=}}
> semantics would consider the two equal.
> Empirical evidence (test {{BloomFilterPath2ReproSuite}} on master @
> {{29fdcefb8f2}}):
> {noformat}
> [PATH2-SCHEMA] name: string collate UTF8_LCASE
> [PATH2-OUTCOME] BUILT-OK | mightContain('Alice')=true
> | mightContain('alice')=false
> | mightContain('Bob')=true
> | mightContain('bob')=false
> [PATH2-BASELINE] binary col: 'Alice'=true 'alice'=false
> {noformat}
> The UTF8_LCASE column behaves byte-for-byte identically to the
> UTF8_BINARY baseline -- the API ignores collation.
> h4. Strict contract analysis
> The returned object is {{org.apache.spark.util.sketch.BloomFilter}}, a
> plain Java sketch. Its {{mightContainString(String)}} method has never
> promised collation-aware semantics; it operates on raw UTF-8 bytes.
> Strictly speaking, no contract is violated.
> However, from a Spark SQL user's perspective there is a real semantic
> gap: {{WHERE name = 'alice'}} on a UTF8_LCASE column matches {{Alice}},
> but the BloomFilter built from the same column via this API does not.
> The mismatch is silent (no exception, no warning) and easy to miss.
> h4. Possible directions (not committed)
> Listed for discussion; ticket scope is "decide a direction", not
> "prescribe one".
> * {*}Reject at API boundary{*}: throw
> {{IllegalArgumentException}} when called on a non-UTF8_BINARY
> collation column. Smallest behavioural change; would break any
> existing user (if any) silently relying on the byte-wise semantics.
> * {*}Wrap with CollationKey internally{*}: at the API call site,
> rewrite the column to {{CollationKey(col)}} before invoking
> {{bloom_filter_agg}}. Build side becomes collation-aware. Probe
> side ({{BloomFilter.mightContainString}}) would also need a
> wrapper, ideally a {{CollationAwareBloomFilter}} subclass that
> applies {{getCollationKey}} on each probe. Larger surface change;
> serialization compatibility and PySpark parity need separate
> thought.
> * {*}Documentation only{*}: clarify in scaladoc that
> {{DataFrameStatFunctions.bloomFilter}} is byte-wise regardless of
> column collation, and recommend explicit
> {{collation_key(col)}} composition. Smallest scope, leaves the
> silent gap in place.
> h4. Reproducer
> Test file
> {{sql/core/src/test/scala/org/apache/spark/sql/BloomFilterPath2ReproSuite.scala}}
> (local at this reporter's worktree; can be contributed if there is
> appetite for direction (a) or (b)).
> ----
> h3. Out of scope for this ticket
> * {{InjectRuntimeFilter}} / runtime bloom filter on join keys --
> already correct via {{RewriteCollationJoin}} (see retraction
> above).
> * User-written {{bloom_filter_agg(xxhash64(string_col))}} +
> {{might_contain(bf, xxhash64('literal'))}} -- a user is hashing
> manually; the result depends on what the user wrote.
> * {{BloomFilterAggregate.BinaryUpdater}} -- reachable only via the
> same Java-API path above; covered if that path is fixed.
> ----
> Reporter: Kent Yao
> Affects Versions: 4.0.0, 4.1.0, 5.0.0
> Component: SQL
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]