Jungtaek Lim created SPARK-47776:
------------------------------------

             Summary: State store operation cannot work properly with binary 
inequality collation
                 Key: SPARK-47776
                 URL: https://issues.apache.org/jira/browse/SPARK-47776
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 4.0.0
            Reporter: Jungtaek Lim


Arguably this is a correctness issue, though we haven't released the collation 
feature yet.

Collation introduces the concept of binary (in)equality, which means that in 
some collations we are no longer able to just compare the binary format of two 
UnsafeRows to determine equality.

For example, 'aaa' and 'AAA' can be "semantically" the same in a 
case-insensitive collation.
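
As a rough illustration in plain Python (not Spark code), the gap between 
byte-wise equality and collation-aware equality could look like the sketch 
below. The helper names binary_equal and ci_equal are made up for this example, 
and lowercasing is only a stand-in for a real case-insensitive collation.

```python
def binary_equal(a: str, b: str) -> bool:
    # Compare the encoded bytes, similar in spirit to comparing the
    # binary format of two rows.
    return a.encode("utf-8") == b.encode("utf-8")


def ci_equal(a: str, b: str) -> bool:
    # Assumption: lowercasing approximates a case-insensitive collation.
    return a.lower() == b.lower()


print(binary_equal("aaa", "AAA"))  # False
print(ci_equal("aaa", "AAA"))      # True
```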

State store is basically key-value storage, and most provider 
implementations rely on the fact that all the columns in the key schema support 
binary equality. We need to disallow using binary inequality columns in the key 
schema until we can support them in the majority of state store providers (or 
at a higher level of the state store).
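
The kind of guard described above might be sketched as follows. This is plain 
Python and not the Spark implementation: KeyColumn, its 
supports_binary_equality flag, and validate_key_schema are hypothetical names 
used only to show the shape of the check.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class KeyColumn:
    # Hypothetical description of one column in a state store key schema.
    name: str
    supports_binary_equality: bool


def validate_key_schema(columns: Iterable[KeyColumn]) -> None:
    """Reject key schemas with columns that cannot be compared byte-wise."""
    offending = [c.name for c in columns if not c.supports_binary_equality]
    if offending:
        raise ValueError(
            "Binary inequality columns are not allowed in the state store "
            f"key schema: {', '.join(offending)}"
        )


# A key schema with only binary-equality columns passes silently.
validate_key_schema([KeyColumn("id", True)])

# A case-insensitive string column in the key is rejected.
try:
    validate_key_schema([KeyColumn("strCol", False)])
except ValueError as e:
    print(e)
```

Failing fast at query analysis time, rather than silently producing wrong 
results, is the design choice this sketch is meant to convey.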

Why is this a correctness issue? For example, streaming aggregation will 
produce aggregation output that does not respect semantic equality.

e.g. df.groupBy(strCol).count() 

Although strCol is case insensitive, 'a' and 'A' won't be counted together in 
streaming aggregation, while they should be.
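
The miscount can be imitated with a toy key-value "state store" in plain 
Python. This is only a simulation under stated assumptions: a Counter keyed by 
UTF-8 bytes stands in for the state store, and lowercasing stands in for the 
case-insensitive collation. The second function shows the expected result, not 
a proposed fix.

```python
from collections import Counter


def count_by_binary_key(values):
    # Keys are raw bytes, so 'a' and 'A' land in different state entries.
    state = Counter()
    for v in values:
        state[v.encode("utf-8")] += 1
    return dict(state)


def count_by_collation_key(values):
    # Assumption: lowercasing approximates the case-insensitive collation,
    # so semantically equal values share one state entry.
    state = Counter()
    for v in values:
        state[v.lower().encode("utf-8")] += 1
    return dict(state)


rows = ["a", "A", "a"]
print(count_by_binary_key(rows))     # {b'a': 2, b'A': 1}
print(count_by_collation_key(rows))  # {b'a': 3}
```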



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

