Jungtaek Lim created SPARK-47776:
------------------------------------

             Summary: State store operation cannot work properly with binary inequality collation
                 Key: SPARK-47776
                 URL: https://issues.apache.org/jira/browse/SPARK-47776
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 4.0.0
            Reporter: Jungtaek Lim

Arguably this is a correctness issue, though we haven't released the collation feature yet.

Collation introduces the concept of binary (in)equality, which means that under some collations we can no longer just compare the binary format of two UnsafeRows to determine equality. For example, 'aaa' and 'AAA' can be "semantically" the same under a case-insensitive collation.

The state store is basically a key-value storage, and most provider implementations rely on the fact that all columns in the key schema support binary equality. We need to disallow binary inequality columns in the key schema until we can support them in the majority of state store providers (or at a higher level of the state store).

Why is this a correctness issue? For example, streaming aggregation will produce aggregation output that does not respect semantic equality, e.g.

    df.groupBy(strCol).count()

Although strCol is case insensitive, 'a' and 'A' won't be counted together in streaming aggregation, while they should be.
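
To make the 'a' vs 'A' example concrete, here is a small toy model in plain Python. It is not Spark code and does not reflect how any state store provider is actually implemented: the dict stands in for a key-value state store, `binary_key` stands in for comparing the raw bytes of a key row, and `case_insensitive_key` is a hypothetical normalization that uses `str.casefold()` as a rough approximation of a case-insensitive collation. All function names are made up for illustration.

```python
from collections import Counter


def binary_key(value: str) -> bytes:
    # Stand-in for keying the state store on the raw bytes of the key row.
    return value.encode("utf-8")


def case_insensitive_key(value: str) -> bytes:
    # Hypothetical normalization: casefold() approximates a
    # case-insensitive collation for this toy example only.
    return value.casefold().encode("utf-8")


def count_by_key(values, key_fn):
    # Toy "streaming aggregation": one counter per state store key.
    state = Counter()
    for v in values:
        state[key_fn(v)] += 1
    return dict(state)


rows = ["a", "A", "aaa", "AAA", "b"]

# Keys compared by bytes: 'a' and 'A' end up in separate groups.
binary_counts = count_by_key(rows, binary_key)

# Keys compared semantically: 'a' and 'A' share one group.
semantic_counts = count_by_key(rows, case_insensitive_key)

print(binary_counts)
print(semantic_counts)
```

With byte-wise keys the five input rows produce five groups of one row each, whereas grouping on the normalized key produces three groups. The first result is the kind of output described in this issue; the second is what a case-insensitive strCol would be expected to yield.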