dbatomic opened a new pull request, #44537: URL: https://github.com/apache/spark/pull/44537
# Rough POC for collations in Spark

## High level changes

- Collation suite that tests the currently supported features (start with this file).
- Global, singleton `CollatorFactory`. Given a collation name or a comparator id (the cached representation of a collation), it provides a collation-aware comparator that `UTF8String` can use. (Proper design here is a TODO; it probably shouldn't be a singleton.)
- `UTF8String` is extended with a single integer that specifies the collation. We could be even more aggressive and pack this integer into a short, or even a byte. The id is the cached comparator id that can be fetched from `CollatorFactory`.
- `UTF8String` respects this id for equality checks and comparisons.
- New type called `CollatedStringType` with physical type `PhysicalCollatedStringType`, which ultimately maps back to `UTF8String`. Basic support for this new type across the code base.
- Support for aggregates, given that they currently rely on pure byte-for-byte comparison for group building.
- Support for merge join (hash-based joins are a TODO).
- The POC uses Java's default collator. The TODO is most likely to switch to ICU. Collator changes are scoped to a single file, so it shouldn't be hard to replace Java's collator with ICU.

## Supported features at this point:

- `collate` expression -> the input string is cast to `CollatedStringType` with the given collation.
- Collation rules are Java collator based. The caller provides a locale and a strength (primary, secondary, tertiary). E.g. `collate(input, 'sr-primary')` collates the input with the Serbian locale and ignores both casing and accents. Secondary ignores casing but respects accents, and tertiary respects both.
- `collation` expression -> returns the collation name of the given input.
- Support for basic operators (filters, aggregates, joins, views, inline tables etc.). Proper testing (and a real test strategy) is TBD.
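To make the factory and strength semantics described above concrete, here is a minimal sketch built on `java.text.Collator`. It is not the code from this PR: the class name `CollatorFactorySketch`, the methods `idFor` and `compare`, and the assumption that a collation name always has the form `<locale>-<strength>` are all hypothetical and chosen for illustration. It only shows the general idea of caching one collator per name behind a small integer id, which a string type could then carry.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the CollatorFactory idea; not the PR's actual class.
final class CollatorFactorySketch {
    // Collation name -> cached integer id (assumed scheme).
    private static final Map<String, Integer> ID_BY_NAME = new HashMap<>();
    // The id is simply the index into this list.
    private static final List<Collator> COLLATOR_BY_ID = new ArrayList<>();

    private CollatorFactorySketch() {}

    // Returns a stable id for a name such as "sr-primary",
    // creating and caching the collator on first use.
    static synchronized int idFor(String collationName) {
        Integer cached = ID_BY_NAME.get(collationName);
        if (cached != null) {
            return cached;
        }
        int dash = collationName.lastIndexOf('-');
        if (dash <= 0) {
            throw new IllegalArgumentException("Expected <locale>-<strength>: " + collationName);
        }
        Locale locale = Locale.forLanguageTag(collationName.substring(0, dash));
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(parseStrength(collationName.substring(dash + 1)));
        COLLATOR_BY_ID.add(collator);
        int id = COLLATOR_BY_ID.size() - 1;
        ID_BY_NAME.put(collationName, id);
        return id;
    }

    // Collation-aware comparison using the collator cached under the given id.
    static synchronized int compare(int collationId, String left, String right) {
        return COLLATOR_BY_ID.get(collationId).compare(left, right);
    }

    private static int parseStrength(String strength) {
        switch (strength) {
            case "primary":
                return Collator.PRIMARY;   // ignores case and accents
            case "secondary":
                return Collator.SECONDARY; // ignores case, respects accents
            case "tertiary":
                return Collator.TERTIARY;  // respects case and accents
            default:
                throw new IllegalArgumentException("Unknown strength: " + strength);
        }
    }

    public static void main(String[] args) {
        int primary = idFor("sr-primary");
        int secondary = idFor("sr-secondary");
        // "\u00e9" is a lowercase e with an acute accent.
        System.out.println("primary   a vs A: " + compare(primary, "a", "A"));
        System.out.println("primary   e vs e-acute: " + compare(primary, "e", "\u00e9"));
        System.out.println("secondary a vs A: " + compare(secondary, "a", "A"));
        System.out.println("secondary e vs e-acute: " + compare(secondary, "e", "\u00e9"));
    }
}
```

Keeping only an integer id on the string keeps the per-value overhead small, at the cost of a lookup on each comparison. Whether the real factory should be a singleton at all is, as noted above, still an open design question.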
Still TBD: Parquet and Delta support, different collation levels (column level, table level, database level), and much more extensive testing of other features.

Suggestion for reviewers of this POC: start with `CollationSuite` and the newly added tests in `UTF8StringSuite` to get the gist of the changes in this PR.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
