dbatomic opened a new pull request, #44537:
URL: https://github.com/apache/spark/pull/44537

   # Rough POC for collations in Spark
   
   ## High level changes
   - Collation suite that tests the currently supported features (start with 
this file).
   - Global, singleton `CollatorFactory`. Given a collation name or a 
comparator id (the cached representation of a collation), it provides a 
collation-aware comparator that `UTF8String` can use. (Proper design here is a 
TODO; it probably shouldn't be a singleton.)
   - `UTF8String` is extended with a single integer that specifies the 
collation. We could be even more aggressive and pack this integer into a short, 
or even a byte. The id is the cached comparator id, which can be used to fetch 
the comparator from `CollatorFactory`.
   - `UTF8String` respects this id for equality checks and comparisons.
   - New type called `CollatedStringType` with physical type 
`PhysicalCollatedStringType`, which ultimately maps back to `UTF8String`. Basic 
support for this new type across the code base.
   - Support for aggregates, which currently rely on pure byte-for-byte 
comparison for group building.
   - Support for merge join (hash based joins are TODO).
   - The POC uses Java's default collator. The TODO is most likely to switch to 
ICU. Collator changes are scoped to a single file, so it shouldn't be hard to 
replace Java's collator with ICU.
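
A minimal sketch of the id-based comparator cache described above, using only `java.text.Collator`. The class `CollatorCacheSketch`, its methods, and the `CollatedString` stand-in for `UTF8String` are hypothetical names chosen for illustration and are not taken from the PR. Rejecting comparisons between different collation ids is also an assumption.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the POC's CollatorFactory; names are illustrative only.
final class CollatorCacheSketch {
    private static final Map<String, Integer> ID_BY_NAME = new HashMap<>();
    private static final List<Collator> COLLATOR_BY_ID = new ArrayList<>();

    private CollatorCacheSketch() {}

    // Returns a stable small integer for a (locale, strength) pair,
    // creating and caching the collator on first use.
    static synchronized int idFor(String languageTag, int strength) {
        String name = languageTag + "/" + strength;
        Integer id = ID_BY_NAME.get(name);
        if (id == null) {
            Collator collator = Collator.getInstance(Locale.forLanguageTag(languageTag));
            collator.setStrength(strength);
            COLLATOR_BY_ID.add(collator);
            id = COLLATOR_BY_ID.size() - 1;
            ID_BY_NAME.put(name, id);
        }
        return id;
    }

    // Looks up the cached collator for an id previously returned by idFor.
    static synchronized Collator collatorFor(int id) {
        return COLLATOR_BY_ID.get(id);
    }

    // Stand-in for a string value that carries its collation id.
    static final class CollatedString implements Comparable<CollatedString> {
        final String value;
        final int collationId;

        CollatedString(String value, int collationId) {
            this.value = value;
            this.collationId = collationId;
        }

        @Override
        public int compareTo(CollatedString other) {
            // Assumption: comparing across different collations is an error.
            if (collationId != other.collationId) {
                throw new IllegalArgumentException("collation mismatch");
            }
            return collatorFor(collationId).compare(value, other.value);
        }

        // Equality under the carried collation rather than byte equality.
        boolean collationEquals(CollatedString other) {
            return collationId == other.collationId && compareTo(other) == 0;
        }
    }

    public static void main(String[] args) {
        int primary = idFor("en", Collator.PRIMARY);
        int tertiary = idFor("en", Collator.TERTIARY);
        System.out.println(new CollatedString("Spark", primary)
            .collationEquals(new CollatedString("SPARK", primary)));
        System.out.println(new CollatedString("Spark", tertiary)
            .collationEquals(new CollatedString("SPARK", tertiary)));
    }
}
```

Because the string carries only a small integer, the comparator lookup is a list index, which fits the idea of packing the id into a short or a byte.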
   
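The note on aggregates can be illustrated with collation keys. This is a sketch of one possible approach, not necessarily what the POC does: `java.text.CollationKey.toByteArray()` yields bytes that are equal exactly when two strings are equal under the collator, so byte-wise group building can operate on those bytes instead of the raw string bytes. The class and method names below are hypothetical.

```java
import java.nio.ByteBuffer;
import java.text.Collator;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of collation-aware grouping via binary collation keys.
final class CollationGroupingSketch {
    private CollationGroupingSketch() {}

    // Counts values per group, where group identity is the collation key bytes.
    static Map<ByteBuffer, Long> countByCollationKey(List<String> values, Collator collator) {
        Map<ByteBuffer, Long> counts = new LinkedHashMap<>();
        for (String value : values) {
            // ByteBuffer gives content-based equals/hashCode over the key bytes.
            ByteBuffer key = ByteBuffer.wrap(collator.getCollationKey(value).toByteArray());
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        List<String> values = Arrays.asList("spark", "Spark", "SPARK", "flink");
        System.out.println(countByCollationKey(values, collator).values());
    }
}
```

The same key bytes could in principle serve as hash keys for hash-based joins, which the list above marks as TODO.
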
   ## Supported features at this point:
   - `collate` expression -> the input string is cast to `CollatedStringType` 
with the given collation.
   - Collation rules are based on Java's collator. The caller provides a locale 
and a strength (primary, secondary, tertiary). E.g. `collate(input, 'sr-primary')` 
will collate input with the Serbian locale and ignore both casing and accents. 
Secondary will ignore casing but respect accents, and tertiary will respect both.
   - `collation` expression -> returns the collation name of the given input.
   - Support for basic operators (filters, aggregates, joins, views, inline 
tables, etc.).
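
The three strengths can be tried directly against `java.text.Collator`. The sketch below assumes a collation name of the form `<locale>-<strength>`, as in the `'sr-primary'` example. The parser `fromName` is hypothetical, and the POC's actual name handling may differ.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical parser for names shaped like "sr-primary"; illustrative only.
final class CollationNameSketch {
    private CollationNameSketch() {}

    static Collator fromName(String name) {
        int dash = name.lastIndexOf('-');
        if (dash < 0) {
            throw new IllegalArgumentException("expected <locale>-<strength>: " + name);
        }
        Collator collator = Collator.getInstance(Locale.forLanguageTag(name.substring(0, dash)));
        switch (name.substring(dash + 1)) {
            case "primary":   // ignores case and accents
                collator.setStrength(Collator.PRIMARY);
                break;
            case "secondary": // ignores case, respects accents
                collator.setStrength(Collator.SECONDARY);
                break;
            case "tertiary":  // respects both
                collator.setStrength(Collator.TERTIARY);
                break;
            default:
                throw new IllegalArgumentException("unknown strength in: " + name);
        }
        return collator;
    }

    public static void main(String[] args) {
        String plain = "resume";
        String accented = "r\u00e9sum\u00e9";
        for (String name : new String[] {"sr-primary", "sr-secondary", "sr-tertiary"}) {
            Collator collator = fromName(name);
            System.out.println(name
                + " case-equal=" + (collator.compare(plain, "RESUME") == 0)
                + " accent-equal=" + (collator.compare(plain, accented) == 0));
        }
    }
}
```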
   
   Proper testing (and creating a real test strategy) is TBD.
   
   Also TBD: Parquet and Delta support, different collation levels (column 
level, table level, database level), and much more extensive testing of other 
features.
   
   Reviewers of this POC are advised to start with `CollationSuite` and the 
newly added tests in `UTF8StringSuite` to get the gist of the changes in this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

