[
https://issues.apache.org/jira/browse/SPARK-47410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wenchen Fan reassigned SPARK-47410:
-----------------------------------
Assignee: Uroš Bojanić
> refactor UTF8String and CollationFactory
> ----------------------------------------
>
> Key: SPARK-47410
> URL: https://issues.apache.org/jira/browse/SPARK-47410
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Uroš Bojanić
> Assignee: Uroš Bojanić
> Priority: Major
> Labels: pull-request-available
>
> This ticket addresses the need to refactor the {{UTF8String}} and
> {{CollationFactory}} classes within Spark to enhance support for
> collation-aware expressions. The goal is to improve code structure,
> maintainability, readability, and testing coverage for collation-aware Spark
> SQL expressions.
> The changed introduced herein should simplify addition of new collation-aware
> operations and ensure consistent testing across the codebase.
>
> To further support the addition of collation support for new Spark
> expressions, here are a couple of guidelines to follow:
>
> // 1. Collation-aware expression implementation
> CollationSupport.java
> * should serve as a static entry point for collation-aware expressions,
> providing custom support
> * for example: one by one Spark expression with corresponding collation
> support
> * also note that: CollationAwareUTF8String should be used for
> collation-aware UTF8String operations & other utility methods
> CollationFactory.java
> * should continue to serve as a static provider for high-level collation
> interface
> * for example: interacting with external ICU components such as Collator,
> StringSearch, etc.
> * also note that: no low-level / expression-specific code should generally
> be found here
> UTF8String.java
> * should be largely collation-unaware, and generally be used only as
> storage, nothing else
> * for example: don’t change this class at all (with the only one-time
> exception of: semanticEquals/Compare)
> * also note that: no collation-aware operation implementations (using
> collationId) should be put here
> stringExpressions.scala / regexpExpressions.scala / other
> “sql.catalyst.expressions” (for example: Between.scala)
> * should only contain minimal changes in order to re-route collation-aware
> implementations to CollationSupport
> * for example: most changes should be in relation to: adding collationId,
> using correct data types, replacements, etc.
> * also note that: nullSafeEval & doGenCode should likely note contain
> introduce extra branching based on collationId
>
> // 2. Collation-aware expression testing
> CollationSuite.scala
> * should be used for testing more general collation concepts
> * for example: collate/collation expressions, collation names, DDL, casting,
> aggregate, shuffle, join, etc.
> * also note that: no extra tests should generally be added
> CollationSupportSuite.java
> * should be used for expression unit tests, these tests should be as
> rigorous as possible in order to cover various cases
> * for example: unit tests that test collation-aware expression
> implementation for various collations (binary, lowercase, ICU)
> * also note that: these tests should generally be written after adding
> appropriate expression support in CollationSupport.java
> CollationStringExpressionsSuite.scala / CollationRegexpExpressionsSuite.scala
> / CollationExpressionSuite.scala
> * should be used for expression end-to-end tests, these tests should only
> cover crucial expression behaviour
> * for example: SQL tests that verify query execution results, expected
> return data types, casting, unsupported collation handling, etc.
> * also note that: these tests should generally be written after enabling
> appropriate expression support in stringExpressions.scala
>
> // 3. Closing notes
> * Carefully think about performance implications of newly added custom
> collation-aware expression implementation
> * for example: be very careful with extra string allocations (UTF8Strings ->
> (Java) String -> UTF8Strings, etc.)
> * also note that: some operations introduce very heavy performance penalties
> (we should avoid the ones we can)
>
> * Make sure to test all newly added expressions and completely (unit tests,
> end-to-end tests, etc.)
> * for example: consider edge, such as: empty strings, uppercase and
> lowercase mix, different byte-length chars, etc.
> * also note that: all similar tests should be uniform & readable and be kept
> in one place for various expressions
>
> * Consider how new expressions interact with the rest of the system
> (casting; collation support level - use correct AbstractStringType, etc.)
> * for example: we should watch out for casting, test it thoroughly, and use
> CollationTypeCasts if needed for new expressions
> * also note that: some expressions won’t support all collation types, so
> that should be clearly reflected in tests and dataTypes
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]