[ 
https://issues.apache.org/jira/browse/SPARK-47617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Mandic updated SPARK-47617:
----------------------------------
    Description: 
As collation support grows across all SQL features and new collation types are 
added, we need a reliable testing model that covers as many standard SQL 
capabilities as possible.

We can use the TPC-DS testing infrastructure already present in Spark. The idea 
is to vary the string columns of the TPC-DS tables by applying multiple 
collations with different ordering rules and case sensitivity, producing new 
tables. These tables should yield the same results for the predefined TPC-DS 
queries within certain batches of collations. For example, when comparing query 
runs on a table whose columns are collated first as UTF8_BINARY and then as 
UTF8_BINARY_LCASE, we should get the same results after converting them to 
lowercase.

Introduce a new query suite that tests the described behavior with the 
available collations (utf8_binary and unicode) combined with case conversions 
(lowercase, uppercase, and randomized case for fuzz testing).
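
The equivalence check described above can be sketched as a toy model in plain 
Python. This is only an illustration of the idea, not Spark code and not the 
proposed suite: the helper names (randomize_case, group_count, normalize), the 
sample rows, and the seed are all hypothetical, and the two key functions 
merely stand in for UTF8_BINARY and UTF8_BINARY_LCASE.

```python
import random
from collections import Counter


def randomize_case(s, rng):
    """Flip each character to upper or lower case at random (fuzzing input)."""
    return "".join(c.upper() if rng.random() < 0.5 else c.lower() for c in s)


def group_count(rows, key):
    """Toy stand-in for SELECT col, COUNT(*) ... GROUP BY col.

    `key` maps a string to its collation key; values with equal keys
    land in the same group. The first value seen represents the group.
    """
    counts = Counter()
    representative = {}
    for value in rows:
        k = key(value)
        counts[k] += 1
        representative.setdefault(k, value)
    return [(representative[k], n) for k, n in counts.items()]


def normalize(result):
    """Lowercase the string column and sort, so runs can be compared."""
    return sorted((value.lower(), n) for value, n in result)


# Assumption: these two functions only model the collations' equality rules.
binary_key = lambda s: s         # models UTF8_BINARY (exact comparison)
lcase_key = lambda s: s.lower()  # models UTF8_BINARY_LCASE (case-insensitive)

base = ["music", "books", "music", "shoes", "books", "music"]
rng = random.Random(47617)  # fixed seed keeps the fuzzed input reproducible
fuzzed = [randomize_case(s, rng) for s in base]

# Run 1: lowercased data under the binary collation.
expected = normalize(group_count([s.lower() for s in base], binary_key))
# Run 2: randomized-case data under the case-insensitive collation.
actual = normalize(group_count(fuzzed, lcase_key))

assert expected == actual
```

In a real suite the two runs would be TPC-DS queries executed against 
differently collated copies of the same tables; the sketch only shows the 
shape of the comparison.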

  was:
As collation support grows across all SQL features and new collation types are 
added, we need to have reliable testing model covering as many standard SQL 
capabilities as possible.

We can utilize TCP-DS testing infrastructure already present in Spark. The idea 
is to vary TCP-DS table string columns by adding multiple collations with 
different ordering rules and case sensitivity, producing new tables. These 
tables should yield the same results against predefined TCP-DS queries for 
certain batches of collations. For example, when comparing query runs on table 
where columns are first collated as UTF8_BINARY and then as UTF8_BINARY_LCASE, 
we should be getting same results after converting to lowercase.

Introduce new query suite which tests the described behavior with available 
collations (utf8_binary and unicode) combined with case conversions (lowercase, 
uppercase, randomized case for fuzzy testing).


> Add TPC-DS testing infrastructure for collations
> ------------------------------------------------
>
>                 Key: SPARK-47617
>                 URL: https://issues.apache.org/jira/browse/SPARK-47617
>             Project: Spark
>          Issue Type: Task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Nikola Mandic
>            Priority: Major
>
> As collation support grows across all SQL features and new collation types 
> are added, we need a reliable testing model that covers as many standard 
> SQL capabilities as possible.
> We can use the TPC-DS testing infrastructure already present in Spark. The 
> idea is to vary the string columns of the TPC-DS tables by applying multiple 
> collations with different ordering rules and case sensitivity, producing new 
> tables. These tables should yield the same results for the predefined TPC-DS 
> queries within certain batches of collations. For example, when comparing 
> query runs on a table whose columns are collated first as UTF8_BINARY and 
> then as UTF8_BINARY_LCASE, we should get the same results after converting 
> them to lowercase.
> Introduce a new query suite that tests the described behavior with the 
> available collations (utf8_binary and unicode) combined with case 
> conversions (lowercase, uppercase, and randomized case for fuzz testing).


