[
https://issues.apache.org/jira/browse/ARROW-12046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427570#comment-17427570
]
Benson Muite commented on ARROW-12046:
--------------------------------------
Compression will be helpful. One way of sorting Mandarin is by stroke count,
for which there is [a table for
Unicode|https://www.unicode.org/reports/tr38/#SortingAlgorithm]. However, this
is not the only way, so allowing developers the freedom to create appropriate
comparison functions is useful.
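
To make the idea of developer-supplied comparison functions concrete, here is a minimal C++ sketch. It is not an Arrow API: {{StrokeCount}}, {{StrokeLess}}, {{SortStrings}} and the four-entry table are hypothetical names and data invented for illustration, and a real table would be generated from the Unihan data referenced above. Strings are held as {{std::u32string}} so that each element is one code point.

```cpp
#include <algorithm>
#include <functional>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, deliberately tiny stroke-count table (assumption: a real
// implementation would load the full Unihan stroke data instead).
inline int StrokeCount(char32_t c) {
  static const std::unordered_map<char32_t, int> kStrokes = {
      {U'\u4E00', 1},  // U+4E00, 1 stroke
      {U'\u4EBA', 2},  // U+4EBA, 2 strokes
      {U'\u5927', 3},  // U+5927, 3 strokes
      {U'\u4E2D', 4},  // U+4E2D, 4 strokes
  };
  auto it = kStrokes.find(c);
  // Characters missing from the table sort after all known ones.
  return it == kStrokes.end() ? std::numeric_limits<int>::max() : it->second;
}

// Comparison function: order character by character, first by stroke
// count and then by code point to break ties.
inline bool StrokeLess(const std::u32string& a, const std::u32string& b) {
  return std::lexicographical_compare(
      a.begin(), a.end(), b.begin(), b.end(), [](char32_t x, char32_t y) {
        const int sx = StrokeCount(x);
        const int sy = StrokeCount(y);
        return sx != sy ? sx < sy : x < y;
      });
}

using StringCompare =
    std::function<bool(const std::u32string&, const std::u32string&)>;

// Sorts with a caller-supplied comparison; the default is plain
// code-point order, analogous to the current bytewise behaviour.
inline std::vector<std::u32string> SortStrings(
    std::vector<std::u32string> values,
    const StringCompare& less = std::less<std::u32string>()) {
  std::stable_sort(values.begin(), values.end(), less);
  return values;
}
```

Calling {{SortStrings(values)}} orders the four sample characters by code point (U+4E00, U+4E2D, U+4EBA, U+5927), while {{SortStrings(values, StrokeLess)}} orders them by stroke count (U+4E00, U+4EBA, U+5927, U+4E2D). Going through {{std::function}} for every comparison is one place where the loss of efficiency would show up.
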
[Minimal-icu-collation|https://github.com/Mytherin/minimal-icu-collation] also
seems helpful, but is not vectorized. JavaScript has a specification for
[comparison
functions|https://402.ecma-international.org/8.0/#sec-collator-compare-functions].
While there will be a loss of efficiency, something like this seems most
appropriate. An example of its use can be found in this [DataTables
blog|https://www.datatables.net/blog/2017-02-28].
[QString|https://doc.qt.io/qt-5/qstring.html] has a [locale-aware
comparison|https://doc.qt.io/qt-5/qstring.html#localeAwareCompare], but its
licensing only allows Apache-licensed code to be incorporated into GPL or LGPL
code, and not the other way round.
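
For comparison, the C++ standard library has a locale-aware comparison of its own: {{std::locale}} has a call operator that compares two strings through the locale's {{std::collate}} facet, so a locale object can be passed directly as the comparator to {{std::sort}}. The sketch below assumes nothing about Arrow. {{LocaleOrClassic}} and {{SortWithLocale}} are hypothetical helper names, and which named locales exist (for example {{en_US.UTF-8}}) depends on the host system.

```cpp
#include <algorithm>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Returns the named locale if the host provides it; otherwise falls back
// to the classic "C" locale. std::locale's constructor throws
// std::runtime_error for names it cannot resolve.
inline std::locale LocaleOrClassic(const std::string& name) {
  try {
    return std::locale(name.c_str());
  } catch (const std::runtime_error&) {
    return std::locale::classic();
  }
}

// Sorts using the collation rules of `loc`. std::locale::operator()
// delegates to the locale's std::collate facet, so the locale object
// itself serves as the comparison function.
inline std::vector<std::string> SortWithLocale(std::vector<std::string> values,
                                               const std::locale& loc) {
  std::sort(values.begin(), values.end(), loc);
  return values;
}
```

With {{std::locale::classic()}} this reproduces plain byte order, so "Cherry" sorts before "apple". With a named locale such as {{LocaleOrClassic("en_US.UTF-8")}} the result depends on the collation data installed on the machine, which is one reason a bundled collation library may still be preferable.
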
> [C++] Support string collation for sorting
> ------------------------------------------
>
> Key: ARROW-12046
> URL: https://issues.apache.org/jira/browse/ARROW-12046
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 3.0.0
> Reporter: Ian Cook
> Priority: Major
> Fix For: 7.0.0
>
>
> Currently the C++ library orders strings lexicographically as bytestrings. We
> should implement the capability to change string sorting behavior based on
> locale settings for string collation.