[
https://issues.apache.org/jira/browse/ARROW-14290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428021#comment-17428021
]
Benson Muite commented on ARROW-14290:
--------------------------------------
The order given by Unicode code points may not match what a text application
needs. Unicode defines a
[collation algorithm|https://www.unicode.org/reports/tr10/], and the motivating
[examples|https://www.unicode.org/reports/tr10/#Example_Differences_Table]
given for it include:
* In Swedish z < ö, but in German ö < z
* In a German dictionary of < öf, but in a German phonebook öf < of
Being able to do fast sorting and comparison on text is important in a database,
and the Unicode encodings do not give orderings that match all applications.
Thus, we want to enable application developers to choose an ordering
appropriate for their applications, ideally without re-implementing comparison
and sorting. It may be the case that language plugins or extensions are easier
to support. An internal lookup table is ok, but the table size may vary from
~20 to ~5000 rows for a single language. If one needs to work with emoji,
things get more interesting. UTF-8 can encode 1,112,064 code points and there
are cases where two code points are used to encode one item, so a complete
implementation may have very poor worst-case running time. One may also want
to use another text encoding, for example UTF-32, EUC-KR, SJIS, or EUC-JP,
especially since sorting, searching and comparisons can then be done more
efficiently. An interface that accepts a user-supplied comparison may give the
flexibility to adapt to different circumstances. Arrow allows for flexible
schemas, so optimizations may be possible that are not possible with a regular
column.
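To make the idea concrete, here is a minimal sketch of a between check that takes a caller-supplied comparison. Nothing in it is Arrow API: `Between`, `TableCollation`, `MakeGermanDictionaryLike` and `CheckExample` are hypothetical names, strings are assumed to be already decoded to UTF-32, and the one-entry weight table is a toy stand-in for a real per-language lookup table, not the Unicode collation algorithm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical helper (not Arrow API): true when lo <= value <= hi under `less`.
template <typename T, typename Less>
bool Between(const T& value, const T& lo, const T& hi, Less less) {
  return !less(value, lo) && !less(hi, value);
}

// Toy collation backed by a lookup table. A code point without an entry gets
// weight 2 * code point, which leaves odd weights free for tailored entries.
struct TableCollation {
  std::unordered_map<char32_t, std::uint32_t> weights;

  std::uint32_t Weight(char32_t c) const {
    auto it = weights.find(c);
    return it == weights.end() ? 2u * static_cast<std::uint32_t>(c) : it->second;
  }

  // Strict weak ordering: lexicographic comparison of per-code-point weights.
  bool operator()(const std::u32string& a, const std::u32string& b) const {
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [this](char32_t x, char32_t y) { return Weight(x) < Weight(y); });
  }
};

// Assumption for illustration: place o-umlaut (U+00F6) directly after 'o',
// roughly as a German dictionary ordering would.
TableCollation MakeGermanDictionaryLike() {
  TableCollation c;
  c.weights[U'\u00F6'] = 2u * static_cast<std::uint32_t>(U'o') + 1u;
  return c;
}

void CheckExample() {
  const std::u32string a = U"a";
  const std::u32string z = U"z";
  const std::u32string o_umlaut = U"\u00F6";
  const TableCollation german = MakeGermanDictionaryLike();

  // Code point order puts U+00F6 after 'z', so it is outside [a, z].
  assert(!Between(o_umlaut, a, z, std::less<std::u32string>{}));
  // With the tailored table it sorts next to 'o', so it is inside [a, z].
  assert(Between(o_umlaut, a, z, german));
}
```

The same value falls inside or outside the range depending only on the comparison passed in, which is the flexibility being asked for. A production version would more likely delegate to an existing collation library than maintain its own tables.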
> [C++] String comparison in between ternary kernel
> -------------------------------------------------
>
> Key: ARROW-14290
> URL: https://issues.apache.org/jira/browse/ARROW-14290
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Benson Muite
> Assignee: Benson Muite
> Priority: Minor
>
> String comparisons in C++ will order by Unicode. This may not be suitable
> in many language applications, for example when using characters from
> languages that use more than ASCII. Sorting algorithms can often allow for
> the use of custom comparison functions. It would be helpful to allow for
> this for the between kernel as well. Initial work on the between kernel is
> being tracked in https://issues.apache.org/jira/browse/ARROW-9843
--
This message was sent by Atlassian Jira
(v8.3.4#803005)