[ 
https://issues.apache.org/jira/browse/ARROW-14290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428021#comment-17428021
 ] 

Benson Muite commented on ARROW-14290:
--------------------------------------

The order given by Unicode code points may not match what a text application 
expects.  Unicode defines a 
[collation algorithm|https://www.unicode.org/reports/tr10/], and the 
[motivating examples|https://www.unicode.org/reports/tr10/#Example_Differences_Table] 
given for it include:
 * In Swedish z < ö, but in German ö < z
 * In a German dictionary of < öf, but in a German phonebook öf < of

Fast sorting and comparison of text is important in a database, and the 
Unicode encodings do not give orderings that match all applications.  Thus, 
we want to enable application developers to choose an ordering appropriate 
for their applications, ideally without re-implementing comparison and 
sorting.  Language plugins or extensions may be easier to support.  An 
internal lookup table is OK, but the table size may vary from ~20 to ~5000 
rows for a single language.  If one needs to work with emojis, things get 
more interesting.  UTF-8 can encode 1,112,064 code points, and there are 
cases where two code points are used to encode one item, so something 
complete may have very poor worst-case running time.  One may also want to 
use another text encoding, for example UTF-32, EUC-KR, SJIS or EUC-JP, 
especially since sorting, searching and comparisons can then be done more 
efficiently.  An interface that accepts a custom comparison may give the 
flexibility to adapt to different circumstances.  Arrow allows for flexible 
schemas, so optimizations may be possible that are not possible with a 
regular column.
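
To make the idea concrete, here is a minimal sketch of a between check that 
takes the comparison as a parameter.  It is not the Arrow kernel and uses no 
Arrow API: Between, TableLess, GermanDictionaryLess, SwedishLess and 
RunExample are hypothetical names, the text is assumed to be UTF-32 already, 
and the lookup tables cover only ö (U+00F6).  It is a toy single-level 
ordering, not the Unicode collation algorithm.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical helper (not an Arrow API): true when lo <= value <= hi,
// where the ordering is whatever `less` defines.
template <typename T, typename Less>
bool Between(const T& value, const T& lo, const T& hi, Less less) {
  return !less(value, lo) && !less(hi, value);
}

// Toy single-level collation over UTF-32 text. A character's default rank is
// twice its code point, which leaves the odd ranks free so that a table entry
// can place a character immediately after another one.
struct TableLess {
  std::map<char32_t, long> overrides;

  static long Base(char32_t c) { return 2L * static_cast<long>(c); }

  long Rank(char32_t c) const {
    auto it = overrides.find(c);
    return it != overrides.end() ? it->second : Base(c);
  }

  bool operator()(const std::u32string& a, const std::u32string& b) const {
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [this](char32_t x, char32_t y) { return Rank(x) < Rank(y); });
  }
};

// Assumed one-entry tables for illustration; real tables are far larger.
inline TableLess GermanDictionaryLess() {
  return TableLess{{{U'\u00F6', TableLess::Base(U'o') + 1}}};  // ö right after o
}

inline TableLess SwedishLess() {
  return TableLess{{{U'\u00F6', TableLess::Base(U'z') + 1}}};  // ö after z
}

// Usage: the same value is inside or outside [a, z] depending on the ordering.
inline void RunExample() {
  const std::u32string o_umlaut = U"\u00F6", a = U"a", z = U"z";
  // Plain code point order: U+00F6 is greater than 'z'.
  assert(!Between(o_umlaut, a, z, std::less<std::u32string>()));
  // German dictionary order: ö sorts before z.
  assert(Between(o_umlaut, a, z, GermanDictionaryLess()));
  // Swedish order: ö sorts after z.
  assert(!Between(o_umlaut, a, z, SwedishLess()));
}
```

Passing the comparison as a template parameter keeps the default code point 
path free of indirection, while a language plugin could supply its own table 
or a full collation implementation.  The German phonebook order, where ö 
expands to "oe", would need an expansion step that this sketch leaves out.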

> [C++] String comparison in between ternary kernel
> -------------------------------------------------
>
>                 Key: ARROW-14290
>                 URL: https://issues.apache.org/jira/browse/ARROW-14290
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Benson Muite
>            Assignee: Benson Muite
>            Priority: Minor
>
> String comparisons in C++ use Unicode code point order. This may not be 
> suitable in many language applications, for example when using characters 
> from languages that use more than ASCII.   Sorting algorithms can often 
> allow for the use of custom comparison functions.  It would be helpful to 
> allow for this in the between kernel as well.  Initial work on the between 
> kernel is being tracked in https://issues.apache.org/jira/browse/ARROW-9843



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
