[ 
https://issues.apache.org/jira/browse/ARROW-14290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428134#comment-17428134
 ] 

Benson Muite commented on ARROW-14290:
--------------------------------------

Column stores are well suited to vectorization, [as explained in this 
tutorial|http://nms.csail.mit.edu/~stavros/pubs/tutorial2009-column_stores.pdf],
 though maybe there is something better.  My expectation is that the ordering 
comparison would use data that can be cached.  It may be faster to stream all 
the data being compared, O(column length), and do a lookup for each element, 
O(log dictionary size), for a total of O(column length * log dictionary size) 
operations, rather than stream the lookups past a small section of the data, 
which costs O(column length * dictionary size).  Which approach wins depends on 
the relative data sizes and on constants determined by the hardware and 
software implementations, though.

For numeric data, temporal data, and text data where UTF-8 encoding is 
sufficient for comparison, nothing special is needed and hardware support 
should be good.  For text data where the ordering used by UTF-8 is not 
appropriate, it is helpful to be able to consult a separate column of data, 
consult a separate table organized as a tree, or call a separate function.  
Using a separate column of data would fit well within Arrow, but there may be 
situations where that is inappropriate, so it may be good to allow for the 
other options as well.
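
To make two of these options concrete, here is a minimal sketch in plain C++ 
(not the Arrow kernel API).  BetweenByKeyColumn, BetweenWith and 
CaseInsensitiveLess are hypothetical names; the first assumes a parallel column 
of precomputed integer sort keys, the second takes a caller-supplied 
comparison function.  The ASCII case-insensitive comparator only stands in for 
a real collation.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Option: a separate column of precomputed sort keys, one per value.
// The comparison then runs on plain integers.
std::vector<bool> BetweenByKeyColumn(const std::vector<int>& keys,
                                     int lo_key, int hi_key) {
  assert(lo_key <= hi_key);
  std::vector<bool> out;
  out.reserve(keys.size());
  for (int key : keys) out.push_back(lo_key <= key && key <= hi_key);
  return out;
}

// Option: a separate comparison function. `less` must be a strict weak
// ordering; value v is in range when neither v < lo nor hi < v holds.
template <typename Less>
std::vector<bool> BetweenWith(const std::vector<std::string>& values,
                              const std::string& lo, const std::string& hi,
                              Less less) {
  std::vector<bool> out;
  out.reserve(values.size());
  for (const std::string& v : values) {
    out.push_back(!less(v, lo) && !less(hi, v));
  }
  return out;
}

// Illustrative comparator only: ASCII case-insensitive ordering.
struct CaseInsensitiveLess {
  bool operator()(const std::string& a, const std::string& b) const {
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](unsigned char x, unsigned char y) {
          return std::tolower(x) < std::tolower(y);
        });
  }
};
```

The key-column version keeps the inner loop free of function calls, which is 
the property that fits a columnar layout; the comparator version is more 
general but pays for a call per comparison.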

> [C++] String comparison in between ternary kernel
> -------------------------------------------------
>
>                 Key: ARROW-14290
>                 URL: https://issues.apache.org/jira/browse/ARROW-14290
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Benson Muite
>            Assignee: Benson Muite
>            Priority: Minor
>
> String comparisons in C++ will order by Unicode. This may not be suitable 
> in many language applications, for example when using characters from 
> languages that need more than ASCII.   Sorting algorithms often allow the 
> use of custom comparison functions.  It would be helpful to allow this for 
> the between kernel as well.  Initial work on the between kernel is 
> being tracked in https://issues.apache.org/jira/browse/ARROW-9843



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
