[jira] [Comment Edited] (ARROW-14122) [C++] interval comparison kernels

QP Hou (Jira) Sat, 25 Sep 2021 23:10:04 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-14122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420213#comment-17420213
 ]


QP Hou edited comment on ARROW-14122 at 9/26/21, 6:09 AM:
----------------------------------------------------------

[~westonpace] my dev list thread proposed that we make the interval type 
partially ordered in Arrow compute, which is what I am working on at the 
moment for the new Rust implementation: 
https://github.com/jorgecarleitao/arrow2/pull/398. I proposed that because I 
am trying to make the compute behavior compatible with the type semantics 
defined in the Arrow spec. It would be odd if in the spec we specify that a 
day can last between 23 and 25 hours due to daylight saving, but always use 
24 hours in compute. However, I found that partial-order semantics are not 
easy for users to understand due to various edge cases. For example, should we 
consider "1 days 22 hours" greater than "2 days -22 hours"? "1 days 23 hours" 
is not comparable to "2 days" because 2 days could have 50 hours or 46 hours, 
but should we consider "1 days 50 hours" greater than "2 days"?
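
A minimal sketch of one way the partial order discussed above could be 
evaluated. It is not the arrow2 or Arrow C++ implementation. The `DayTime` 
struct, the `hours` helper and `partial_cmp_day_time` are hypothetical names 
introduced for illustration, and the sketch assumes every day lasts between 
23 and 25 hours and compares two intervals by bounding their difference:

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a day-time interval (days plus milliseconds).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct DayTime {
    days: i64,
    millis: i64,
}

const MS_PER_HOUR: i64 = 3_600_000;
// Assumption: daylight saving makes a day last between 23 and 25 hours.
const MIN_DAY_MS: i64 = 23 * MS_PER_HOUR;
const MAX_DAY_MS: i64 = 25 * MS_PER_HOUR;

/// Convenience constructor: `d` days and `h` hours.
fn hours(d: i64, h: i64) -> DayTime {
    DayTime { days: d, millis: h * MS_PER_HOUR }
}

/// Returns `None` when the two intervals cannot be ordered, i.e. when the
/// sign of `a - b` depends on how long the days involved turn out to be.
fn partial_cmp_day_time(a: DayTime, b: DayTime) -> Option<Ordering> {
    let dd = a.days - b.days;
    let dms = a.millis - b.millis;
    // Smallest and largest possible length of `a - b` in milliseconds.
    let (lo, hi) = if dd >= 0 {
        (dd * MIN_DAY_MS + dms, dd * MAX_DAY_MS + dms)
    } else {
        (dd * MAX_DAY_MS + dms, dd * MIN_DAY_MS + dms)
    };
    if lo > 0 {
        Some(Ordering::Greater)
    } else if hi < 0 {
        Some(Ordering::Less)
    } else if lo == 0 && hi == 0 {
        Some(Ordering::Equal)
    } else {
        None
    }
}
```

Under these assumptions, "1 days 22 hours" and "1 days 50 hours" both compare 
greater than their counterparts in the examples above, while "1 days 23 
hours" versus "2 days" yields `None`.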

I also like the Joda time approach you mentioned in 
https://lists.apache.org/thread.html/rb7c2f111c4fb07ca7a0182f5608cf1380e6daabc05846e8503c1a7c3%40%3Cdev.arrow.apache.org%3E.
 Making the interval type totally unordered and requiring users to combine it 
with a timestamp for ordering makes everything really easy to understand.

For DataFusion, we will go with Postgres's approach because it aims to be 
Postgres compatible. This is not a problem for the DataFusion SQL interface 
because we never said that SQL types map one to one to Arrow types. In other 
words, the Arrow interval type semantics are an implementation detail that is 
hidden from users. The consequence of Postgres's behavior is that we won't be 
able to simply hash interval values by their physical bytes. We will need to 
normalize them first, i.e. "1 days 24 hours" and "2 days" should result in the 
same hash key in the hash aggregate and hash join compute kernels. Or maybe we 
could even make this compute semantic configurable in DataFusion if different 
users need different behavior.
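
A rough sketch of the normalization step described above, not DataFusion's 
actual code. `DayTime`, `normalized_millis` and `hash_normalized` are 
hypothetical names, and the sketch assumes the Postgres-style rule that a day 
always counts as 24 hours, so the hash is taken over the normalized total 
rather than over the raw (days, milliseconds) fields:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a day-time interval (days plus milliseconds).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct DayTime {
    days: i64,
    millis: i64,
}

// Assumption: a day is always treated as exactly 24 hours.
const MS_PER_DAY: i64 = 24 * 3_600_000;

/// Collapses the interval into a single millisecond count.
fn normalized_millis(v: DayTime) -> i64 {
    v.days * MS_PER_DAY + v.millis
}

/// Hashes the normalized value, so intervals that are equal under the
/// 24-hour rule produce the same key even if their raw fields differ.
fn hash_normalized(v: DayTime) -> u64 {
    let mut hasher = DefaultHasher::new();
    normalized_millis(v).hash(&mut hasher);
    hasher.finish()
}
```

With this rule, `DayTime { days: 1, millis: 86_400_000 }` and 
`DayTime { days: 2, millis: 0 }` differ field by field but share one 
normalized value and therefore one hash key.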

Regardless of which way we go, I think it would be good for all Arrow compute 
implementations to have the same behavior.

I am not familiar with the C++ code base, so please correct me if I am wrong. 
[~cpcloud] I believe https://github.com/apache/arrow/pull/10960 focuses on 
computing the interval from two timestamps, not on ordering between intervals?


> [C++] interval comparison kernels
> ---------------------------------
>
>                 Key: ARROW-14122
>                 URL: https://issues.apache.org/jira/browse/ARROW-14122
>             Project: Apache Arrow
>          Issue Type: Sub-task
>            Reporter: Phillip Cloud
>            Priority: Major
>              Labels: kernel
>
> Subtask for tracking interval comparison kernels



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
