[
https://issues.apache.org/jira/browse/ARROW-14122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420213#comment-17420213
]
QP Hou edited comment on ARROW-14122 at 9/26/21, 6:09 AM:
----------------------------------------------------------
[~westonpace] my dev list thread proposed making the interval type partially
ordered in Arrow compute, which is what I am working on at the moment for the
new Rust implementation:
https://github.com/jorgecarleitao/arrow2/pull/398. I proposed that because I
am trying to make the compute behavior compatible with the type semantics
defined in the Arrow spec. It would be odd if the spec says that a day can
have between 23 and 25 hours due to daylight saving, but compute always
assumes 24 hours. However, I found that partial-order semantics are not easy
for users to understand because of various edge cases. For example, should we
consider "1 days 22 hours" greater than "2 days -22 hours"? "1 days 23 hours"
is not comparable to "2 days" because 2 days could have 50 hours or 46 hours,
but should we consider "1 days 50 hours" greater than "2 days"?
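To make the edge cases above concrete, here is a minimal sketch of one possible
partial-order rule: map each interval to the range of hours it could span if
every day lasts between 23 and 25 hours, and only order two intervals when
those ranges do not overlap. The names DayTimeInterval, hour_bounds and
partial_compare are hypothetical, and this is not the code in the arrow2 PR.

```python
from typing import NamedTuple, Optional, Tuple

# Assumption: a calendar day lasts between 23 and 25 hours (daylight saving).
MIN_HOURS_PER_DAY = 23
MAX_HOURS_PER_DAY = 25


class DayTimeInterval(NamedTuple):
    """Hypothetical simplified interval holding only days and hours."""
    days: int
    hours: int


def hour_bounds(iv: DayTimeInterval) -> Tuple[int, int]:
    """Return the (lowest, highest) number of hours the interval could span."""
    if iv.days >= 0:
        lo = iv.days * MIN_HOURS_PER_DAY
        hi = iv.days * MAX_HOURS_PER_DAY
    else:
        # For negative day counts the longest days give the lowest bound.
        lo = iv.days * MAX_HOURS_PER_DAY
        hi = iv.days * MIN_HOURS_PER_DAY
    return lo + iv.hours, hi + iv.hours


def partial_compare(a: DayTimeInterval, b: DayTimeInterval) -> Optional[int]:
    """Return -1, 0 or 1 when a and b are ordered, None when they are not."""
    if a == b:
        return 0
    a_lo, a_hi = hour_bounds(a)
    b_lo, b_hi = hour_bounds(b)
    if a_hi < b_lo:
        return -1
    if a_lo > b_hi:
        return 1
    return None  # the ranges overlap, so the order depends on the calendar


print(partial_compare(DayTimeInterval(1, 22), DayTimeInterval(2, -22)))
print(partial_compare(DayTimeInterval(1, 23), DayTimeInterval(2, 0)))
print(partial_compare(DayTimeInterval(1, 50), DayTimeInterval(2, 0)))
```

Under this particular rule, "1 days 22 hours" (45 to 47 hours) is greater than
"2 days -22 hours" (24 to 28 hours), "1 days 23 hours" (46 to 48 hours) is not
comparable to "2 days" (46 to 50 hours), and "1 days 50 hours" is greater than
"2 days". Other rules are possible, which is part of why the semantics are
hard to explain.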
I also like the Joda time approach you mentioned in
https://lists.apache.org/thread.html/rb7c2f111c4fb07ca7a0182f5608cf1380e6daabc05846e8503c1a7c3%40%3Cdev.arrow.apache.org%3E.
Making the interval type totally unordered and requiring users to combine it
with a timestamp for ordering makes everything really easy to understand.
For DataFusion, we will go with Postgres's approach because DataFusion aims to
be Postgres compatible. This is not a problem for the DataFusion SQL interface
because we never said the SQL types map one to one to Arrow types. In other
words, the Arrow interval type semantics are an implementation detail that is
hidden from the users. The consequence of Postgres's behavior is that we won't
be able to simply hash interval values by their physical bytes. We will need to
normalize them first, i.e. "1 days 24 hours" and "2 days" should result in the
same hash key in the hash aggregate and hash join compute kernels. Or maybe we
could even make this compute semantic configurable in DataFusion if different
users need different behavior.
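As a rough sketch of the normalization step described above, assuming a
Postgres-style fixed 24 hour day: fold the hours into days before hashing so
that equivalent intervals share a key. DayTimeInterval, normalize and hash_key
are hypothetical names for illustration, not DataFusion APIs, and a real
kernel would also have to handle months and sub-hour fields.

```python
from typing import NamedTuple

# Assumption: every day is treated as exactly 24 hours.
HOURS_PER_DAY = 24


class DayTimeInterval(NamedTuple):
    """Hypothetical simplified interval holding only days and hours."""
    days: int
    hours: int


def normalize(iv: DayTimeInterval) -> DayTimeInterval:
    """Rewrite the interval so that 0 <= hours < 24."""
    total_hours = iv.days * HOURS_PER_DAY + iv.hours
    days, hours = divmod(total_hours, HOURS_PER_DAY)
    return DayTimeInterval(days, hours)


def hash_key(iv: DayTimeInterval) -> int:
    """Hash the normalized form instead of the raw physical fields."""
    return hash(normalize(iv))


a = DayTimeInterval(1, 24)
b = DayTimeInterval(2, 0)
print(a == b)                        # the raw fields differ
print(normalize(a) == normalize(b))  # the normalized forms match
print(hash_key(a) == hash_key(b))    # so both land in the same hash bucket
```

The same normalized form could serve as the grouping key in a hash aggregate
or the join key in a hash join.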
Regardless of which way we go, I think it would be good for all Arrow compute
implementations to behave consistently.
I am not familiar with the C++ code base, so please correct me if I am wrong.
[~cpcloud] I believe https://github.com/apache/arrow/pull/10960 focuses on
computing the interval between two timestamps, not on ordering intervals. Is
that right?
> [C++] interval comparison kernels
> ---------------------------------
>
> Key: ARROW-14122
> URL: https://issues.apache.org/jira/browse/ARROW-14122
> Project: Apache Arrow
> Issue Type: Sub-task
> Reporter: Phillip Cloud
> Priority: Major
> Labels: kernel
>
> Subtask for tracking interval comparison kernels