[jira] [Commented] (ARROW-14122) [C++] interval comparison kernels

Weston Pace (Jira) Mon, 27 Sep 2021 11:10:06 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-14122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420946#comment-17420946
 ]


Weston Pace commented on ARROW-14122:
-------------------------------------

> For datafusion, we will go with postgres's approach because it aims to be 
> postgres compatible. This is not a problem for datafusion SQL interface 
> because we never said the SQL types map one to one to Arrow types. In other 
> words, Arrow interval type semantic is an implementation detail that's hidden 
> from the users. The consequence of postgres's behavior is we won't be able to 
> simply hash interval types by their physical bytes. We will need to normalize 
> them first, i.e. "1 day 24 hours" and "2 days" should result in the same hash 
> key in hash aggregate and hash join compute kernels. Or maybe we could even 
> make this compute semantic configurable in datafusion if different users need 
> different behavior depending on their needs.

> Regardless of which way we go, I think it would be good for all Arrow compute 
> implementations to have the same consistent behavior.

A Postgres SQL query will still need to map down to some kind of IR, so even 
if we don't define the interval semantics at the "Arrow data type" level, I 
think they would need to be defined at some level.

What if we were to phrase it this way:

* The Interval type has no ordering (looks like partial ordering is up for 
debate but I don't actually know what that buys us)
* There is an extension type "Postgres Interval" (I don't think it matters 
whether we call it an Arrow extension type, an Arrow Compute IR type, or a 
substrait type) which has a total ordering based on 24 hour days, 30 day 
months, and 360 day years
* There is a cast from Arrow interval to Postgres Interval

Query plan producers that want to maintain Postgres compatibility can insert 
the cast.
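
To make the proposed ordering concrete, here is a minimal sketch of how a 
"Postgres Interval" comparison could be derived from a month/day/nanosecond 
interval using 24 hour days and 30 day months (so 12 months is 360 days). The 
struct and function names (MonthDayNano, NormalizedInterval, Normalize, 
IntervalLess, IntervalEqual) are hypothetical and are not Arrow or Postgres 
APIs; the field layout is only an assumed stand-in for the Arrow interval.

```cpp
#include <cstdint>
#include <tuple>

// Hypothetical stand-in for a month/day/nanosecond interval value.
struct MonthDayNano {
  int32_t months;
  int32_t days;
  int64_t nanos;
};

// Assumed canonical form: whole days plus a remainder in [0, kNanosPerDay).
struct NormalizedInterval {
  int64_t days;
  int64_t nanos;
};

constexpr int64_t kNanosPerDay = 24LL * 60 * 60 * 1000000000LL;
constexpr int64_t kDaysPerMonth = 30;

// Fold months into days (30 day months) and whole days out of the
// nanosecond field (24 hour days). 64-bit days cannot overflow here.
inline NormalizedInterval Normalize(const MonthDayNano& v) {
  int64_t days = static_cast<int64_t>(v.months) * kDaysPerMonth + v.days;
  int64_t carry = v.nanos / kNanosPerDay;  // truncates toward zero
  int64_t rem = v.nanos % kNanosPerDay;
  if (rem < 0) {  // convert to floor division so rem is non-negative
    rem += kNanosPerDay;
    --carry;
  }
  return NormalizedInterval{days + carry, rem};
}

// Total ordering: compare the canonical (days, nanos) pairs lexicographically.
inline bool IntervalLess(const MonthDayNano& a, const MonthDayNano& b) {
  const NormalizedInterval x = Normalize(a);
  const NormalizedInterval y = Normalize(b);
  return std::tie(x.days, x.nanos) < std::tie(y.days, y.nanos);
}

inline bool IntervalEqual(const MonthDayNano& a, const MonthDayNano& b) {
  const NormalizedInterval x = Normalize(a);
  const NormalizedInterval y = Normalize(b);
  return x.days == y.days && x.nanos == y.nanos;
}
```

Under these assumptions "1 day 24 hours" and "2 days" compare equal, as do 
"12 months" and "360 days", even though their physical fields differ.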

So then, if I understand correctly, the point on hashing comes down to whether 
the cast from Arrow Interval to Postgres Interval is a zero-copy, 
metadata-only cast or whether the bytes need to be mutated for consistent 
hashing.  I don't know enough about the design of either system's hashing 
impl to answer that.
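
As a rough illustration of the hashing question, the sketch below hashes an 
interval after normalizing it, so that equivalent intervals produce the same 
key. It assumes the same 24 hour day / 30 day month rules; MonthDayNano, 
Combine, HashPhysical, and HashNormalized are hypothetical names, and the hash 
mixing is a generic hash-combine, not what Arrow or datafusion actually use.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for a month/day/nanosecond interval value.
struct MonthDayNano {
  int32_t months;
  int32_t days;
  int64_t nanos;
};

constexpr int64_t kNanosPerDay = 24LL * 60 * 60 * 1000000000LL;
constexpr int64_t kDaysPerMonth = 30;

// Generic hash-combine step (an assumption, not a library API).
inline std::size_t Combine(std::size_t seed, int64_t v) {
  const std::size_t h = std::hash<int64_t>{}(v);
  return seed ^ (h + static_cast<std::size_t>(0x9e3779b97f4a7c15ULL) +
                 (seed << 6) + (seed >> 2));
}

// Hash of the physical fields: what a zero-copy, metadata-only cast
// followed by a field-wise (or byte-wise) hash would effectively see.
inline std::size_t HashPhysical(const MonthDayNano& v) {
  std::size_t seed = 0;
  seed = Combine(seed, v.months);
  seed = Combine(seed, v.days);
  seed = Combine(seed, v.nanos);
  return seed;
}

// Hash of the canonical (days, nanos) form: the values are rewritten
// before hashing, so equivalent intervals always share a key.
inline std::size_t HashNormalized(const MonthDayNano& v) {
  int64_t days = static_cast<int64_t>(v.months) * kDaysPerMonth + v.days;
  int64_t carry = v.nanos / kNanosPerDay;  // truncates toward zero
  int64_t rem = v.nanos % kNanosPerDay;
  if (rem < 0) {  // keep the remainder in [0, kNanosPerDay)
    rem += kNanosPerDay;
    --carry;
  }
  std::size_t seed = 0;
  seed = Combine(seed, days + carry);
  seed = Combine(seed, rem);
  return seed;
}
```

With this sketch, HashNormalized gives "1 day 24 hours" and "2 days" the same 
key, whereas HashPhysical offers no such guarantee because the stored fields 
differ. Whether the normalization happens inside the cast or inside the hash 
kernel is exactly the open design question.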

> [C++] interval comparison kernels
> ---------------------------------
>
>                 Key: ARROW-14122
>                 URL: https://issues.apache.org/jira/browse/ARROW-14122
>             Project: Apache Arrow
>          Issue Type: Sub-task
>            Reporter: Phillip Cloud
>            Priority: Major
>              Labels: kernel
>
> Subtask for tracking interval comparison kernels



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
