[ https://issues.apache.org/jira/browse/ARROW-16172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521318#comment-17521318 ]
Weston Pace commented on ARROW-16172:
-------------------------------------
So all the auto-casting we do today happens inside expression evaluation, which
I suppose is why it isn't handled in the join. In expression evaluation we are
looking for a combination of a kernel and casts; here we would just be looking
for a "least common ancestor" of sorts. That is an easier problem, but it is a
slightly different one, so we don't have an exact function for it. I'll attempt
to transcribe a rough approximation of our current rules so we can see if they
make sense:
* The current casting mechanism dictionary-decodes everything, so the common
ancestor of Dictionary<int8, string> and Dictionary<int8, large_string> is
large_string and not Dictionary<int8, large_string>.
** I don't know whether the hash join node decodes dictionaries or not, but it
would be nice if that were up to the hash join node, so I think the "least
common ancestor" logic should not automatically decode.
* We never automatically cast from string to integer or from boolean to
integer. I could see some people arguing that these are valid casts, but I
think the majority would agree with the decisions here.
* null + x => x
* decimal + float => float
* decimal + int => decimal
* decimal128 + decimal256 => decimal256
* decimal<A,B> + decimal<C,D> => decimal<E,F> based on rules appropriate to
the function
** In this case I think the rule would be to grow as much as possible and
sacrifice scale if needed.
*** decimal<20,6> + decimal<10,8> => decimal<20,8>
*** decimal<20,10> + decimal<36,6> => decimal<38,8>
* float + int => float
* int + uint => unsigned next power of 2 (e.g. int32 + uint32 => uint64)
** Caps at uint64 so int64 + uint64 => uint64
* int + int / uint + uint => widest type (e.g. int32 + int16 => int32)
* temporal + temporal => finest resolution (e.g. timestamp<s> + timestamp<ms>
=> timestamp<ms>)
** Crossing of temporal types is not allowed (e.g. cannot go from date to
duration or from duration to timestamp)
I think these rules still hold up (except possibly for dictionary encoding,
but we could solve that later). I wouldn't be opposed to adding this to the
join.
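To make the shape of that concrete, here is a rough, untested sketch of what a
standalone "least common ancestor" helper could look like for just the
null/float/integer rules above. The name CommonJoinKeyType and its helper
functions are made up for illustration (this is not an existing Arrow API), and
dictionary, decimal, and temporal handling are left out:
{code}
// Illustration only: CommonJoinKeyType and its helpers are made-up names, not
// an existing Arrow API.  Covers just the null/float/integer rules above.
#include <arrow/type.h>

#include <algorithm>
#include <memory>

using arrow::DataType;
using arrow::Type;

namespace {

// Bit width of the fixed-width numeric types handled here.
int NumericWidth(Type::type id) {
  switch (id) {
    case Type::INT8:
    case Type::UINT8:
      return 8;
    case Type::INT16:
    case Type::UINT16:
      return 16;
    case Type::INT32:
    case Type::UINT32:
    case Type::FLOAT:
      return 32;
    default:
      return 64;  // INT64 / UINT64 / DOUBLE
  }
}

bool IsSignedInt(Type::type id) {
  return id == Type::INT8 || id == Type::INT16 || id == Type::INT32 ||
         id == Type::INT64;
}

bool IsUnsignedInt(Type::type id) {
  return id == Type::UINT8 || id == Type::UINT16 || id == Type::UINT32 ||
         id == Type::UINT64;
}

bool IsFloat(Type::type id) { return id == Type::FLOAT || id == Type::DOUBLE; }

}  // namespace

// Returns the "least common ancestor" of two join key types, or nullptr when
// no implicit cast should happen (e.g. string vs. integer).
std::shared_ptr<DataType> CommonJoinKeyType(const std::shared_ptr<DataType>& a,
                                            const std::shared_ptr<DataType>& b) {
  // null + x => x
  if (a->id() == Type::NA) return b;
  if (b->id() == Type::NA) return a;
  // Identical types need no cast at all.
  if (a->Equals(*b)) return a;
  // float + int => float; float + float => the wider float
  if (IsFloat(a->id()) || IsFloat(b->id())) {
    return (a->id() == Type::DOUBLE || b->id() == Type::DOUBLE)
               ? arrow::float64()
               : arrow::float32();
  }
  const bool a_int = IsSignedInt(a->id()) || IsUnsignedInt(a->id());
  const bool b_int = IsSignedInt(b->id()) || IsUnsignedInt(b->id());
  if (a_int && b_int) {
    const int width = std::max(NumericWidth(a->id()), NumericWidth(b->id()));
    if (IsSignedInt(a->id()) != IsSignedInt(b->id())) {
      // int + uint => unsigned next power of two, capped at uint64
      switch (std::min(width * 2, 64)) {
        case 16: return arrow::uint16();
        case 32: return arrow::uint32();
        default: return arrow::uint64();
      }
    }
    // int + int / uint + uint => the widest of the two
    return NumericWidth(a->id()) >= NumericWidth(b->id()) ? a : b;
  }
  // Everything else (string vs. int, date vs. duration, ...): no implicit cast.
  return nullptr;
}
{code}
The join could then cast each key column to whatever this returns (or report
today's type error when it returns nullptr) before building and probing the
hash table.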
> it's a surprising UX that this doesn't work + I need to type coerce on my own
> for this.
In general, I think it is a slippery slope from "intuitive UX" to "invisibly
inefficient queries". There's also the "death by 1000 cuts" phenomenon of
gradually and unintentionally writing our own plan optimizer that we then have
to maintain. However, this could all just be FUD. On the face of it, I can't
see anything immediately harmful about adding this functionality.
> [C++] cast when reasonable for join keys
> ----------------------------------------
>
> Key: ARROW-16172
> URL: https://issues.apache.org/jira/browse/ARROW-16172
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Jonathan Keane
> Priority: Major
>
> Joining an integer column with a float column that happens to contain whole
> numbers raises an error. For kernels, we would autocast in this circumstance,
> so it's a surprising UX that this doesn't work + I need to type coerce on my
> own for this.
> {code}
> library(arrow, warn.conflicts = FALSE)
> #> See arrow_info() for available features
> library(dplyr, warn.conflicts = FALSE)
> tab_int <- arrow_table(data.frame(let = letters, num = 1L:26L))
> tab_float <- arrow_table(data.frame(let = letters, num = as.double(1:26)))
> left_join(tab_int, tab_float) %>% collect()
> #> Error in `handle_csv_read_error()`:
> #> ! Invalid: Incompatible data types for corresponding join field keys:
> FieldRef.Name(num) of type int32 and FieldRef.Name(num) of type double
> {code}