[ https://issues.apache.org/jira/browse/ARROW-16172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521318#comment-17521318 ]
Weston Pace commented on ARROW-16172:
-------------------------------------
So all the auto-casting we do today happens inside expression evaluation, which
I suppose is why it isn't handled in the join. In expression evaluation we are
looking for a combination of a kernel and casts; here we would just be looking
for a "least common ancestor" of sorts. That is an easier problem, but it is a
slightly different one, so we don't have an exact function for it. I'll attempt
to transcribe a rough approximation of our current rules so we can see if they
make sense:
* The current casting mechanism dictionary-decodes everything, so the common
ancestor of Dictionary<int8, string> and Dictionary<int8, large_string> is
large_string and not Dictionary<int8, large_string>.
** I don't know whether the hash join node decodes dictionaries or not, but it
would be nice if that were up to the hash join node, so I think the "least
common ancestor" logic should not automatically decode.
* We never automatically cast from string to integer or from boolean to
integer. I could see some people arguing that these are valid casts, but I
think the majority would agree with the decisions here.
* null + x => x
* decimal + float => float
* decimal + int => decimal
* decimal128 + decimal256 => decimal256
* decimal<A,B> + decimal<C,D> => decimal<E,F> based on rules appropriate to
the function
** In this case I think the rule would be to grow as much as possible and
sacrifice scale if needed.
*** decimal<20,6> + decimal<10,8> => decimal<20,8>
*** decimal<20,10> + decimal<36,6> => decimal<38,8>
* float + int => float
* int + uint => unsigned next power of 2 (e.g. int32 + uint32 => uint64)
** Caps at uint64 so int64 + uint64 => uint64
* int + int / uint + uint => widest type (e.g. int32 + int16 => int32)
* temporal + temporal => finest resolution (e.g. timestamp<s> + timestamp<ms>
=> timestamp<ms>)
** Crossing of temporal types is not allowed (e.g. cannot go from date to
duration or from duration to timestamp)
I think these rules still hold up (except possibly for dictionary encoding,
but we could solve that later). I wouldn't be opposed to adding this to the
join.
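To make the shape of that concrete, here is a rough, untested sketch of what a
standalone "least common ancestor" helper could look like for just the
null/float/integer rules above. The name CommonJoinKeyType and its helper
functions are made up for illustration (this is not an existing Arrow API), and
dictionary, decimal, and temporal handling are left out:
{code}
// Illustration only: CommonJoinKeyType and its helpers are made-up names, not
// an existing Arrow API.  Covers just the null/float/integer rules above.
#include <arrow/type.h>

#include <algorithm>
#include <memory>

using arrow::DataType;
using arrow::Type;

namespace {

// Bit width of the fixed-width numeric types handled here.
int NumericWidth(Type::type id) {
  switch (id) {
    case Type::INT8:
    case Type::UINT8:
      return 8;
    case Type::INT16:
    case Type::UINT16:
      return 16;
    case Type::INT32:
    case Type::UINT32:
    case Type::FLOAT:
      return 32;
    default:
      return 64;  // INT64 / UINT64 / DOUBLE
  }
}

bool IsSignedInt(Type::type id) {
  return id == Type::INT8 || id == Type::INT16 || id == Type::INT32 ||
         id == Type::INT64;
}

bool IsUnsignedInt(Type::type id) {
  return id == Type::UINT8 || id == Type::UINT16 || id == Type::UINT32 ||
         id == Type::UINT64;
}

bool IsFloat(Type::type id) { return id == Type::FLOAT || id == Type::DOUBLE; }

}  // namespace

// Returns the "least common ancestor" of two join key types, or nullptr when
// no implicit cast should happen (e.g. string vs. integer).
std::shared_ptr<DataType> CommonJoinKeyType(const std::shared_ptr<DataType>& a,
                                            const std::shared_ptr<DataType>& b) {
  // null + x => x
  if (a->id() == Type::NA) return b;
  if (b->id() == Type::NA) return a;
  // Identical types need no cast at all.
  if (a->Equals(*b)) return a;
  // float + int => float; float + float => the wider float
  if (IsFloat(a->id()) || IsFloat(b->id())) {
    return (a->id() == Type::DOUBLE || b->id() == Type::DOUBLE)
               ? arrow::float64()
               : arrow::float32();
  }
  const bool a_int = IsSignedInt(a->id()) || IsUnsignedInt(a->id());
  const bool b_int = IsSignedInt(b->id()) || IsUnsignedInt(b->id());
  if (a_int && b_int) {
    const int width = std::max(NumericWidth(a->id()), NumericWidth(b->id()));
    if (IsSignedInt(a->id()) != IsSignedInt(b->id())) {
      // int + uint => unsigned next power of two, capped at uint64
      switch (std::min(width * 2, 64)) {
        case 16: return arrow::uint16();
        case 32: return arrow::uint32();
        default: return arrow::uint64();
      }
    }
    // int + int / uint + uint => the widest of the two
    return NumericWidth(a->id()) >= NumericWidth(b->id()) ? a : b;
  }
  // Everything else (string vs. int, date vs. duration, ...): no implicit cast.
  return nullptr;
}
{code}
The join could then cast each key column to whatever this returns (or report
today's type error when it returns nullptr) before building and probing the
hash table.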
> it's a surprising UX that this doesn't work + I need to type coerce on my own
> for this.
In general, I think it is a slippery slope from "intuitive UX" to "invisibly
inefficient queries". There's also the "death by 1000 cuts" phenomenon of
gradually and unintentionally writing our own plan optimizer that we then have
to maintain. However, this could all just be FUD. On the face of it, I can't
see anything immediately harmful about adding this functionality.
> [C++] cast when reasonable for join keys
> ----------------------------------------
>
> Key: ARROW-16172
> URL: https://issues.apache.org/jira/browse/ARROW-16172
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Jonathan Keane
> Priority: Major
>
> Joining an integer column with a float column that happens to contain whole
> numbers raises an error. For kernels, we would autocast in this circumstance,
> so it's a surprising UX that this doesn't work + I need to type coerce on my
> own for this.
> {code}
> library(arrow, warn.conflicts = FALSE)
> #> See arrow_info() for available features
> library(dplyr, warn.conflicts = FALSE)
> tab_int <- arrow_table(data.frame(let = letters, num = 1L:26L))
> tab_float <- arrow_table(data.frame(let = letters, num = as.double(1:26)))
> left_join(tab_int, tab_float) %>% collect()
> #> Error in `handle_csv_read_error()`:
> #> ! Invalid: Incompatible data types for corresponding join field keys:
> FieldRef.Name(num) of type int32 and FieldRef.Name(num) of type double
> {code}