[jira] [Commented] (ARROW-16172) [C++] cast when reasonable for join keys

Michal Nowakiewicz (Jira) Wed, 13 Apr 2022 12:32:06 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521900#comment-17521900
 ]


Michal Nowakiewicz commented on ARROW-16172:
--------------------------------------------

I have two comments for this:
1) This is related to what we do in hash join right now for dictionary support. 
We allow to join a column with non dictionary type to a column with dictionary 
type, mapping one of them to the common representation before hashing and 
comparing. It seems to me that automatic type-casting for join key should be a 
part of the same solution/framework as the support for dictionaries.
2) There is a challenge related to Bloom filters here. When we push down Bloom 
filters, the probing of the filter happens in a different exec node than the 
hash join that generated Bloom filter. The hashes must be generated in the same 
way from the same data types in order for Bloom filter to work correctly. That 
means that if we type cast the key on the probe side we need to do it twice - 
once for computing hash for Bloom filter, once for doing hash table lookup. But 
the data type conversion may not be cheap and doing it twice can diminish the 
benefits of Bloom filter filtering. So the challenge is to think if we can do 
it once and keep the result until it is need the second time (as if there was a 
project at the beginning of the pipeline doing explicit type casting and 
perhaps even hash computation and storing result in an exec batch).

> [C++] cast when reasonable for join keys
> ----------------------------------------
>
>                 Key: ARROW-16172
>                 URL: https://issues.apache.org/jira/browse/ARROW-16172
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Jonathan Keane
>            Priority: Major
>
> Joining an integer column with a float column that happens to have whole 
> numbers errors. For kernels, we would autocast in this circumstance, so it's 
> a surprising UX that this doesn't work + I need to type coerce on my own for 
> this.
> {code}
> library(arrow, warn.conflicts = FALSE)
> #> See arrow_info() for available features
> library(dplyr, warn.conflicts = FALSE)
> tab_int <- arrow_table(data.frame(let = letters, num = 1L:26L))
> tab_float <- arrow_table(data.frame(let = letters, num = as.double(1:26)))
> left_join(tab_int, tab_float) %>% collect()
> #> Error in `handle_csv_read_error()`:
> #> ! Invalid: Incompatible data types for corresponding join field keys: 
> FieldRef.Name(num) of type int32 and FieldRef.Name(num) of type double
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Commented] (ARROW-16172) [C++] cast when reasonable for join keys

Reply via email to