[ 
https://issues.apache.org/jira/browse/ARROW-13573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-13573.
------------------------------------
    Resolution: Fixed

Issue resolved by pull request 11022
[https://github.com/apache/arrow/pull/11022]

> [C++] Support dictionaries directly in case_when kernel
> -------------------------------------------------------
>
>                 Key: ARROW-13573
>                 URL: https://issues.apache.org/jira/browse/ARROW-13573
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: David Li
>            Assignee: David Li
>            Priority: Major
>              Labels: kernel, pull-request-available, types
>             Fix For: 6.0.0
>
>          Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> case_when (and other similar kernels) currently dictionary-decode inputs, 
> then operate on the decoded values. This is both inefficient and unexpected. 
> We should instead operate directly on dictionary indices.
> Of course, this introduces more edge cases. If the dictionaries of inputs do 
> not match, we have the following choices:
>  # Raise an error.
>  # Unify the dictionaries.
>  # Use one of the dictionaries, and raise an error if an index of another 
> dictionary cannot be mapped to an index of the chosen dictionary.
>  # Use one of the dictionaries, and emit null if an index of another 
> dictionary cannot be mapped to an index of the chosen dictionary. (This is 
> what base dplyr if_else does with factors.)
> All of these options are reasonable, so we should introduce an options 
> struct. We can implement #3 and #4 at first (to cover R); #2 isn't strictly 
> necessary, as the user can unify the dictionaries manually first, but it may 
> be more efficient to do it this way. Similarly, #1 isn't strictly necessary.
> #3 and #4 are justifiable (beyond just "it's what R does") since users may 
> filter down disjoint dictionaries into a set of common values and then expect 
> to combine the remaining values with a kernel like case_when.
> This proposal follows the discussion on 
> [GitHub|https://github.com/apache/arrow/pull/10724#discussion_r682671015].
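
As an illustration of options #3 and #4 above, here is a minimal sketch in plain Python of a two-way selection that works on dictionary indices instead of decoded values. It is not Arrow's implementation and uses no Arrow API: the function name if_else_on_indices, the on_missing parameter, and the (indices, dictionary) tuple layout are all hypothetical, chosen only for this example. The left input's dictionary is assumed to be the chosen one.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical stand-in for a dictionary-encoded array: (indices, dictionary).
# A None index represents a null slot.
DictArray = Tuple[Sequence[Optional[int]], Sequence[str]]


def if_else_on_indices(
    cond: Sequence[Optional[bool]],
    left: DictArray,
    right: DictArray,
    on_missing: str = "error",
) -> Tuple[List[Optional[int]], List[str]]:
    """Select per row from left or right without decoding the values.

    The output reuses left's dictionary. A selected right-hand index whose
    value is absent from that dictionary raises (on_missing="error", option
    #3) or becomes null (on_missing="null", option #4).
    """
    if on_missing not in ("error", "null"):
        raise ValueError("on_missing must be 'error' or 'null'")

    left_idx, left_dict = left
    right_idx, right_dict = right

    # Build the index mapping once per dictionary, not once per row:
    # transpose[i] is the position of right_dict[i] in left_dict, or None.
    position = {value: i for i, value in enumerate(left_dict)}
    transpose = [position.get(value) for value in right_dict]

    out: List[Optional[int]] = []
    for c, l, r in zip(cond, left_idx, right_idx):
        if c is None:
            out.append(None)      # null condition gives a null result
        elif c:
            out.append(l)         # left already indexes the chosen dictionary
        elif r is None:
            out.append(None)      # null input slot stays null
        elif transpose[r] is None:
            # Only rows that are actually selected can trigger the error.
            if on_missing == "error":
                raise ValueError(
                    f"value {right_dict[r]!r} is not in the chosen dictionary"
                )
            out.append(None)
        else:
            out.append(transpose[r])
    return out, list(left_dict)
```

In this sketch the mapping is checked lazily, so a right-hand value that is missing from the chosen dictionary matters only when its row is selected. That matches the use case described above, where disjoint dictionaries have been filtered down to a set of common values before the inputs are combined.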



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
