[ https://issues.apache.org/jira/browse/ARROW-13573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou resolved ARROW-13573.
------------------------------------
Resolution: Fixed
Issue resolved by pull request 11022
[https://github.com/apache/arrow/pull/11022]
> [C++] Support dictionaries directly in case_when kernel
> -------------------------------------------------------
>
> Key: ARROW-13573
> URL: https://issues.apache.org/jira/browse/ARROW-13573
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: David Li
> Assignee: David Li
> Priority: Major
> Labels: kernel, pull-request-available, types
> Fix For: 6.0.0
>
> Time Spent: 4h 10m
> Remaining Estimate: 0h
>
> case_when (and other similar kernels) currently dictionary-decode inputs,
> then operate on the decoded values. This is both inefficient and unexpected.
> We should instead operate directly on dictionary indices.
> Of course, this introduces more edge cases. If the dictionaries of inputs do
> not match, we have the following choices:
> 1. Raise an error.
> 2. Unify the dictionaries.
> 3. Use one of the dictionaries, and raise an error if an index of another
>    dictionary cannot be mapped to an index of the chosen dictionary.
> 4. Use one of the dictionaries, and emit null if an index of another
>    dictionary cannot be mapped to an index of the chosen dictionary. (This is
>    what base dplyr if_else does with factors.)
> All of these options are reasonable, so we should introduce an options
> struct. We can implement #3 and #4 at first (to cover R); #2 isn't strictly
> necessary, as the user can unify the dictionaries manually first, though
> unifying inside the kernel may be more efficient. Similarly, #1 isn't
> strictly necessary.
> #3 and #4 are justifiable (beyond just "it's what R does") since users may
> filter down disjoint dictionaries into a set of common values and then expect
> to combine the remaining values with a kernel like case_when.
> See the original discussion on GitHub:
> https://github.com/apache/arrow/pull/10724#discussion_r682671015
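
To make the options in the description concrete, here is a minimal pure-Python sketch of working on dictionary indices instead of decoded values. It is not the Arrow C++ implementation and does not use any Arrow API: the function names (remap_indices, unify_dictionaries) and the on_missing parameter are hypothetical, chosen only for illustration. Dictionaries are modelled as plain lists of values, index arrays as lists of integers, and None stands in for null.

```python
from typing import Hashable, List, Optional, Sequence, Tuple


def remap_indices(
    indices: Sequence[Optional[int]],
    source_dict: Sequence[Hashable],
    target_dict: Sequence[Hashable],
    on_missing: str = "null",
) -> List[Optional[int]]:
    """Translate indices into source_dict into indices into target_dict.

    on_missing="error" corresponds to option #3 in the description,
    on_missing="null" to option #4. Both names are illustrative only.
    """
    if on_missing not in ("error", "null"):
        raise ValueError(f"unknown on_missing mode: {on_missing!r}")

    # Build the transposition table once per dictionary, not once per row.
    position = {value: i for i, value in enumerate(target_dict)}
    transpose = [position.get(value) for value in source_dict]

    out: List[Optional[int]] = []
    for idx in indices:
        if idx is None:  # an input null stays null
            out.append(None)
            continue
        mapped = transpose[idx]
        if mapped is None and on_missing == "error":
            raise ValueError(
                f"value {source_dict[idx]!r} is not in the chosen dictionary"
            )
        out.append(mapped)  # None here means "emit null" (option #4)
    return out


def unify_dictionaries(
    dicts: Sequence[Sequence[Hashable]],
) -> Tuple[List[Hashable], List[List[int]]]:
    """Option #2: merge dictionaries, returning the unified dictionary and
    one transposition table per input dictionary."""
    unified: List[Hashable] = []
    position = {}
    transposes: List[List[int]] = []
    for d in dicts:
        table = []
        for value in d:
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
            table.append(position[value])
        transposes.append(table)
    return unified, transposes


left_dict = ["a", "b", "c"]
right_dict = ["c", "a", "z"]
right_indices = [0, 1, 2, None]

# Option #4: "z" has no slot in left_dict, so that row becomes null.
remapped = remap_indices(right_indices, right_dict, left_dict, on_missing="null")

# Option #2: every value gets a slot in the unified dictionary.
unified, transposes = unify_dictionaries([left_dict, right_dict])
```

With these inputs, option #4 maps the "z" row to null, option #3 would raise on that same row, and option #2 appends "z" to the unified dictionary so that no row is lost. In all three cases the per-row work is a single table lookup, which is the efficiency argument for avoiding a full decode.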