[ https://issues.apache.org/jira/browse/ARROW-13573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394062#comment-17394062 ]

David Li edited comment on ARROW-13573 at 8/5/21, 2:30 PM:
-----------------------------------------------------------

Also, we can use the 'fast' approach for dictionaries: write into slices of a 
preallocated output, as already implemented for numeric inputs, rather than 
use the builder-based approach that ARROW-13222 takes for variable-width 
types. We'll want to support nested dictionaries too, though (lists of 
dictionaries and such).
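
To make the 'fast' approach concrete, here is a rough sketch of writing 
fixed-width dictionary indices into slices of a preallocated output. 
IndexOutput and CaseWhenIndicesInto are hypothetical names, not Arrow's kernel 
API, and the sketch assumes all cases already share one dictionary, so only 
the int32 indices need to be selected.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch, not Arrow's kernel API. Assumes every case already
// shares one dictionary, so only fixed-width int32 indices are selected.
struct IndexOutput {
  std::vector<int32_t> indices;  // preallocated, one slot per row
  std::vector<bool> valid;       // false marks a null row
};

// Fill rows [offset, offset + length) of a preallocated output. Each slot has
// a fixed width, so any slice can be filled without touching the others.
void CaseWhenIndicesInto(const std::vector<std::vector<bool>>& conds,
                         const std::vector<std::vector<int32_t>>& cases,
                         std::size_t offset, std::size_t length,
                         IndexOutput* out) {
  for (std::size_t row = offset; row < offset + length; ++row) {
    out->valid[row] = false;  // null unless some condition matches
    for (std::size_t c = 0; c < conds.size(); ++c) {
      if (conds[c][row]) {
        // The first true condition wins, as in case_when.
        out->indices[row] = cases[c][row];
        out->valid[row] = true;
        break;
      }
    }
  }
}
```

A caller would size IndexOutput to the full row count once, then call 
CaseWhenIndicesInto per slice. A builder-based kernel instead appends values 
one at a time, which variable-width types need because their output size is 
not known up front.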


was (Author: lidavidm):
Also, we can use the 'fast' approach for dictionaries (as opposed to the 
builder-based approach in ARROW-13222) though we'll want to support nested 
dictionaries too (lists of dictionaries and such).

> [C++] Support dictionaries directly in case_when kernel
> -------------------------------------------------------
>
>                 Key: ARROW-13573
>                 URL: https://issues.apache.org/jira/browse/ARROW-13573
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: David Li
>            Assignee: David Li
>            Priority: Major
>
> case_when (and other similar kernels) currently dictionary-decode inputs, 
> then operate on the decoded values. This is both inefficient and unexpected. 
> We should instead operate directly on dictionary indices.
> Of course, this introduces more edge cases. If the dictionaries of inputs do 
> not match, we have the following choices:
>  # Raise an error.
>  # Unify the dictionaries.
>  # Use one of the dictionaries, and raise an error if an index of another 
> dictionary cannot be mapped to an index of the chosen dictionary.
>  # Use one of the dictionaries, and emit null if an index of another 
> dictionary cannot be mapped to an index of the chosen dictionary. (This is 
> what base dplyr if_else does with factors.)
> All of these options are reasonable, so we should introduce an options 
> struct. We can implement #3 and #4 at first (to cover R); #2 isn't strictly 
> necessary, as the user can unify the dictionaries manually first, but it may 
> be more efficient to do it this way. Similarly, #1 isn't strictly necessary.
> #3 and #4 are justifiable (beyond just "it's what R does") since users may 
> filter down disjoint dictionaries into a set of common values and then expect 
> to combine the remaining values with a kernel like case_when.
> See the discussion on 
> [GitHub|https://github.com/apache/arrow/pull/10724#discussion_r682671015].
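
Options #3 and #4 in the description differ only in what happens when an 
index cannot be mapped, which a small sketch may make clearer. Everything 
here (UnmappedBehavior, BuildTransposeMap, RemapIndices, and the use of 
std::vector<std::string> as a dictionary) is a hypothetical stand-in for 
illustration, not Arrow's API or the proposed options struct.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical names throughout; this is not Arrow's API.
enum class UnmappedBehavior { kError, kEmitNull };  // options #3 and #4

using Dictionary = std::vector<std::string>;
using Indices = std::vector<std::optional<int32_t>>;  // nullopt = null slot

// For each entry of `other`, find its position in `chosen` (nullopt if the
// value does not occur in `chosen`).
Indices BuildTransposeMap(const Dictionary& chosen, const Dictionary& other) {
  std::unordered_map<std::string, int32_t> position;
  for (std::size_t i = 0; i < chosen.size(); ++i) {
    position.emplace(chosen[i], static_cast<int32_t>(i));
  }
  Indices map;
  map.reserve(other.size());
  for (const auto& value : other) {
    auto it = position.find(value);
    if (it == position.end()) {
      map.push_back(std::nullopt);
    } else {
      map.push_back(it->second);
    }
  }
  return map;
}

// Rewrite indices into `other` as indices into `chosen`, without ever
// materializing the decoded values row by row.
Indices RemapIndices(const Indices& indices, const Dictionary& chosen,
                     const Dictionary& other, UnmappedBehavior behavior) {
  const Indices map = BuildTransposeMap(chosen, other);
  Indices out;
  out.reserve(indices.size());
  for (const auto& index : indices) {
    if (!index) {  // input nulls stay null
      out.push_back(std::nullopt);
      continue;
    }
    const auto& mapped = map.at(static_cast<std::size_t>(*index));
    if (!mapped && behavior == UnmappedBehavior::kError) {
      throw std::invalid_argument("value not present in chosen dictionary");
    }
    out.push_back(mapped);  // nullopt here means "emit null" (option #4)
  }
  return out;
}
```

The dictionary values are compared once per dictionary entry rather than 
once per row, which is where the saving over decoding first would come from. 
Option #2 (unification) would build a merged dictionary and remap both 
inputs in the same way.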



