[ 
https://issues.apache.org/jira/browse/ARROW-12554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-12554:
--------------------------------------

    Assignee: Antoine Pitrou

> Allow duplicates in the value_set for compute::is_in  
> ------------------------------------------------------
>
>                 Key: ARROW-12554
>                 URL: https://issues.apache.org/jira/browse/ARROW-12554
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>    Affects Versions: 4.0.0
>            Reporter: niranda perera
>            Assignee: Antoine Pitrou
>            Priority: Major
>             Fix For: 4.0.1
>
>
> In the Arrow release-4.0.0 branch, the `compute::is_in` operation rejects 
> duplicate values in the `value_set` [1]. This was not the case in Arrow 2.0.
>  
> I was wondering whether this strict restriction is required. Ultimately, a 
> hash set is created from the `value_set` values, and there is no harm in 
> having duplicates while doing so, is there?
> PS: I understand that the parameter name "value_set" suggests that the values 
> need to be unique, but from a usability perspective this could be relaxed, 
> IMO. E.g. pandas `isin` [2].
>  
>  
> [1] 
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc#L53]
> [2] [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html]
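
To illustrate the point made in the description, here is a minimal sketch in plain Python of a membership test that first builds a hash set from the lookup values. It is not Arrow's implementation: the `is_in` function below is a hypothetical stand-in that only mirrors the name of the kernel, and it assumes the lookup values are hashable.

```python
def is_in(values, value_set):
    """Hypothetical stand-in: one boolean per element of `values`."""
    # Building the hash set collapses any duplicates in value_set,
    # so they cannot change the membership results.
    lookup = set(value_set)
    return [v in lookup for v in values]


values = [1, 2, 3, 4]

# Same lookup values, once with duplicates and once without.
with_duplicates = is_in(values, [2, 2, 4, 4, 4])
deduplicated = is_in(values, [2, 4])

print(with_duplicates == deduplicated)
```

Under these assumptions both calls produce the same mask, which is the behaviour the issue asks `compute::is_in` to allow.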



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
