[
https://issues.apache.org/jira/browse/ARROW-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236064#comment-17236064
]
Joris Van den Bossche edited comment on ARROW-10663 at 11/20/20, 11:10 AM:
---------------------------------------------------------------------------
The original PR adding it seems to have documented a different behaviour at the
time
(https://github.com/apache/arrow/pull/4235/files#diff-fc156499f9e4a75e0ef2d7e83b390f68e833cfa46c164cd0b4542af10a0337e2R35-R36)
("If null occurs in left, if null count in right is not 0, it returns true,
else returns null.").
-But what I don't understand is that this still seems to be tested that way for
the {{IsIn}} function:-
https://github.com/apache/arrow/blob/8b9f6b9d28b4524724e60fac589fb1a3552a32b4/cpp/src/arrow/compute/kernels/scalar_set_lookup_test.cc#L107-L111
Correction: the test I linked to was for only nulls in the left array, and so
that's also how it works in Python (my example above is about nulls in both
left and right array).
So in summary:
- The _default_ behaviour is to return True for nulls if a null is also present
in the right array, and otherwise to return null. So the doc comment in
{{api_scalar.h}} is wrong about the case where nulls in the right array are not
considered (skipped or not present).
- In addition, the keyword itself is still ignored (I suppose it was only
implemented for IndexIn, not IsIn).
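To make the summary concrete, here is a small pure-Python model of the default behaviour as I understand it from the tests and the example above. This is only a sketch, not Arrow code: the name {{is_in_observed}} is made up for illustration, and Python lists with {{None}} stand in for Arrow arrays with nulls.

```python
from typing import List, Optional, Sequence


def is_in_observed(
    values: Sequence[Optional[int]],
    value_set: Sequence[Optional[int]],
) -> List[Optional[bool]]:
    """Hypothetical model of the observed default IsIn result (not Arrow code)."""
    # Does the right-hand side (value_set) contain at least one null?
    right_has_null = any(v is None for v in value_set)
    # Non-null members of the right-hand side, used for ordinary lookups.
    right_values = {v for v in value_set if v is not None}

    result: List[Optional[bool]] = []
    for v in values:
        if v is None:
            # Null on the left: True if the right also has a null, else null.
            result.append(True if right_has_null else None)
        else:
            # Non-null on the left: plain membership test.
            result.append(v in right_values)
    return result


# Nulls in both left and right (the example from the issue description).
print(is_in_observed([1, 2, None], [1, None]))
# Nulls only in the left (the case covered by the linked C++ test).
print(is_in_observed([1, 2, None], [1]))
```

The second call is the case where the doc comment and the actual behaviour disagree: the comment says a null in the input gives false for IsIn, while this model (following the summary above) gives null.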
> [C++/Doc] The IsIn kernel ignores the skip_nulls option of SetLookupOptions
> ---------------------------------------------------------------------------
>
> Key: ARROW-10663
> URL: https://issues.apache.org/jira/browse/ARROW-10663
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
> Fix For: 3.0.0
>
>
> The C++ docs of {{SetLookupOptions}} have this explanation of the
> {{skip_nulls}} option:
> {code}
> /// Whether nulls in `value_set` count for lookup.
> ///
> /// If true, any null in `value_set` is ignored and nulls in the input
> /// produce null (IndexIn) or false (IsIn) values in the output.
> /// If false, any null in `value_set` is successfully matched in
> /// the input.
> bool skip_nulls;
> {code}
> (from
> https://github.com/apache/arrow/blob/8b9f6b9d28b4524724e60fac589fb1a3552a32b4/cpp/src/arrow/compute/api_scalar.h#L78-L84)
> However, for {{IsIn}} this explanation doesn't seem to hold in practice:
> {code}
> In [16]: arr = pa.array([1, 2, None])
> In [17]: pc.is_in(arr, value_set=pa.array([1, None]), skip_null=True)
> Out[17]:
> <pyarrow.lib.BooleanArray object at 0x7fcf666f9408>
> [
> true,
> false,
> true
> ]
> In [18]: pc.is_in(arr, value_set=pa.array([1, None]), skip_null=False)
> Out[18]:
> <pyarrow.lib.BooleanArray object at 0x7fcf666b13a8>
> [
> true,
> false,
> true
> ]
> {code}
> This documentation was added in https://github.com/apache/arrow/pull/7695
> (ARROW-8989).
> BTW, for "index_in", it works as documented:
> {code}
> In [19]: pc.index_in(arr, value_set=pa.array([1, None]), skip_null=True)
> Out[19]:
> <pyarrow.lib.Int32Array object at 0x7fcf666f04c8>
> [
> 0,
> null,
> null
> ]
> In [20]: pc.index_in(arr, value_set=pa.array([1, None]), skip_null=False)
> Out[20]:
> <pyarrow.lib.Int32Array object at 0x7fcf666f0ee8>
> [
> 0,
> null,
> 1
> ]
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)