edponce commented on a change in pull request #10896:
URL: https://github.com/apache/arrow/pull/10896#discussion_r685586159



##########
File path: cpp/src/arrow/compute/kernels/scalar_validity.cc
##########
@@ -76,11 +79,32 @@ struct IsInfOperator {
 
 struct IsNullOperator {
   static Status Call(KernelContext* ctx, const Scalar& in, Scalar* out) {
-    checked_cast<BooleanScalar*>(out)->value = !in.is_valid;
+    auto options = OptionsWrapper<NanNullOptions>::Get(ctx);
+    bool* out_value = &checked_cast<BooleanScalar*>(out)->value;
+    if (in.is_valid) {
+      switch (in.type->id()) {
+        case Type::FLOAT:
+          *out_value = options.nan_is_null &&
+                       std::isnan(internal::UnboxScalar<FloatType>::Unbox(in));
+          break;
+        case Type::DOUBLE:
+          *out_value = options.nan_is_null &&
+                       
std::isnan(internal::UnboxScalar<DoubleType>::Unbox(in));
+          break;
+        default:
+          *out_value = false;
+      }
+    } else {
+      *out_value = true;
+    }
+
     return Status::OK();
   }
 
   static Status Call(KernelContext* ctx, const ArrayData& arr, ArrayData* out) 
{
+    // TODO: Is `options` needed for detect nulls? Which is the better way to
+    // handle is_null for ArrayData
+    auto options = OptionsWrapper<NanNullOptions>::Get(ctx);

Review comment:
       **Note the check for `NaN` values only applies to floating-point types 
and when `nan_is_null` is `true`, other types/cases can use the logic as it 
was.**
   
   In the case of `ArrayData` there are 3 scenarios:
   a) arr.GetNullCount() == arr.length  // All data is null
   b) arr.GetNullCount() == 0               // No data is null
   c) arr.MayHaveNulls() == true         // Some data is null
   
   IIUC, the null bitmap of the input `ArrayData` is not guaranteed to be 
consistent with the data ([an `ArrayData` can be 
malformed](https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/data.h#L226-L230)
 bc buffer values can be modified directly). Scenarios *(a)-(b)* invoke 
`arr.GetNullCount()` which iterates through all the `arr` values and updates 
the null bitmap accordingly.
   
   Given that scenarios *(b)-(c)* are the common case and the array data has to 
be traversed to identify the `NaN` values, (as an optimization) I suggest to 
not check the null count at all. Nevertheless, only check for `NaN`s in 
non-null indices.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to