[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35036: GH-34315: [C++] Correct is_null kernel for Union and RunEndEncoded logical nulls

via GitHub Wed, 31 May 2023 08:07:42 -0700


jorisvandenbossche commented on code in PR #35036:
URL: https://github.com/apache/arrow/pull/35036#discussion_r1211872176



##########
cpp/src/arrow/compute/kernels/scalar_validity.cc:
##########
@@ -82,6 +84,72 @@ static void SetNanBits(const ArraySpan& arr, uint8_t* out_bitmap, int64_t out_of
   }
 }
 
+static void SetSparseUnionLogicalNullBits(const ArraySpan& span, uint8_t* out_bitmap,
+                                          int64_t out_offset) {
+  const auto* sparse_union_type =
+      ::arrow::internal::checked_cast<const SparseUnionType*>(span.type);
+  DCHECK_LE(span.child_data.size(), 128);
+
+  const int8_t* types = span.GetValues<int8_t>(1);  // NOLINT
+  for (int64_t i = 0; i < span.length; i++) {
+    const int8_t child_id = sparse_union_type->child_ids()[types[i]];
+    if (span.child_data[child_id].IsNull(i + span.offset)) {
+      bit_util::SetBit(out_bitmap, i + out_offset);
+    }
+  }
+}
+
+static void SetDenseUnionLogicalNullBits(const ArraySpan& span, uint8_t* out_bitmap,
+                                         int64_t out_offset) {
+  const auto* dense_union_type =
+      ::arrow::internal::checked_cast<const DenseUnionType*>(span.type);
+  DCHECK_LE(span.child_data.size(), 128);
+
+  const int8_t* types = span.GetValues<int8_t>(1);      // NOLINT
+  const int32_t* offsets = span.GetValues<int32_t>(2);  // NOLINT
+  for (int64_t i = 0; i < span.length; i++) {
+    const int8_t child_id = dense_union_type->child_ids()[types[i]];
+    const int32_t offset = offsets[i];
+    if (span.child_data[child_id].IsNull(offset)) {
+      bit_util::SetBit(out_bitmap, i + out_offset);
+    }
+  }
+}
+
+template <typename RunEndCType>
+void SetREELogicalNullBits(const ArraySpan& span, uint8_t* out_bitmap,
+                           int64_t out_offset) {
+  const auto& values = arrow::ree_util::ValuesArray(span);
+  DCHECK(!is_nested(values.type->id()));

Review Comment:
   > you could use another strategy:
   
   That would indeed be a good alternative, and it would be more robust to 
whatever type is used for the REE values. I did a quick benchmark comparing 
both strategies in Python:
   
   ```
   In [1]: import numpy as np; import pyarrow as pa; import pyarrow.compute as pc
   
   In [2]: run_lengths = np.random.randint(1, 10, 100_000)
   
   In [3]: run_values = [1, 2, 3, 4, None] * 20000
   
   In [4]: arr = pa.RunEndEncodedArray.from_arrays(run_lengths.cumsum(), run_values)
   
   In [5]: res1 = pc.is_null(arr)
   
   In [6]: res2 = pc.run_end_decode(pa.RunEndEncodedArray.from_arrays(np.asarray(arr.run_ends), pc.is_null(arr.values)))
   
   In [7]: res1.equals(res2)
   Out[7]: True
   
   In [8]: %timeit pc.is_null(arr)
   309 µs ± 843 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
   
   In [9]: %timeit pc.run_end_decode(pa.RunEndEncodedArray.from_arrays(np.asarray(arr.run_ends), pc.is_null(arr.values)))
   1.07 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
   ```
   
   This was run on this branch (in release mode), so `pc.is_null` uses this 
PR's implementation, and the other expression is the Python equivalent of what 
you propose (IIUC). 
   The alternative seems significantly slower (roughly 3.5x in this run), 
although I don't know how much of that is due to the overhead of going through 
Python several times.
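
To make the two strategies concrete without depending on a particular Arrow build, here is a minimal pure-Python sketch. It models a run-end encoded array as two plain lists (`run_ends`, `run_values`); the function names `is_null_direct` and `is_null_via_decode` are made up for illustration and are not Arrow APIs, and the code says nothing about the relative performance of the real kernels.

```python
from itertools import accumulate


def is_null_direct(run_ends, run_values):
    """Strategy 1: walk the runs once and mark every logical slot of a null run."""
    out = [False] * (run_ends[-1] if run_ends else 0)
    start = 0
    for end, value in zip(run_ends, run_values):
        if value is None:
            for i in range(start, end):
                out[i] = True
        start = end
    return out


def is_null_via_decode(run_ends, run_values):
    """Strategy 2: compute is_null on the run values only, then run-end decode."""
    null_values = [v is None for v in run_values]  # one flag per run
    out = []
    start = 0
    for end, flag in zip(run_ends, null_values):
        out.extend([flag] * (end - start))  # expand the flag to the run's length
        start = end
    return out


run_lengths = [2, 1, 3, 2]
run_ends = list(accumulate(run_lengths))  # [2, 3, 6, 8]
run_values = [1, None, 3, None]

direct = is_null_direct(run_ends, run_values)
decoded = is_null_via_decode(run_ends, run_values)
assert direct == decoded
print(direct)  # [False, False, True, False, False, False, True, True]
```

The second function never inspects the type of the run values beyond asking "is this entry null?", which is why that strategy does not care whether the values are nested.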



