nirandaperera commented on a change in pull request #10802:
URL: https://github.com/apache/arrow/pull/10802#discussion_r681375590



##########
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##########
@@ -2146,6 +2147,144 @@ class TakeMetaFunction : public MetaFunction {
   }
 };
 
+// ----------------------------------------------------------------------
+// DropNull Implementation
+
+Result<std::shared_ptr<arrow::Array>> GetNotNullIndices(
+    const std::shared_ptr<Array>& column, MemoryPool* memory_pool) {
+  std::shared_ptr<arrow::Array> indices;
+  arrow::NumericBuilder<arrow::Int32Type> builder(memory_pool);
+  builder.Reserve(column->length() - column->null_count());
+
+  std::vector<int32_t> values;
+  for (int64_t i = 0; i < column->length(); i++) {
+    if (column->IsValid(i)) {
+      builder.UnsafeAppend(static_cast<int32_t>(i));
+    }
+  }
+  RETURN_NOT_OK(builder.Finish(&indices));
+  return indices;
+}
+
+Result<std::shared_ptr<arrow::Array>> GetNotNullIndices(
+    const std::shared_ptr<ChunkedArray>& chunks, MemoryPool* memory_pool) {
+  std::shared_ptr<arrow::Array> indices;
+  arrow::NumericBuilder<arrow::Int32Type> builder(memory_pool);
+  builder.Reserve(chunks->length() - chunks->null_count());
+  int64_t relative_index = 0;
+  for (int64_t chunk_index = 0; chunk_index < chunks->num_chunks(); 
++chunk_index) {
+    auto column_chunk = chunks->chunk(chunk_index);
+    for (int64_t col_index = 0; col_index < column_chunk->length(); 
col_index++) {
+      if (column_chunk->IsValid(col_index)) {
+        builder.UnsafeAppend(static_cast<int32_t>(relative_index + col_index));
+      }
+    }
+    relative_index += column_chunk->length();
+  }
+  RETURN_NOT_OK(builder.Finish(&indices));
+  return indices;
+}
+
+Result<std::shared_ptr<RecordBatch>> DropNullRecordBatch(const RecordBatch& 
batch,

Review comment:
       I see... 
   so then another approach would be to do a `internal::CopyBitmap` and inplace 
`internal::BitmapAnd` on all validity buffers (non-nullptr) and do a selection. 
But I am not sure if that would be faster than the `std::set` approach. My gut 
feeling is it would be, because we dont need to do any searches on the 
`std::set` (and inserts, both of which are log N) before creating the take 
array, and std::set nodes are allocated dynamically (which could have cache 
inefficiencies). Bitmap approach works on contiguous memory and ops would get 
vectorized. 
   
   Ref:
   https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bitmap_ops.h




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to