nirandaperera commented on a change in pull request #10802:
URL: https://github.com/apache/arrow/pull/10802#discussion_r681375590
##########
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##########
@@ -2146,6 +2147,144 @@ class TakeMetaFunction : public MetaFunction {
}
};
+// ----------------------------------------------------------------------
+// DropNull Implementation
+
+Result<std::shared_ptr<arrow::Array>> GetNotNullIndices(
+ const std::shared_ptr<Array>& column, MemoryPool* memory_pool) {
+ std::shared_ptr<arrow::Array> indices;
+ arrow::NumericBuilder<arrow::Int32Type> builder(memory_pool);
+ RETURN_NOT_OK(builder.Reserve(column->length() - column->null_count()));
+
+ for (int64_t i = 0; i < column->length(); i++) {
+ if (column->IsValid(i)) {
+ builder.UnsafeAppend(static_cast<int32_t>(i));
+ }
+ }
+ RETURN_NOT_OK(builder.Finish(&indices));
+ return indices;
+}
+
+Result<std::shared_ptr<arrow::Array>> GetNotNullIndices(
+ const std::shared_ptr<ChunkedArray>& chunks, MemoryPool* memory_pool) {
+ std::shared_ptr<arrow::Array> indices;
+ arrow::NumericBuilder<arrow::Int32Type> builder(memory_pool);
+ RETURN_NOT_OK(builder.Reserve(chunks->length() - chunks->null_count()));
+ int64_t relative_index = 0;
+ for (int64_t chunk_index = 0; chunk_index < chunks->num_chunks(); ++chunk_index) {
+ auto column_chunk = chunks->chunk(chunk_index);
+ for (int64_t col_index = 0; col_index < column_chunk->length(); col_index++) {
+ if (column_chunk->IsValid(col_index)) {
+ builder.UnsafeAppend(static_cast<int32_t>(relative_index + col_index));
+ }
+ }
+ relative_index += column_chunk->length();
+ }
+ RETURN_NOT_OK(builder.Finish(&indices));
+ return indices;
+}
+
+Result<std::shared_ptr<RecordBatch>> DropNullRecordBatch(const RecordBatch& batch,
Review comment:
I see...
so then another approach would be to do an `internal::CopyBitmap` and an in-place
`internal::BitmapAnd` over all the (non-nullptr) validity buffers, and then do a selection.
I am not sure whether that would be faster than the `std::set` approach, but my gut
feeling is that it would be: we don't need any searches or inserts on the `std::set`
(both of which are O(log N)) before creating the take array, and `std::set` nodes are
allocated dynamically, which can be cache-unfriendly. The bitmap approach works on
contiguous memory, and the ops would get vectorized.
Ref:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bitmap_ops.h
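Roughly what I have in mind (untested sketch, written from memory: `CombinedValidityBitmap` is just a placeholder name, and the exact `CopyBitmap`/`BitmapAnd` overloads should be double-checked against `bitmap_ops.h`):

```cpp
// Sketch: AND all per-column validity bitmaps of a RecordBatch into one buffer,
// which could then drive a single selection instead of building a std::set of indices.
// Assumes the helpers from arrow/util/bitmap_ops.h and arrow/result.h.
Result<std::shared_ptr<Buffer>> CombinedValidityBitmap(const RecordBatch& batch,
                                                       MemoryPool* pool) {
  const int64_t length = batch.num_rows();
  std::shared_ptr<Buffer> combined;  // stays nullptr while all columns seen so far are all-valid
  for (const auto& column : batch.columns()) {
    if (column->null_count() == 0 || column->null_bitmap() == nullptr) {
      continue;  // nothing to AND in for this column
    }
    const uint8_t* validity = column->null_bitmap_data();
    if (combined == nullptr) {
      // First column that has nulls: copy its bitmap so the rest can be ANDed in place.
      ARROW_ASSIGN_OR_RAISE(combined, arrow::internal::CopyBitmap(
                                          pool, validity, column->offset(), length));
    } else {
      // In-place AND of this column's validity bitmap into the accumulated one.
      arrow::internal::BitmapAnd(combined->data(), /*left_offset=*/0, validity,
                                 column->offset(), length, /*out_offset=*/0,
                                 combined->mutable_data());
    }
  }
  return combined;  // nullptr => no column had nulls, nothing to drop
}
```

The combined bitmap could then either be wrapped in a `BooleanArray` and passed to `Filter`, or scanned once for set bits to build the take indices, so the `std::set` would not be needed at all.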