viirya commented on code in PR #3930:
URL: https://github.com/apache/datafusion-comet/pull/3930#discussion_r3070656227

##########
native/core/src/execution/batch_stash.rs:
##########
@@ -0,0 +1,121 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Global registry for passing RecordBatch values between native execution contexts
+//! via opaque u64 handles, without Arrow FFI serialization.
+
+use arrow::record_batch::RecordBatch;
+use once_cell::sync::Lazy;
+use std::collections::HashMap;
+use std::sync::atomic::{AtomicU64, Ordering};
+use std::sync::Mutex;
+
+/// Counter for generating unique handles.
+static NEXT_HANDLE: AtomicU64 = AtomicU64::new(1);
+
+/// Global stash mapping handles to RecordBatch values.
+static STASH: Lazy<Mutex<HashMap<u64, RecordBatch>>> = Lazy::new(|| Mutex::new(HashMap::new()));
+
+/// Store a RecordBatch in the global stash and return a unique handle.
+pub(crate) fn stash(batch: RecordBatch) -> u64 {
+    let handle = NEXT_HANDLE.fetch_add(1, Ordering::Relaxed);
+    STASH
+        .lock()
+        .expect("batch_stash lock poisoned")

Review Comment:
   Perhaps handle the poisoned lock instead of panicking? For example, return an error, or recover the guard with `lock().unwrap_or_else(|e| e.into_inner())`.

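To make the two alternatives in the comment above concrete, here is a minimal sketch using only the standard library. It is not the PR's code: `Batch` is a stand-in for `RecordBatch`, and `stash_checked` and `lock_recovering` are hypothetical names chosen for illustration. The first function surfaces a poisoned lock as an error the caller can propagate; the second recovers the guard and continues.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, MutexGuard};

/// Stand-in for arrow's `RecordBatch` (assumption, keeps the sketch dependency-free).
type Batch = Vec<i32>;
/// Same shape as the stash in the diff, minus the global `Lazy` wrapper.
type Stash = Mutex<HashMap<u64, Batch>>;

/// Option A (hypothetical helper): report poisoning as an error instead of panicking.
fn stash_checked(stash: &Stash, handle: u64, batch: Batch) -> Result<(), String> {
    let mut guard = stash
        .lock()
        .map_err(|e| format!("batch_stash lock poisoned: {e}"))?;
    guard.insert(handle, batch);
    Ok(())
}

/// Option B (hypothetical helper): ignore poisoning and recover the guard.
/// The map may reflect a partially completed update made by the thread that panicked.
fn lock_recovering(stash: &Stash) -> MutexGuard<'_, HashMap<u64, Batch>> {
    stash.lock().unwrap_or_else(|e| e.into_inner())
}
```

Recovering is usually reasonable for a plain map of handles, where a panic mid-insert cannot leave the map structurally broken. Returning an error is the more conservative choice if callers should learn that another thread panicked.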
##########
native/core/src/execution/operators/scan.rs:
##########
@@ -532,6 +574,21 @@ impl Stream for ScanStream<'_> {
                 let maybe_batch = self.build_record_batch(columns, *num_rows);
                 Poll::Ready(Some(maybe_batch))
             }
+            InputBatch::Complete(batch) => {
+                self.baseline_metrics.record_output(batch.num_rows());
+                let columns = batch.columns();
+                let num_rows = batch.num_rows();
+                if columns.len() == self.schema.fields().len() {
+                    // Column counts match. Use build_record_batch to handle any
+                    // type differences (e.g., timestamp timezone casting).
+                    let maybe_batch = self.build_record_batch(columns, num_rows);

Review Comment:
   If the batch is already complete, why does it still need to go through `build_record_batch`?
   
   ```
   /// A complete RecordBatch retrieved from the BatchStash. Bypasses
   /// `build_record_batch` since the batch is already fully formed.
   Complete(RecordBatch),
   ```

##########
native/core/src/execution/operators/scan.rs:
##########
@@ -532,6 +574,21 @@ impl Stream for ScanStream<'_> {
                 let maybe_batch = self.build_record_batch(columns, *num_rows);
                 Poll::Ready(Some(maybe_batch))
             }
+            InputBatch::Complete(batch) => {
+                self.baseline_metrics.record_output(batch.num_rows());
+                let columns = batch.columns();
+                let num_rows = batch.num_rows();
+                if columns.len() == self.schema.fields().len() {
+                    // Column counts match. Use build_record_batch to handle any
+                    // type differences (e.g., timestamp timezone casting).
+                    let maybe_batch = self.build_record_batch(columns, num_rows);

Review Comment:
   Also, in shuffle_scan.rs the Complete batch is returned directly, so the two code paths look inconsistent.

##########
native/core/src/execution/batch_stash.rs:
##########
@@ -0,0 +1,121 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Global registry for passing RecordBatch values between native execution contexts
+//! via opaque u64 handles, without Arrow FFI serialization.
+
+use arrow::record_batch::RecordBatch;
+use once_cell::sync::Lazy;
+use std::collections::HashMap;
+use std::sync::atomic::{AtomicU64, Ordering};
+use std::sync::Mutex;
+
+/// Counter for generating unique handles.
+static NEXT_HANDLE: AtomicU64 = AtomicU64::new(1);
+
+/// Global stash mapping handles to RecordBatch values.
+static STASH: Lazy<Mutex<HashMap<u64, RecordBatch>>> = Lazy::new(|| Mutex::new(HashMap::new()));

Review Comment:
   Does this have to be global? Is there any risk of leaks? Do we need some cleanup to remove the contents?

##########
native/core/src/execution/operators/scan.rs:
##########
@@ -532,6 +574,21 @@ impl Stream for ScanStream<'_> {
                 let maybe_batch = self.build_record_batch(columns, *num_rows);
                 Poll::Ready(Some(maybe_batch))
             }
+            InputBatch::Complete(batch) => {
+                self.baseline_metrics.record_output(batch.num_rows());
+                let columns = batch.columns();
+                let num_rows = batch.num_rows();
+                if columns.len() == self.schema.fields().len() {
+                    // Column counts match. Use build_record_batch to handle any
+                    // type differences (e.g., timestamp timezone casting).
+                    let maybe_batch = self.build_record_batch(columns, num_rows);
+                    Poll::Ready(Some(maybe_batch))
+                } else {
+                    // Column count mismatch (e.g., empty schema scan).
+                    // Return the stashed batch as-is since it's already valid.
+                    Poll::Ready(Some(Ok(batch.clone())))

Review Comment:
   Semantically, it might be clearer to `take` the batch here rather than `clone` it.

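The `take` versus `clone` distinction raised in the last comment can be illustrated with a small standalone sketch. It does not reflect how `ScanStream` actually stores its input: `Batch` is a stand-in for `RecordBatch`, and `Slot`, `peek_cloned`, and `take` are hypothetical names. The point is only that `take` moves the value out and leaves the holder empty, which makes the hand-off of ownership explicit, whereas `clone` leaves a second copy behind.

```rust
/// Stand-in for arrow's `RecordBatch` (assumption, keeps the sketch dependency-free).
#[derive(Debug, Clone, PartialEq)]
struct Batch(Vec<i32>);

/// Hypothetical holder for the batch currently waiting to be emitted.
#[derive(Debug, Default)]
struct Slot {
    current: Option<Batch>,
}

impl Slot {
    /// `clone` semantics: the caller gets a copy and the slot still holds the batch.
    fn peek_cloned(&self) -> Option<Batch> {
        self.current.clone()
    }

    /// `take` semantics: the batch is moved out and the slot becomes `None`,
    /// so the same batch cannot be emitted twice by accident.
    fn take(&mut self) -> Option<Batch> {
        self.current.take()
    }
}
```

Calling `take` requires `&mut self`, so it only applies where the stream has mutable access to the stored batch; whether that holds at the call site in the PR is not something this sketch can confirm.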
