DJ-Greenwood commented on issue #39920:
URL: https://github.com/apache/arrow/issues/39920#issuecomment-1925030988
I asked Chat GPT about this and this is what I got back.
this is an open issue for Pyarrow help me fixit:
diff --git a/cpp/src/arrow/dataset/scanner.cc
b/cpp/src/arrow/dataset/scanner.cc
index 18981d145..87e1b2c92 100644
--- a/cpp/src/arrow/dataset/scanner.cc
+++ b/cpp/src/arrow/dataset/scanner.cc
@@ -466,7 +466,7 @@ Result<EnumeratedRecordBatchGenerator>
AsyncScanner::ScanBatchesUnorderedAsync(
std::shared_ptr<void> stop_producing{
nullptr, [plan, exec_context](...) {
bool not_finished_yet = plan->finished().TryAddCallback(
- [&plan, &exec_context] { return [plan, exec_context](const
Status&) {}; });
+ [plan, exec_context] { return [plan, exec_context](const
Status&) {}; });
if (not_finished_yet) {
plan->StopProducing();
ChatGPT
The diff snippet you've shared appears to be a modification in the Arrow C++
library, specifically within the scanner.cc file of the Arrow dataset module.
This modification changes how a lambda function captures plan and exec_context
variables in the context of an asynchronous operation.
In the original code:
cpp
Copy code
[&plan, &exec_context] { return [plan, exec_context](const Status&) {}; });
plan and exec_context are captured by reference in the outer lambda and by
copy in the inner lambda.
In the modified code:
cpp
Copy code
[plan, exec_context] { return [plan, exec_context](const Status&) {}; });
Both plan and exec_context are captured by copy in the outer lambda, which
then captures them by copy again in the inner lambda.
The issue with the original code could be related to the lifetime of the
captured references. Capturing by reference in an asynchronous context can be
dangerous if the referenced objects (plan and exec_context) go out of scope
before the lambda is executed, leading to undefined behavior.
The modification ensures that plan and exec_context are captured by copy,
extending their lifetimes for the duration of the lambda's existence. This is
safer in asynchronous code, preventing potential dangling references.
Possible Issue
The modification aims to resolve a potential issue with dangling references
by ensuring plan and exec_context are captured by value, thus owned by the
lambda and ensuring they remain valid as long as the lambda exists. However,
capturing by value can increase memory usage if these objects are large or
expensive to copy. In this specific case, since both plan and exec_context are
pointers (or shared pointers), the overhead is minimal, and the safety benefits
outweigh the minor increase in reference count.
Fix Explanation
By changing the capture list of the lambda from [&plan, &exec_context] to
[plan, exec_context], you ensure that the shared pointers plan and exec_context
are copied into the lambda, increasing their reference count. This prevents the
objects they point to from being destroyed even if the original shared pointers
go out of scope, which is critical for asynchronous execution where the
completion of tasks can outlive the scope in which they were initiated.
Applying the Fix
If you're working on a fork of the Arrow project or have a local clone, you
can apply this patch by saving it to a file and using the git apply command, or
directly modifying the scanner.cc file as shown in your diff snippet, then
rebuilding the project.
Make sure to run Arrow's unit tests to verify that this change doesn't
introduce any regressions. If you're contributing this fix back to the project,
consider opening a pull request with a clear explanation of the problem and how
your change resolves it, including any relevant details about why capturing by
value is necessary in this asynchronous context.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]