This is an automated email from the ASF dual-hosted git repository.

westonpace pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/main by this push:
     new 5e8db3156c GH-34653: [CI][C++] Fix for arrow-dataset-file-json-test 
segfault on alpine-linux-cpp (#35047)
5e8db3156c is described below

commit 5e8db3156c733a31e196683011db113e76ce6a32
Author: Ben Harkins <[email protected]>
AuthorDate: Wed Apr 12 09:01:29 2023 -0400

    GH-34653: [CI][C++] Fix for arrow-dataset-file-json-test segfault on 
alpine-linux-cpp (#35047)
    
    ### What changes are included in this PR?
    
    Increases the block size used in the `ScanWithParallelDecoding` test to 
reduce the number of (potentially parallel) parsing/decoding jobs from 1000+ to 
roughly 60 while increasing the runtime of each job. This should still satisfy 
the purpose of test without going completely over the top.
    
    ### Are these changes tested?
    
    Yes, tested locally on the alpine docker image many times after 
successfully reproducing the original issue.
    
    ### Are there any user-facing changes?
    
    No
    
    ### Notes
    
    This doesn't solve the underlying cause (although the testing parameters 
were arguably far too unusual in the first place), however I do believe that 
I've identified the issue via a core dump.
    
    The problem starts 
[here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/json/reader.cc#L362-L369),
 where a `MappingGenerator` gets stacked on top of a generator that applies 
readahead. It seems that the underlying futures were completing very quickly, 
resulting in `AddCallback` being called recursively many, many times - starting 
[here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/util/async_g
 [...]
    
    So, to fully guard against the problem, you'd probably want to change the 
logic of `MappingGenerator` to use `TryAddCallback` + an inner loop to avoid 
overflowing the stack. Not entirely sure if doing this would be worthwhile 
though.
    * Closes: #34653
    
    Authored-by: benibus <[email protected]>
    Signed-off-by: Weston Pace <[email protected]>
---
 cpp/src/arrow/dataset/file_json_test.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/cpp/src/arrow/dataset/file_json_test.cc 
b/cpp/src/arrow/dataset/file_json_test.cc
index f3ea1941da..b34b5cddf1 100644
--- a/cpp/src/arrow/dataset/file_json_test.cc
+++ b/cpp/src/arrow/dataset/file_json_test.cc
@@ -233,7 +233,8 @@ class JsonScanMixin {
     // inter-fragment parallelism (when threading is enabled).
     JsonFragmentScanOptions json_options;
     json_options.read_options.use_threads = true;
-    json_options.read_options.block_size = 256;
+    // Should amount to roughly 60 blocks per fragment.
+    json_options.read_options.block_size = 1 << 13;
     this_->SetJsonOptions(std::move(json_options));
     this_->TestScan();
   }

Reply via email to