[GitHub] [arrow] benibus opened a new pull request, #35047: GH-34653: [CI][C++] Fix for arrow-dataset-file-json-test segfault on alpine-linux-cpp

via GitHub Tue, 11 Apr 2023 10:36:33 -0700


benibus opened a new pull request, #35047:
URL: https://github.com/apache/arrow/pull/35047

### What changes are included in this PR?

Increases the block size used in the `ScanWithParallelDecoding` test to
reduce the number of (potentially parallel) parsing/decoding jobs from 1000+ to
roughly 60, while increasing the runtime of each job. This should still satisfy
the purpose of the test without going completely over the top.

### Are these changes tested?

Yes, tested locally on the alpine docker image many times after successfully
reproducing the original issue.

### Are there any user-facing changes?

### Notes

This doesn't solve the underlying cause (although the testing parameters
were arguably far too unusual in the first place). However, I do believe that
I've identified the issue via a core dump.

The problem starts
[here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/json/reader.cc#L362-L369),
where a `MappingGenerator` gets stacked on top of a generator that applies
readahead. It seems that the underlying futures were completing very quickly,
resulting in `AddCallback` being called recursively many, many times - starting
[here](https://github.com/apache/arrow/blob/47a602dbd9b7b7f7720a5e62467e3e6c61712cf3/cpp/src/arrow/util/async_generator.h#L240).
This leads to a stack overflow under specific circumstances.

So, to fully guard against the problem, you'd probably want to change the
logic of `MappingGenerator` to use `TryAddCallback` + an inner loop to avoid
overflowing the stack. Not entirely sure if doing this would be worthwhile
though.
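
To illustrate the idea, here is a minimal, self-contained sketch. It does not use Arrow: `ToyFuture`, `DepthTracker`, `DrainRecursive` and `DrainLooped` are hypothetical names made up for this example, and the toy does not reproduce the real signatures of `Future::AddCallback` / `Future::TryAddCallback`. It only assumes the behaviour described above: a callback added to an already-finished future runs immediately on the caller's stack, whereas the "try" variant declines and lets the caller handle the result in a loop.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>

// Toy stand-in for a future holding an int. This is NOT arrow::Future.
class ToyFuture {
 public:
  using Callback = std::function<void(int)>;

  // Creates a future that has already completed with `value`.
  static ToyFuture Finished(int value) {
    ToyFuture fut;
    fut.value_ = value;
    return fut;
  }

  int result() const { return *value_; }

  // If already finished, runs `cb` immediately on the caller's stack.
  void AddCallback(Callback cb) {
    if (value_) {
      cb(*value_);
    } else {
      pending_ = std::move(cb);  // the toy never completes pending futures
    }
  }

  // If already finished, returns false and does NOT run `cb`.
  bool TryAddCallback(Callback cb) {
    if (value_) return false;
    pending_ = std::move(cb);
    return true;
  }

 private:
  std::optional<int> value_;
  Callback pending_;
};

// Records how deeply the drain functions are nested.
struct DepthTracker {
  std::size_t current = 0;
  std::size_t max = 0;
};

// Recursive style: every finished future adds another level of nesting.
void DrainRecursive(int remaining, DepthTracker& depth, long& sum) {
  if (remaining == 0) return;
  ++depth.current;
  depth.max = std::max(depth.max, depth.current);
  ToyFuture::Finished(remaining).AddCallback([&depth, &sum, remaining](int v) {
    sum += v;
    DrainRecursive(remaining - 1, depth, sum);
  });
  --depth.current;
}

// Loop style: finished futures are consumed inline, so the stack stays flat.
void DrainLooped(int remaining, DepthTracker& depth, long& sum) {
  ++depth.current;
  depth.max = std::max(depth.max, depth.current);
  while (remaining > 0) {
    ToyFuture fut = ToyFuture::Finished(remaining);
    const bool registered =
        fut.TryAddCallback([&depth, &sum, remaining](int v) {
          sum += v;
          DrainLooped(remaining - 1, depth, sum);
        });
    if (registered) break;  // still pending: the callback resumes the drain
    sum += fut.result();    // already finished: handle it here and loop
    --remaining;
  }
  --depth.current;
}
```

With 500 already-finished items, `DrainRecursive` reaches a nesting depth of 500, while `DrainLooped` stays at depth 1 and computes the same sum. A real change to `MappingGenerator` would also have to deal with lifetimes, threading and error propagation, none of which the toy covers.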

