rouault opened a new issue, #40653: URL: https://github.com/apache/arrow/issues/40653
### Describe the bug, including details regarding any error messages, version, and platform.

Destroying the ScannerRecordBatchReader object returned by arrow::dataset::Scanner::ToRecordBatchReader() is extremely slow when only a few rows have been read. The dataset is a 1 GB Parquet file with ~10 million rows and 77 row groups (https://overturemaps-us-west-2.s3.amazonaws.com/release/2024-03-12-alpha.0/theme%3Dbuildings/type%3Dbuilding/part-00000-4dfc75cd-2680-4d52-b5e0-f4cc9f36b267-c000.zstd.parquet). The cause is SerialIterator::~SerialIterator(), which iterates until the end of the dataset.

It would be desirable for the destruction of the batch reader not to trigger such a lengthy operation. thread_pool.h has the following comment, but implementing that is beyond my understanding of the deep internals of libarrow/libparquet:

```
/// Note: The iterator's destructor will run until the given generator is fully
/// exhausted. If you wish to abandon iteration before completion then the correct
/// approach is to use a stop token to cause the generator to exhaust early.
```

Explicitly invoking the Close() method on the record batch reader doesn't improve performance either.

This is the result of the analysis of https://github.com/OSGeo/gdal/issues/9497

Version: libarrow/libparquet from apache-arrow-15.0.0

### Component(s)

C++
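
To make the behaviour described in the thread_pool.h comment concrete, here is a minimal toy model that does not use libarrow at all. `DrainingIterator`, `ProducedWithoutStop` and `ProducedWithStop` are hypothetical names invented for this sketch, and a plain `std::atomic<bool>` stands in for a stop token. It only illustrates why a draining destructor is slow after a partial read and how a stop flag lets the generator exhaust early. It is not how the Arrow scanner is implemented, nor a fix for it.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>

// Hypothetical toy iterator (not an Arrow class): like the behaviour the
// thread_pool.h comment describes, its destructor keeps pulling from the
// generator until the generator reports exhaustion.
class DrainingIterator {
 public:
  explicit DrainingIterator(std::function<std::optional<int>()> gen)
      : gen_(std::move(gen)) {}

  // Drain whatever is left; this is the expensive part after a partial read.
  ~DrainingIterator() {
    while (gen_()) {
    }
  }

  std::optional<int> Next() { return gen_(); }

 private:
  std::function<std::optional<int>()> gen_;
};

// Consume `wanted` items out of `total`, then destroy the iterator.
// Returns how many items the generator had to produce overall.
std::size_t ProducedWithoutStop(std::size_t total, std::size_t wanted) {
  std::size_t produced = 0;
  {
    DrainingIterator it([&]() -> std::optional<int> {
      if (produced >= total) return std::nullopt;
      return static_cast<int>(produced++);
    });
    for (std::size_t i = 0; i < wanted; ++i) it.Next();
  }  // destructor runs here and drains the remaining items
  return produced;
}

// Same as above, but the generator checks a stop flag (a stand-in for a
// stop token) and reports exhaustion as soon as the flag is set.
std::size_t ProducedWithStop(std::size_t total, std::size_t wanted) {
  std::size_t produced = 0;
  std::atomic<bool> stop{false};
  {
    DrainingIterator it([&]() -> std::optional<int> {
      if (stop.load() || produced >= total) return std::nullopt;
      return static_cast<int>(produced++);
    });
    for (std::size_t i = 0; i < wanted; ++i) it.Next();
    stop.store(true);  // request early exhaustion before destruction
  }  // destructor sees an exhausted generator and returns immediately
  return produced;
}
```

With 77 items standing in for the 77 row groups, `ProducedWithoutStop(77, 3)` makes the generator produce all 77 items even though only 3 were consumed, while `ProducedWithStop(77, 3)` stops at 3.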
