[I] Lengthy destruction of ScannerRecordBatchReader [arrow]

via GitHub Mon, 18 Mar 2024 16:49:21 -0700


rouault opened a new issue, #40653:
URL: https://github.com/apache/arrow/issues/40653


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Destroying the ScannerRecordBatchReader object returned by 
arrow::dataset::Scanner::ToRecordBatchReader() takes an extremely long time on 
a 1 GB Parquet dataset with ~10 million rows and 77 row groups 
(https://overturemaps-us-west-2.s3.amazonaws.com/release/2024-03-12-alpha.0/theme%3Dbuildings/type%3Dbuilding/part-00000-4dfc75cd-2680-4d52-b5e0-f4cc9f36b267-c000.zstd.parquet), 
even when only a few rows have been read. The cause is 
SerialIterator::~SerialIterator(), which keeps iterating until the end of the 
dataset. It would be desirable for the destruction of the batch reader not to 
trigger such lengthy operations.
   
   thread_pool.h has the following comment, but implementing what it suggests 
is beyond my understanding of the deep internals of libarrow/libparquet:
   ```
     /// Note: The iterator's destructor will run until the given generator is fully
     /// exhausted. If you wish to abandon iteration before completion then the correct
     /// approach is to use a stop token to cause the generator to exhaust early.
   ```
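
To make the quoted comment concrete, here is a minimal standard-library model of the behavior it describes. It is not Arrow's implementation: `CountingGenerator`, `DrainingIterator` and `ItemsGenerated` are hypothetical names, and a plain `std::atomic<bool>` stands in for a stop token. It only illustrates why a destructor that drains its generator is expensive unless the generator is told to exhaust early.

```cpp
#include <atomic>
#include <optional>

// Hypothetical stand-in for a batch generator. It yields `total` items,
// unless `stop` has been set, in which case it reports exhaustion at once.
struct CountingGenerator {
  int total;
  const std::atomic<bool>* stop;  // plays the role of a stop token
  int* produced;                  // counts items actually generated

  std::optional<int> Next() {
    if (stop->load() || *produced >= total) return std::nullopt;
    return (*produced)++;
  }
};

// Hypothetical iterator whose destructor runs the generator to exhaustion,
// mirroring what the thread_pool.h comment says about the real iterator.
class DrainingIterator {
 public:
  explicit DrainingIterator(CountingGenerator gen) : gen_(gen) {}
  ~DrainingIterator() {
    while (gen_.Next().has_value()) {
      // Discard remaining items; this is the costly part.
    }
  }
  std::optional<int> Next() { return gen_.Next(); }

 private:
  CountingGenerator gen_;
};

// Reads `wanted` items, optionally requests a stop, destroys the iterator,
// and returns how many items the generator produced in total.
int ItemsGenerated(int total, int wanted, bool request_stop) {
  std::atomic<bool> stop{false};
  int produced = 0;
  {
    DrainingIterator it(CountingGenerator{total, &stop, &produced});
    for (int i = 0; i < wanted; ++i) it.Next();
    if (request_stop) stop.store(true);
  }  // ~DrainingIterator() runs here
  return produced;
}
```

In this model, reading 3 of 77 items and then destroying the iterator still generates all 77 items, whereas setting the stop flag first leaves the count at 3. Whether and how a stop token can be passed through the dataset scanner is exactly the open question of this issue.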
   
   Explicitly invoking the Close() method on the record batch reader doesn't 
improve performance either.
   
   This issue results from the analysis of 
https://github.com/OSGeo/gdal/issues/9497
   
   Version: libarrow/libparquet from apache-arrow-15.0.0
   
   ### Component(s)
   
   C++


