timothydijamco commented on issue #45287:
URL: https://github.com/apache/arrow/issues/45287#issuecomment-2616738318

   Ahh I see, that makes sense.
   
   > How many columns and chunks does your reproducer have?
   
   My repro run scanned a dataset of 260 .parquet files, each with 1 row and 10,000 columns. Each file contains one row group, so the dataset contains `260 * 10,000 = 2,600,000` Parquet column chunks in total. I configured the scan with a 1-column projection, so I think it should only be reading Parquet data pages for 260 Parquet column chunks (one per file).
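
   For context, the 1-column projection was set on the scanner builder along these lines (a minimal sketch, not the exact repro code; the `dataset` handle and the column name `col_0` are placeholders):
   ```
     // Inside a function returning arrow::Status; `dataset` is an already-built
     // std::shared_ptr<arrow::dataset::Dataset> (placeholder for the repro's dataset).
     ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
     // Project a single column so only one Parquet column chunk per file is read.
     ARROW_RETURN_NOT_OK(scanner_builder->Project({"col_0"}));
     ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());
   ```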
   
   Looking at the "Arrow chunk" side of things, I'm not sure how many Arrow chunks the data materializes to. I also just remembered that in my repro I'm iterating over batches instead of accumulating the read data into a table, so in theory it shouldn't accumulate the overhead attached to Arrow column chunk objects:
   ```
     // Inside a function returning arrow::Status:
     ARROW_ASSIGN_OR_RAISE(auto record_batch_reader, scanner->ToRecordBatchReader());
     std::shared_ptr<arrow::RecordBatch> batch;
     // Stream batches one at a time; nothing is retained across iterations.
     while (true) {
       ARROW_RETURN_NOT_OK(record_batch_reader->ReadNext(&batch));
       if (batch == nullptr) {
         break;  // End of stream.
       }
     }
   ```
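
   (If it helps to quantify the Arrow-side chunk count, one way would be to materialize the scan into a table and inspect a column's `ChunkedArray`, though that is exactly the accumulation the loop above avoids; a minimal diagnostic sketch, assuming the same `scanner`:)
   ```
     // Diagnostic only: ToTable() accumulates all batches in memory.
     // Requires #include <iostream>.
     ARROW_ASSIGN_OR_RAISE(auto table, scanner->ToTable());
     // Each column of a Table is a ChunkedArray; num_chunks() reports how many
     // Arrow chunks that column materialized to.
     std::cout << table->column(0)->num_chunks() << " Arrow chunks" << std::endl;
   ```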

