mapleFU commented on PR #38784:
URL: https://github.com/apache/arrow/pull/38784#issuecomment-1824489153
After going through the test code, I understand why the regression happens.
Code path:
```
pyarrow.read_table
- pyarrow ParquetFile.read
-- C++ parquet::arrow::ReadTable
--- parquet::arrow::ReadColumn for all columns
```
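For reference, a minimal Python sketch of the entry point that exercises this path (the file name is a placeholder):
```
import pyarrow.parquet as pq

# A plain read_table() call reads every column of every row group
# through parquet::arrow::ReadTable / ReadColumn.
table = pq.read_table("data.parquet")

# Equivalent lower-level form going through ParquetFile.read():
pf = pq.ParquetFile("data.parquet")
table = pf.read()
```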
So finally the code goes to:
```
Status ReadColumn(int i, const std::vector<int>& row_groups, ColumnReader* reader,
                  std::shared_ptr<ChunkedArray>* out) {
  BEGIN_PARQUET_CATCH_EXCEPTIONS
  // TODO(wesm): This calculation doesn't make much sense when we have repeated
  // schema nodes
  int64_t records_to_read = 0;
  for (auto row_group : row_groups) {
    // Can throw exception
    records_to_read +=
        reader_->metadata()->RowGroup(row_group)->ColumnChunk(i)->num_values();
  }
  return reader->NextBatch(records_to_read, out);
  END_PARQUET_CATCH_EXCEPTIONS
}
```
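A rough Python equivalent of that loop, computed from the same file metadata (the file name and column index are placeholders):
```
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")
i = 0  # column index

# Sum num_values of column i over all row groups, like the C++ loop above.
records_to_read = sum(
    pf.metadata.row_group(rg).column(i).num_values
    for rg in range(pf.metadata.num_row_groups)
)
```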
The size of `records_to_read` is the total row count of the file. Then it goes to
`LeafReader::LoadBatch` and `LeafReader::BuildArray`.
`BuildArray` is trivial:
```
::arrow::Status BuildArray(int64_t length_upper_bound,
                           std::shared_ptr<::arrow::ChunkedArray>* out) final {
  *out = out_;
  return Status::OK();
}
```
So we focus on `LoadBatch`. The underlying reader is
`parquet::ByteArrayChunkedRecordReader`, which therefore gets called with an
extremely large batch size (the whole file's row count in a single `NextBatch` call).
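As an illustration only (not a fix for the C++ path above), `ParquetFile.iter_batches` lets a Python caller cap how many rows are decoded per batch instead of handing the reader the whole file's row count at once:
```
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")  # placeholder file name
for batch in pf.iter_batches(batch_size=64 * 1024):
    ...  # each RecordBatch holds at most 64K rows
```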