mapleFU commented on PR #38784:
URL: https://github.com/apache/arrow/pull/38784#issuecomment-1824489153
After going through the test code, I understand why the regression happens.
Code path:
```
pyarrow.read_table
- pyarrow ParquetFile.read
-- C++ parquet::arrow::ReadTable
--- parquet::arrow::ReadColumn for all columns
```
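For reference, a minimal Python sketch of the entry point that exercises this path (the file name is a placeholder):
```
import pyarrow.parquet as pq

# A plain read_table() call reads every column of every row group
# through parquet::arrow::ReadTable / ReadColumn.
table = pq.read_table("data.parquet")

# Equivalent lower-level form going through ParquetFile.read():
pf = pq.ParquetFile("data.parquet")
table = pf.read()
```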
So finally the code goes to:
```
Status ReadColumn(int i, const std::vector<int>& row_groups, ColumnReader* reader,
                  std::shared_ptr<ChunkedArray>* out) {
  BEGIN_PARQUET_CATCH_EXCEPTIONS
  // TODO(wesm): This calculation doesn't make much sense when we have repeated
  // schema nodes
  int64_t records_to_read = 0;
  for (auto row_group : row_groups) {
    // Can throw exception
    records_to_read +=
        reader_->metadata()->RowGroup(row_group)->ColumnChunk(i)->num_values();
  }
  return reader->NextBatch(records_to_read, out);
  END_PARQUET_CATCH_EXCEPTIONS
}
```
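A rough Python equivalent of that loop, computed from the same file metadata (the file name and column index are placeholders):
```
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")
i = 0  # column index

# Sum num_values of column i over all row groups, like the C++ loop above.
records_to_read = sum(
    pf.metadata.row_group(rg).column(i).num_values
    for rg in range(pf.metadata.num_row_groups)
)
```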
The size of `records_to_read` is the total row count of the file. Then it goes to
`LeafReader::LoadBatch` and `LeafReader::BuildArray`.
`BuildArray` is trivial:
```
::arrow::Status BuildArray(int64_t length_upper_bound,
                           std::shared_ptr<::arrow::ChunkedArray>* out) final {
  *out = out_;
  return Status::OK();
}
```
So we focus on `LoadBatch`. The underlying reader is
`parquet::ByteArrayChunkedRecordReader`, which therefore gets called with an
extremely large batch size (the whole file's row count in a single `NextBatch` call).
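As an illustration only (not a fix for the C++ path above), `ParquetFile.iter_batches` lets a Python caller cap how many rows are decoded per batch instead of handing the reader the whole file's row count at once:
```
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")  # placeholder file name
for batch in pf.iter_batches(batch_size=64 * 1024):
    ...  # each RecordBatch holds at most 64K rows
```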