ei-grad opened a new issue, #35331:
URL: https://github.com/apache/arrow/issues/35331
### Describe the enhancement requested
## Summary
Currently, the `pyarrow.parquet.RowGroupMetaData` class does not expose the
`sorting_columns` information available in the Parquet format's `RowGroup`
struct. This information is useful for users who need to understand the local
sorting order of columns within each RowGroup. It would be beneficial to expose
this information in the `RowGroupMetaData` class.
## Details
The Parquet format includes an optional `sorting_columns` field in the
`RowGroup` struct, which stores information about the sorting order of columns
within the RowGroup. This information is defined in the `SortingColumn` struct
in the `parquet.thrift` file:
```
struct SortingColumn {
  1: required i32 column_idx;
  2: required bool descending;
  3: required bool nulls_first;
}
```
In the `RowGroup` struct, the `sorting_columns` field is defined as follows (fields after it are omitted here):
```
struct RowGroup {
  1: required list<ColumnChunk> columns;
  2: required i64 total_byte_size;
  3: required i64 num_rows;
  4: optional list<SortingColumn> sorting_columns;
}
```
However, the `pyarrow.parquet.RowGroupMetaData` class does not expose this
information. As a result, users cannot access the local sorting information of
columns within RowGroups.
## Proposal
I propose exposing the `sorting_columns` information on the `RowGroupMetaData`
class, either as a method such as `get_sorting_columns()` or as a property such
as `sorting_columns`. For each sorting column, the output should include the
column index, the sort direction (ascending or descending), and whether null
values sort first or last.
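
A minimal sketch of what the returned value could look like, assuming one record per sorting column that mirrors the Thrift struct. The `SortingColumn` dataclass and the `describe` helper are hypothetical names used for illustration only; they are not existing pyarrow API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SortingColumn:
    """Hypothetical Python mirror of the Thrift SortingColumn struct."""
    column_idx: int            # ordinal position of the column in the row group
    descending: bool = False   # True if the column is sorted in descending order
    nulls_first: bool = False  # True if nulls come before non-null values


def describe(sorting_columns, column_names):
    """Render sorting columns as readable strings (illustrative helper)."""
    parts = []
    for sc in sorting_columns:
        direction = "DESC" if sc.descending else "ASC"
        nulls = "NULLS FIRST" if sc.nulls_first else "NULLS LAST"
        parts.append(f"{column_names[sc.column_idx]} {direction} {nulls}")
    return parts


# What a `sorting_columns` property might return for a row group sorted by
# column 0 ascending, then by column 2 descending with nulls first.
sorting_columns = (
    SortingColumn(column_idx=0),
    SortingColumn(column_idx=2, descending=True, nulls_first=True),
)
print(describe(sorting_columns, ["ts", "user", "score"]))
```

Returning a tuple of small immutable records keeps the order of the sort keys explicit, which matters because the first entry is the primary sort key.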
## Use Case
Users working with sorted Parquet files benefit from knowing the local sort
order of columns within each RowGroup. This is particularly useful for
operations on large datasets that can exploit sort order, such as range
queries, filtering, or merging.
Exposing the `sorting_columns` information in the `RowGroupMetaData` class
would let users rely on the recorded sort order instead of re-deriving it from
the data.
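
As a rough illustration of the range-query case, the sketch below uses sort metadata to decide whether a lookup within one row group can use binary search instead of a full scan. It is plain Python with hypothetical names (`SortingColumn`, `range_query`) and in-memory lists standing in for decoded column values; it is not pyarrow code and it assumes the column contains no nulls:

```python
from bisect import bisect_left, bisect_right
from collections import namedtuple

# Hypothetical stand-in for the metadata this issue asks pyarrow to expose.
SortingColumn = namedtuple("SortingColumn", "column_idx descending nulls_first")


def range_query(values, column_idx, sorting_columns, lo, hi):
    """Return the values v with lo <= v <= hi from one row group's column.

    Binary search is used only when the column is the leading sort key and is
    ascending; otherwise the function falls back to a linear scan.
    """
    leading = sorting_columns[0] if sorting_columns else None
    if leading and leading.column_idx == column_idx and not leading.descending:
        return values[bisect_left(values, lo):bisect_right(values, hi)]
    return [v for v in values if lo <= v <= hi]


sorted_meta = [SortingColumn(0, False, False)]
print(range_query([1, 3, 5, 7, 9], 0, sorted_meta, 3, 7))  # binary search path
print(range_query([9, 1, 7, 3, 5], 0, [], 3, 7))           # linear scan path
```

Without the metadata, a reader has to take the scan path for every row group, even when the file was written sorted.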
### Component(s)
Python