ei-grad opened a new issue, #35331:
URL: https://github.com/apache/arrow/issues/35331

   ### Describe the enhancement requested
   
   ## Summary
   
   Currently, the `pyarrow.parquet.RowGroupMetaData` class does not expose the 
`sorting_columns` information available in the Parquet format's `RowGroup` 
struct. This information is useful for users who need to understand the local 
sorting order of columns within each RowGroup. It would be beneficial to expose 
this information in the `RowGroupMetaData` class.
   
   ## Details
   
   The Parquet format includes an optional `sorting_columns` field in the 
`RowGroup` struct, which stores information about the sorting order of columns 
within the RowGroup. This information is defined in the `SortingColumn` struct 
in the `parquet.thrift` file:
   
   ```
   struct SortingColumn {
     1: required i32 column_idx;
     2: required bool descending;
     3: required bool nulls_first;
   }
   ```
   
   In the `RowGroup` struct, the `sorting_columns` field is defined as follows 
(fields after `sorting_columns` are omitted here):
   
   ```
   struct RowGroup {
     1: required list<ColumnChunk> columns;
     2: required i64 total_byte_size;
     3: required i64 num_rows;
     4: optional list<SortingColumn> sorting_columns;
   }
   ```
   
   However, the `pyarrow.parquet.RowGroupMetaData` class does not expose this 
field, so Python users cannot read the sort order recorded for each RowGroup.
   
   ## Proposal
   
   I propose exposing the `sorting_columns` information on the 
`RowGroupMetaData` class, either as a method such as `get_sorting_columns()` or 
as a property such as `sorting_columns`. For each sorting column, the output 
should include the column index, the sort direction (ascending or descending), 
and whether null values appear first or last in the sorted order.
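
   As a rough illustration of the output described above, the sketch below 
models one possible return value in plain Python. The `SortingColumn` dataclass 
and the `describe` helper are hypothetical names chosen for this example, not a 
statement of how pyarrow would implement the feature.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SortingColumn:
    """Hypothetical Python mirror of the Thrift `SortingColumn` struct."""
    column_idx: int    # index of the column within the row group
    descending: bool   # True if the column is sorted in descending order
    nulls_first: bool  # True if nulls are placed before non-null values


def describe(sorting_columns: List[SortingColumn]) -> List[str]:
    """Render each sort key as a readable string, in the order given."""
    out = []
    for sc in sorting_columns:
        direction = "DESC" if sc.descending else "ASC"
        nulls = "NULLS FIRST" if sc.nulls_first else "NULLS LAST"
        out.append(f"column {sc.column_idx} {direction} {nulls}")
    return out


# What a hypothetical `row_group.sorting_columns` property might return
example = [SortingColumn(0, False, False), SortingColumn(2, True, True)]
print(describe(example))
```

   A list of small immutable records keeps the sort keys in the order they are 
stored in the file, which matters when a RowGroup is sorted by more than one 
column.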
   
   ## Use Case
   
   Users working with sorted Parquet files benefit from knowing the local 
sorting order of columns within RowGroups. With that knowledge, operations on 
large datasets that depend on sort order, such as range queries, filtering, or 
merging, can rely on the existing order instead of re-scanning or re-sorting 
the data.
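
   To make the range-query case concrete, here is a minimal sketch in plain 
Python. It assumes the values of one column in one RowGroup are already loaded 
into a list without nulls, and that a flag derived from the sort metadata says 
whether they are sorted ascending. `range_query` and `sorted_ascending` are 
hypothetical names used only for this example.

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence


def range_query(values: Sequence[int], lo: int, hi: int,
                sorted_ascending: bool) -> List[int]:
    """Return the values v with lo <= v <= hi, in their stored order."""
    if sorted_ascending:
        # Sort order is known: binary-search for the two boundaries
        start = bisect_left(values, lo)
        stop = bisect_right(values, hi)
        return list(values[start:stop])
    # Sort order is unknown: every value has to be checked
    return [v for v in values if lo <= v <= hi]


# Sorted RowGroup: two binary searches locate the matching slice
print(range_query([1, 3, 5, 7, 9], 3, 7, sorted_ascending=True))
# Unsorted RowGroup: falls back to a full scan
print(range_query([9, 1, 7, 3, 5], 3, 7, sorted_ascending=False))
```

   Without the sort metadata a reader cannot safely take the first branch, 
which is why exposing it matters for this kind of query.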
   
   By exposing the `sorting_columns` information in the `RowGroupMetaData` 
class, users can more easily work with sorted Parquet files and perform 
advanced data processing operations.
   
   ### Component(s)
   
   Python

