alamb opened a new issue, #3090: URL: https://github.com/apache/arrow-rs/issues/3090
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Storing sorted data in parquet is often a key performance technique, as it "clusters" data in ways that can make predicate evaluation and other query techniques faster.

The parquet file format contains a way to encode the sortedness of the data stored there using a "SortingColumn": https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L685-L698

This is then stored in the RowGroup metadata: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L829-L832

However, I did not find any code to read/write this metadata yet in the parquet crate: https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/apache/arrow-rs%24+SortingColumn&patternType=standard

**Describe the solution you'd like**

I would like some way to provide the parquet writer with the `SortingColumn` list when creating `RowGroupMetaData`. Perhaps we could add something to the `WriterProperties`: https://docs.rs/parquet/26.0.0/parquet/file/properties/struct.WriterProperties.html

Likewise, I would like a way to get the relevant `SortingColumn` list from `RowGroupMetaData`: https://docs.rs/parquet/26.0.0/parquet/file/metadata/struct.RowGroupMetaData.html

**Describe alternatives you've considered**

It might be worth considering having the parquet writer determine automatically whether the data is sorted (maybe this would be better than making the caller verify it)? However, verifying in the writer would likely be a significant performance hit. I also

**Additional context**

DataFusion is getting more sophisticated in its ability to track and use sortedness information (e.g. https://github.com/apache/arrow-datafusion/pull/4122).
If this metadata were included in the parquet file, DataFusion might be able to take more advantage of it (TODO datafusion ticket link).

There is more discussion about this topic here: https://github.com/apache/arrow-datafusion/issues/4169#issuecomment-1311572149
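
To make the request more concrete, here is a rough, std-only sketch of the API shape I have in mind. The three fields of `SortingColumn` mirror the thrift definition linked above; everything else (`WriterPropertiesSketch`, `RowGroupMetaDataSketch`, `with_sorting_columns`, `close_row_group`) is a hypothetical stand-in invented for illustration and is not the parquet crate's actual API. In this sketch the caller declares the sort order and the writer copies it into the row group metadata without verifying it.

```rust
/// Mirrors the three fields of `SortingColumn` in parquet.thrift.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SortingColumn {
    /// Index of the column within the row group.
    pub column_idx: i32,
    /// True if the column is sorted in descending order.
    pub descending: bool,
    /// True if nulls sort before non-null values.
    pub nulls_first: bool,
}

/// Hypothetical stand-in for `WriterProperties` (not the real type).
#[derive(Debug, Clone, Default)]
pub struct WriterPropertiesSketch {
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl WriterPropertiesSketch {
    /// Hypothetical setter: the caller asserts the sort order.
    /// Nothing here checks that the data is actually sorted.
    pub fn with_sorting_columns(mut self, cols: Vec<SortingColumn>) -> Self {
        self.sorting_columns = Some(cols);
        self
    }
}

/// Hypothetical stand-in for `RowGroupMetaData` (not the real type).
#[derive(Debug, Clone)]
pub struct RowGroupMetaDataSketch {
    num_rows: i64,
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl RowGroupMetaDataSketch {
    /// Number of rows recorded for this row group.
    pub fn num_rows(&self) -> i64 {
        self.num_rows
    }

    /// Declared sort order, or `None` if the writer was not given one.
    pub fn sorting_columns(&self) -> Option<&[SortingColumn]> {
        self.sorting_columns.as_deref()
    }
}

/// Stand-in for the writer closing a row group: the declared sort
/// order is copied from the properties into the row group metadata.
pub fn close_row_group(
    props: &WriterPropertiesSketch,
    num_rows: i64,
) -> RowGroupMetaDataSketch {
    RowGroupMetaDataSketch {
        num_rows,
        sorting_columns: props.sorting_columns.clone(),
    }
}
```

A reader such as DataFusion would then call something like `sorting_columns()` on each row group and treat `None` as "order unknown". Keeping the value optional means existing writers that never declare an order keep producing the same metadata as before.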
