[GitHub] [arrow-rs] alamb opened a new issue, #3090: Expose `SortingColumn` when reading and writing parquet metadata

GitBox Fri, 11 Nov 2022 07:34:29 -0800


alamb opened a new issue, #3090:
URL: https://github.com/apache/arrow-rs/issues/3090


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Storing sorted data in parquet is often a key performance technique as it 
"clusters" data in interesting ways that can make predicate evaluation and 
other query techniques faster. 
   
   The parquet file format can encode the sort order of the data it stores 
using a `SortingColumn` struct:
   
https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L685-L698
   
   This is then stored as a list in the RowGroup metadata:
   
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L829-L832
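
   For readers who do not want to follow the links, here is a rough Rust 
mirror of those two Thrift definitions. The field names follow 
`parquet.thrift`; the Rust types, the trimmed-down `RowGroup`, and the 
`example_row_group` helper are assumptions made for illustration and are not 
code from the parquet crate.

```rust
/// Illustrative Rust mirror of the Thrift `SortingColumn` struct.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortingColumn {
    /// Index of the column (within the row group) the data is sorted by.
    pub column_idx: i32,
    /// If true, the column is sorted in descending order.
    pub descending: bool,
    /// If true, nulls come before non-null values; otherwise they go last.
    pub nulls_first: bool,
}

/// Trimmed-down stand-in for the Thrift `RowGroup` struct. Only the fields
/// needed for this example are kept.
#[derive(Debug, Clone, Default)]
pub struct RowGroup {
    pub num_rows: i64,
    /// Optional in the format: `None` means no sort order is declared.
    pub sorting_columns: Option<Vec<SortingColumn>>,
}

/// Hypothetical helper: a row group sorted by column 0 ascending (nulls
/// first), then by column 2 descending (nulls last).
pub fn example_row_group() -> RowGroup {
    RowGroup {
        num_rows: 1_000,
        sorting_columns: Some(vec![
            SortingColumn { column_idx: 0, descending: false, nulls_first: true },
            SortingColumn { column_idx: 2, descending: true, nulls_first: false },
        ]),
    }
}
```

   The order of the entries matters: the list describes a lexicographic sort, 
with the first entry as the primary key.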
    
   However, I did not find any code in the parquet crate that reads or writes 
this metadata yet:
   
https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/apache/arrow-rs%24+SortingColumn&patternType=standard
   
   
   **Describe the solution you'd like**
   
   I would like some way to provide the parquet writer with the 
`SortingColumn` list when creating `RowGroupMetaData`
   
   Perhaps we could add something to `WriterProperties`:
   
   
https://docs.rs/parquet/26.0.0/parquet/file/properties/struct.WriterProperties.html
   
   Likewise, I would like a way to get the relevant `SortingColumn` list from 
`RowGroupMetaData`: 
   
https://docs.rs/parquet/26.0.0/parquet/file/metadata/struct.RowGroupMetaData.html
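
   To make the request concrete, here is one possible shape for the API, 
written as a self-contained sketch. Everything in it is hypothetical: the 
types are local stand-ins that only borrow the names `WriterProperties` and 
`RowGroupMetaData`, and `set_sorting_columns`, `sorting_columns`, and 
`from_properties` are names I made up for this example, not existing methods 
in parquet 26.0.0.

```rust
/// Local stand-in for the format's `SortingColumn` (same three fields).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortingColumn {
    pub column_idx: i32,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Hypothetical stand-in for `WriterProperties`.
#[derive(Debug, Clone, Default)]
pub struct WriterProperties {
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl WriterProperties {
    pub fn builder() -> WriterPropertiesBuilder {
        WriterPropertiesBuilder::default()
    }

    /// The sort order the caller declared, if any.
    pub fn sorting_columns(&self) -> Option<&[SortingColumn]> {
        self.sorting_columns.as_deref()
    }
}

#[derive(Debug, Default)]
pub struct WriterPropertiesBuilder {
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl WriterPropertiesBuilder {
    /// Hypothetical setter: the caller asserts the data is already sorted.
    pub fn set_sorting_columns(mut self, cols: Option<Vec<SortingColumn>>) -> Self {
        self.sorting_columns = cols;
        self
    }

    pub fn build(self) -> WriterProperties {
        WriterProperties { sorting_columns: self.sorting_columns }
    }
}

/// Hypothetical stand-in for `RowGroupMetaData`.
#[derive(Debug, Clone)]
pub struct RowGroupMetaData {
    num_rows: i64,
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl RowGroupMetaData {
    /// What a writer might do when it closes a row group: copy the declared
    /// sort order from the properties into the row group metadata.
    pub fn from_properties(num_rows: i64, props: &WriterProperties) -> Self {
        Self {
            num_rows,
            sorting_columns: props.sorting_columns().map(|c| c.to_vec()),
        }
    }

    pub fn num_rows(&self) -> i64 {
        self.num_rows
    }

    /// Read side: the sort order recorded for this row group, if any.
    pub fn sorting_columns(&self) -> Option<&[SortingColumn]> {
        self.sorting_columns.as_deref()
    }
}
```

   In this sketch the writer trusts the caller and simply copies the declared 
order through; a reader would then call `sorting_columns()` on each row group.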
   
   
   **Describe alternatives you've considered**
   It might be worth having the parquet writer determine automatically 
whether the data is sorted, rather than relying on the caller to verify it. 
However, verifying in the writer would likely incur a significant performance 
hit. 
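
   To give a feel for that cost, here is a minimal sketch of the kind of 
check a writer would need, for a single nullable `i64` column. The function 
names and signatures are hypothetical and not part of the parquet crate; a 
real check would also have to handle multi-column (lexicographic) orderings 
and every physical type.

```rust
use std::cmp::Ordering;

/// Compare two nullable values under the `descending` / `nulls_first` flags
/// used by `SortingColumn`.
fn compare(a: &Option<i64>, b: &Option<i64>, descending: bool, nulls_first: bool) -> Ordering {
    match (a, b) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => {
            if nulls_first { Ordering::Less } else { Ordering::Greater }
        }
        (Some(_), None) => {
            if nulls_first { Ordering::Greater } else { Ordering::Less }
        }
        (Some(x), Some(y)) => {
            let ord = x.cmp(y);
            if descending { ord.reverse() } else { ord }
        }
    }
}

/// Returns true if no adjacent pair is out of order. This does one
/// comparison per adjacent pair, so the work grows with every row written.
pub fn is_sorted_by_flags(values: &[Option<i64>], descending: bool, nulls_first: bool) -> bool {
    values
        .windows(2)
        .all(|w| compare(&w[0], &w[1], descending, nulls_first) != Ordering::Greater)
}
```

   The check is linear in the number of rows, but it would run on every 
write, including for callers who never declare or use a sort order.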
   
   I also 
   
   
   **Additional context**
   DataFusion is getting more sophisticated in its ability to track and use 
sortedness information (e.g. 
https://github.com/apache/arrow-datafusion/pull/4122). If this metadata were 
included in the parquet file, DataFusion might be able to make better use of 
it (TODO datafusion ticket link)
   
   
   There is more discussion about this topic here 
https://github.com/apache/arrow-datafusion/issues/4169#issuecomment-1311572149
   


