alamb commented on issue #4169:
URL: https://github.com/apache/arrow-datafusion/issues/4169#issuecomment-1311572149

   >  One alternative would be to encode the "sorted by" property into the 
parquet file itself. Sure that's more effort, but I kinda think that it would 
be nicer for the ecosystem. This metadata would be optional and solely help 
optimization (although if specified, it must be correct). This is very similar 
to statistics.
   
   Yes, I agree it would be nice if there were a standard way to do so.
   
   I poked around in the format definition and it seems like there is a 
standard way to encode the sort order in each Row Group's metadata:
   
   The format defines a `SortingColumn` struct:
   
https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L685-L698
   
   This struct is then referenced from the `RowGroup` metadata:
   
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L829-L832
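
   For readers who do not want to follow the links, here is a minimal Rust
   sketch of what that metadata expresses. The struct mirrors the three fields
   of the Thrift `SortingColumn` definition (`column_idx`, `descending`,
   `nulls_first`). It is a hand-written illustration, not a type from the
   parquet crate. The helpers `compare` and `is_sorted_by` are hypothetical
   names, and rows are simplified to nullable `i64` values.

```rust
use std::cmp::Ordering;

/// Mirrors the fields of the Thrift `SortingColumn` struct in parquet.thrift.
/// Hand-written for illustration; NOT the parquet crate's own type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SortingColumn {
    /// Column index within the row group.
    pub column_idx: i32,
    /// True if the column is sorted in descending order.
    pub descending: bool,
    /// True if nulls come before non-null values, otherwise nulls go last.
    pub nulls_first: bool,
}

/// Hypothetical helper: compare two nullable values under one sorting column.
fn compare(a: Option<i64>, b: Option<i64>, col: &SortingColumn) -> Ordering {
    match (a, b) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => {
            if col.nulls_first { Ordering::Less } else { Ordering::Greater }
        }
        (Some(_), None) => {
            if col.nulls_first { Ordering::Greater } else { Ordering::Less }
        }
        (Some(x), Some(y)) => {
            if col.descending { y.cmp(&x) } else { x.cmp(&y) }
        }
    }
}

/// Hypothetical helper: check that `rows` obey the declared sort order.
/// Each row is a vector of nullable values, indexed by column position.
pub fn is_sorted_by(rows: &[Vec<Option<i64>>], order: &[SortingColumn]) -> bool {
    rows.windows(2).all(|pair| {
        // Lexicographic comparison: the first non-equal column decides.
        let ord = order
            .iter()
            .map(|c| {
                let i = c.column_idx as usize;
                compare(pair[0][i], pair[1][i], c)
            })
            .find(|o| *o != Ordering::Equal)
            .unwrap_or(Ordering::Equal);
        ord != Ordering::Greater
    })
}
```

   A check like `is_sorted_by` is the kind of validation a writer could run
   before recording the list in the row group, since the quoted comment notes
   that the metadata must be correct whenever it is specified.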
   
   However, I did not find any code that reads or writes this in the parquet crate:
   
https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/apache/arrow-rs%24+SortingColumn&patternType=standard
   
   I will file some follow-on tickets to properly support this in parquet and 
in datafusion.
   
   

