alamb commented on issue #4169: URL: https://github.com/apache/arrow-datafusion/issues/4169#issuecomment-1311572149
> One alternative would be to encode the "sorted by" property into the parquet file itself. Sure that's more effort, but I kinda think that it would be nicer for the ecosystem. This metadata would be optional and solely help optimization (although if specified, it must be correct). This is very similar to statistics.

Yes, I agree it would be nice if there were a standard way to do so.

I poked around in the format definition, and it seems there is a standard way to encode the sort order in each Row Group's metadata.

There is a `SortingColumn` struct in the format:
https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L685-L698

It is then referenced in the `RowGroup` metadata:
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L829-L832

However, I did not find any code to read or write this in the parquet crate:
https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/apache/arrow-rs%24+SortingColumn&patternType=standard

I will file some follow-on tickets to properly support this in parquet and in datafusion.
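To make the idea concrete, here is a minimal Rust sketch of how a reader might use per-row-group sort metadata. The `SortingColumn` fields mirror the Thrift struct linked above (`column_idx`, `descending`, `nulls_first`). Everything else is hypothetical: `RowGroupMeta` and `common_sort_order` are illustrative names, not part of the parquet crate or DataFusion. The sketch assumes a file can only be treated as sorted when every row group declares the same non-empty order.

```rust
/// Mirrors the `SortingColumn` struct from parquet.thrift.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortingColumn {
    /// Index of the column within the row group.
    pub column_idx: i32,
    /// True if the column is sorted in descending order.
    pub descending: bool,
    /// True if nulls sort before non-null values.
    pub nulls_first: bool,
}

/// Hypothetical stand-in for row group metadata. Only the optional
/// `sorting_columns` field of the Thrift `RowGroup` struct is modeled.
#[derive(Debug, Clone, Default)]
pub struct RowGroupMeta {
    pub sorting_columns: Option<Vec<SortingColumn>>,
}

/// Returns the sort order declared by every row group, or `None` if there
/// are no row groups, any row group omits the metadata, the declared order
/// is empty, or the row groups disagree.
pub fn common_sort_order(row_groups: &[RowGroupMeta]) -> Option<&[SortingColumn]> {
    let (first, rest) = row_groups.split_first()?;
    let order = first.sorting_columns.as_deref()?;
    if order.is_empty() {
        return None;
    }
    rest.iter()
        .all(|rg| rg.sorting_columns.as_deref() == Some(order))
        .then_some(order)
}
```

Because the metadata is optional, a missing or inconsistent value simply means the optimization is skipped. Note that this only establishes the order within each row group; whether the row groups are also ordered relative to one another would need a separate check, for example against column statistics.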
