wjones127 commented on PR #35453: URL: https://github.com/apache/arrow/pull/35453#issuecomment-1539066125
> Though one challenge would be that a single file doesn't necessarily tell you about an entire dataset. For example, if the file foo.parquet is sorted by "date" then are all files in that dataset sorted by date? Does foo.parquet come before bar.parquet? Yeah, if you wanted to do that, you'd have to combine it with other metadata to get the sorting relationship between partitions and within partitions. Between partitions, you could sort the partition values and prepend that to the sorting order. For within partitions, you could aggregate the row group stats to file-level stats, and use those to see if they combine to an overall ordering. So, for example, a dataset like: ``` part=A/file1.parquet (min x: 1, max x: 10) part=A/file1.parquet (min x: 10, max x: 20) part=B/file1.parquet (min x: 5, max x: 20) ``` If each of the files returned a sort order of `x, ascending`, then you could construct the dataset to have an overall ordering of `[(part, ascending), (x, ascending)]`. Honestly, it would be nice just to be able to get a dataset to be considered "ordered" by its partition values. For example, if I had two tables partitioned by date, it would be nice to be able to join them via a sort merge join on that date column while using the inherent sorting of the partition values. (Having flashbacks to some frustrations in my days working with PySpark.) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
