[GitHub] [arrow] wjones127 commented on pull request #35453: GH-35331: [Python] Expose Parquet sorting metadata

via GitHub Mon, 08 May 2023 14:19:26 -0700


wjones127 commented on PR #35453:
URL: https://github.com/apache/arrow/pull/35453#issuecomment-1539066125


   > Though one challenge would be that a single file doesn't necessarily tell 
you about an entire dataset. For example, if the file foo.parquet is sorted by 
"date" then are all files in that dataset sorted by date? Does foo.parquet come 
before bar.parquet?
   
   Yeah, if you wanted to do that, you'd have to combine it with other metadata 
to get the sorting relationship between partitions and within partitions. 
Between partitions, you could sort the partition values and prepend that to the 
sorting order. For within partitions, you could aggregate the row group stats 
to file-level stats, and use those to see if they combine to an overall 
ordering.
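A rough sketch of the within-partition step, assuming the per-row-group min/max values have already been read for the sort column (in pyarrow they could come from `pq.ParquetFile(path).metadata.row_group(i).column(j).statistics`). The `FileStats` class and both helper functions are hypothetical names for illustration, not part of any Arrow API, and the check only covers a single ascending column:

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class FileStats:
    """File-level min/max for the sort column (hypothetical helper type)."""
    path: str
    min_value: Any
    max_value: Any


def file_level_stats(path: str, row_group_stats: List[Tuple[Any, Any]]) -> FileStats:
    """Collapse per-row-group (min, max) pairs into one file-level range."""
    mins = [lo for lo, _ in row_group_stats]
    maxs = [hi for _, hi in row_group_stats]
    return FileStats(path, min(mins), max(maxs))


def order_files(files: List[FileStats]) -> Optional[List[FileStats]]:
    """Return the files in ascending order if their ranges chain, else None.

    Assumes every file is itself sorted ascending on the column. Ranges may
    touch at a boundary value, but must not otherwise overlap.
    """
    ordered = sorted(files, key=lambda f: (f.min_value, f.max_value))
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev.max_value > nxt.min_value:
            return None  # overlapping ranges: no overall ordering
    return ordered
```

If `order_files` returns `None`, the files would still each be sorted internally, but reading them back to back would not give a sorted partition.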
   
   So, for example, a dataset like:
   
   ```
   part=A/file1.parquet (min x: 1, max x: 10)
   part=A/file2.parquet (min x: 10, max x: 20)
   part=B/file1.parquet (min x: 5, max x: 20)
   ```
   
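   If each of the files reported a sort order of `x, ascending`, then you could 
treat the dataset as having an overall ordering of `[(part, ascending), (x, 
ascending)]`, since the `x` ranges of the files within each partition don't 
overlap beyond a shared boundary value.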
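To make that concrete, here is a minimal sketch that derives the overall ordering for the example above. It assumes the file-level min/max of `x` are already known and that every file reports the same sort order. `dataset_ordering` is a hypothetical helper, not an existing pyarrow function, and the second file in `part=A` is named `file2.parquet` here (an assumption) so that the paths are distinct:

```python
from collections import defaultdict

# (partition value, file path, min x, max x), mirroring the example listing.
files = [
    ("A", "part=A/file1.parquet", 1, 10),
    ("A", "part=A/file2.parquet", 10, 20),  # file name assumed
    ("B", "part=B/file1.parquet", 5, 20),
]


def dataset_ordering(files, file_sort_order):
    """Return (ordering, ordered_paths), or None if no overall ordering holds.

    file_sort_order is the order each file reports, e.g. [("x", "ascending")].
    Only a single ascending sort column is handled in this sketch.
    """
    by_part = defaultdict(list)
    for part, path, lo, hi in files:
        by_part[part].append((lo, hi, path))

    ordered_paths = []
    for part in sorted(by_part):  # between partitions: sort partition values
        ranges = sorted(by_part[part])  # within a partition: sort by min, max
        for (_, prev_hi, _), (next_lo, _, _) in zip(ranges, ranges[1:]):
            if prev_hi > next_lo:
                return None  # file ranges overlap inside this partition
        ordered_paths.extend(path for _, _, path in ranges)

    # Prepend the partition column to the per-file sort order.
    ordering = [("part", "ascending")] + list(file_sort_order)
    return ordering, ordered_paths


result = dataset_ordering(files, [("x", "ascending")])
```

Note that `part=B` starting at `x = 5` is fine: `x` only has to be ordered within each `part` value, because `part` comes first in the ordering.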
   
   Honestly, it would be nice just to be able to get a dataset to be considered 
"ordered" by its partition values. For example, if I had two tables partitioned 
by date, it would be nice to be able to join them via a sort merge join on that 
date column while using the inherent sorting of the partition values. (Having 
flashbacks to some frustrations in my days working with PySpark.)


