[GitHub] [arrow] westonpace commented on pull request #35453: GH-35331: [Python] Expose Parquet sorting metadata

via GitHub Tue, 09 May 2023 08:56:21 -0700


westonpace commented on PR #35453:
URL: https://github.com/apache/arrow/pull/35453#issuecomment-1540452046


   > Honestly, it would be nice just to be able to get a dataset to be 
considered "ordered" by its partition values. For example, if I had two tables 
partitioned by date, it would be nice to be able to join them via a sort merge 
join (or maybe asof join?) on that date column while using the inherent sorting 
of the partition values. (Having flashbacks to some frustrations in my days 
working with PySpark.)
   
   We are quite close to being able to do this.  In particular, I think we are 
very close to being able to take a query like "SELECT MEAN(x) FROM dataset 
GROUP BY part" and, instead of having aggregation be a pipeline breaker, have 
it scan values in partition order and emit values as soon as a partition is 
exhausted.  Here is what we have, and what is missing:
   
    * DONE: ExecBatch has an `index` property that records its position in the 
sequence
    * DONE: ExecNode has an `ordering` property which says which columns its 
output is ordered by
    * DONE: The scan node has the ability to scan a dataset in order.  This 
means the AsyncGenerator<ExecBatch> will emit exec batches in a deterministic 
order (filename, batch_in_file)
    * NEEDED: The scan node needs to detect if the dataset is partitioned, ask 
the dataset for fragments ordered by partition, and then assign its `ordering` 
property based on the partition columns (this sounds harder than it is)
    * DONE: The aggregate node is able to define "segment keys" which, if 
given, it will use to break up the output
    * NEEDED: The aggregate node's segment keys currently only work if run in 
single threaded mode.  We need to support parallelism.  This should be easy 
to achieve by using a sequencing queue
    * NOTE: The above assumes that the user intelligently crafts the plan (e.g. 
they know to assign "part" as a segment key when creating the aggregate node).  
Ideally, this is something that would happen in a planner (it would notice 
that "part" is a partition key, figure out that it can configure the scan to 
emit in sequence, and then use "part" when creating the aggregate node).  
Without a planner, this feature will need to be exploited by reasonably 
intelligent users.  However, this is in line with Acero being a dumb execution 
engine without a planner.
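
   To make the idea concrete, here is a rough pure-Python sketch of the two 
pieces described above: a sequencing queue that restores batch order, feeding 
an aggregation that emits a mean as soon as the segment key changes.  This is 
not Acero code.  The `Batch` tuple and the `sequence` and `segmented_mean` 
functions are hypothetical names made up for illustration, and the sketch 
assumes batch indices start at 0 with no gaps.

```python
import heapq
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical batch shape for this sketch: (index, partition key, x values).
Batch = Tuple[int, str, List[float]]


def sequence(batches: Iterable[Batch]) -> Iterator[Batch]:
    """Re-emit batches in index order, buffering any that arrive early."""
    heap: List[Batch] = []
    next_index = 0  # assumption: indices are 0, 1, 2, ... with no gaps
    for batch in batches:
        heapq.heappush(heap, batch)
        # Release every buffered batch that is now next in line.
        while heap and heap[0][0] == next_index:
            yield heapq.heappop(heap)
            next_index += 1


def segmented_mean(batches: Iterable[Batch]) -> Iterator[Tuple[str, float]]:
    """Emit (part, mean) as soon as the partition key changes.

    Only correct if the input is already ordered by partition key.
    """
    current: Optional[str] = None
    total = 0.0
    count = 0
    for _, part, values in batches:
        if current is not None and part != current:
            if count:
                yield current, total / count
            total, count = 0.0, 0
        current = part
        total += sum(values)
        count += len(values)
    # Flush the final segment once the input is exhausted.
    if current is not None and count:
        yield current, total / count


# Batches as they might arrive from parallel readers: out of order.
arrivals: List[Batch] = [
    (1, "2023-01-01", [3.0]),
    (0, "2023-01-01", [1.0, 2.0]),
    (3, "2023-01-02", [10.0]),
    (2, "2023-01-02", [20.0]),
]
results = list(segmented_mean(sequence(arrivals)))
```

   The aggregation only ever holds state for one partition at a time, which is 
why it stops being a pipeline breaker.  The sequencing step is what lets the 
upstream work run in parallel while the aggregation still sees batches in 
partition order.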


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
