andygrove commented on pull request #8029: URL: https://github.com/apache/arrow/pull/8029#issuecomment-678784408
> This is an impressive simplification and improvement. Really great work, @andygrove ! > > I went through it and could not find any issue with it, only benefits. > > To make sure I understood this: > > * scans infer the number of partitions based on the number of files they scan > > * the number of partitions is annotated in `Partitioning::UnknownPartitioning(N)` > > * the number of partitions is exposed to others via `output_partitioning` > > * others use `output_partitioning` to know up to which partition number they need to run against to run through all input partitions. > > > The invariant is that the maximum number that `execute(...)` accepts equals to the number exposed in `output_partitioning`, i.e. if a plan passes through that number, a `panic!` happens. Yes, your understanding is correct on all of these points. > I could think of an alternative approach where a plan outputs an iterator over partitions, so that others can iterate against (which would avoid us having to guarantee the invariant mentioned above through discipline), but you probably already though through this alternative :) Typically, we want to try and execute the partitions in parallel, either on separate threads in a single process or in separate processes in a cluster. We may want to use `async` for the execute method in the future. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org