alamb opened a new pull request #1748: URL: https://github.com/apache/arrow-datafusion/pull/1748
# Which issue does this PR close? Along with https://github.com/apache/arrow-datafusion/pull/1732, Fixes https://github.com/apache/arrow-datafusion/issues/423 (the last part). # Rationale for this change Repartitioning the input to an operator that relies on its input to be sorted is incorrect as the repartitioning will intermix the partitions and effectively "unsort" the input stream We found this in IOx here https://github.com/influxdata/influxdb_iox/pull/3633#issuecomment-1030126757 Here is a picture showing the problem: ``` ┌─────────────────────────────────┐ │ │ │ SortPreservingMergeExec │ │ │ └─────────────────────────────────┘ ▲ ▲ ▲ Input may not │ │ │ be sorted! ┌───────┘ │ └───────┐ │ │ │ │ │ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │ │ │ │ │ │ batch A1 │ │ batch A3 │ │ batch B3 │ │ │ │ │ │ │ ├───────────┤ ├───────────┤ ├───────────┤ │ │ │ │ │ │ │ batch B2 │ │ batch B1 │ │ batch A2 │ │ │ │ │ │ │ └───────────┘ └───────────┘ └───────────┘ ▲ ▲ ▲ │ │ │ └─────────┐ │ ┌──────────┘ │ │ │ Outputs │ │ │ batches are ┌─────────────────────────────────┐ repartitioned │ RepartitionExec(3) │ and may not │ RoundRobin │ remain sorted │ │ └─────────────────────────────────┘ ▲ ▲ │ │ Inputs are ┌─────┘ └─────┐ sorted │ │ │ │ │ │ ┌───────────┐ ┌───────────┐ │ │ │ │ │ batch A1 │ │ batch B1 │ │ │ │ │ ├───────────┤ ├───────────┤ │ │ │ │ │ batch A2 │ │ batch B2 │ │ │ │ │ ├───────────┤ ├───────────┤ │ │ │ │ │ batch A3 │ │ batch B3 │ │ │ │ │ └───────────┘ └───────────┘ Sorted Input Sorted Input A B ``` The streams need to remain the way they were ``` ┌─────────────────────────────────┐ │ │ │ SortPreservingMergeExec │ │ │ └─────────────────────────────────┘ ▲ ▲ │ │ Inputs are ┌─────┘ └─────┐ sorted, as │ │ required │ │ │ │ ┌───────────┐ ┌───────────┐ │ │ │ │ │ batch A1 │ │ batch B1 │ │ │ │ │ ├───────────┤ ├───────────┤ │ │ │ │ │ batch A2 │ │ batch B2 │ │ │ │ │ ├───────────┤ ├───────────┤ │ │ │ │ │ batch A3 │ │ batch B3 │ │ │ │ │ └───────────┘ └───────────┘ Sorted Input Sorted Input A B ``` # What changes are included in this PR? 1. use flag introduced in https://github.com/apache/arrow-datafusion/pull/1732 to not repartition the input to `SortPreservingMerge` operator 2. Test to ensure it is not repartitioned # Are there any user-facing changes? Probably not (yet) as `SortPreservingMerge` isn't used as part of SQL planning \ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
