kosiew commented on issue #18513: URL: https://github.com/apache/datafusion/issues/18513#issuecomment-3695592071
Hi @AdamGS, I investigated this. The extra RepartitionExec in the filtered plan isn't an arbitrary inefficiency. The distribution optimizer inserts it to raise the number of partitions when it estimates that parallel round-robin repartitioning will be beneficial. This behavior is governed by the target_partitions, enable_round_robin_repartition, and repartition_file_scans settings:

https://github.com/kosiew/datafusion/blob/4960284541a8394034fd7f82833571fd601633bf/datafusion/physical-optimizer/src/enforce_distribution.rs#L1181-L1341

Since file-scan repartitioning is enabled by default, even small inputs may be repartitioned for parallelism. If the overhead outweighs the benefit for tiny datasets, you can turn it off or lower target_partitions:

https://github.com/kosiew/datafusion/blob/4960284541a8394034fd7f82833571fd601633bf/datafusion/common/src/config.rs#L952-L966
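
For reference, a minimal sketch of how those settings could be changed per session from SQL (for example in datafusion-cli). The fully qualified key names are the ones DataFusion exposes for these three options. The table `t` and column `a` are hypothetical placeholders, and whether the extra RepartitionExec actually disappears depends on your query and data, so treat this as something to experiment with rather than a guaranteed fix:

```sql
-- Lower the partition count the optimizer aims for (defaults to the number of CPU cores).
SET datafusion.execution.target_partitions = 1;

-- Alternatively, keep target_partitions and disable the repartitioning options individually.
SET datafusion.optimizer.enable_round_robin_repartition = false;
SET datafusion.optimizer.repartition_file_scans = false;

-- Re-check the physical plan; `t` and `a` are hypothetical names.
EXPLAIN SELECT * FROM t WHERE a > 10;
```

Comparing the EXPLAIN output before and after changing each setting should show which option is responsible for the RepartitionExec in your plan.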
