milenkovicm commented on PR #1752: URL: https://github.com/apache/datafusion-ballista/pull/1752#issuecomment-4597180155
After spending the long weekend with this PR, the key takeaways:

Setting
```
datafusion.optimizer.hash_join_single_partition_threshold=1048576
```
works, but not all stages report an actual byte size, so we miss chances for further broadcast optimization.

Setting
```
datafusion.optimizer.hash_join_single_partition_threshold_rows=100_000
```
at some point gets tripped up by a wrong reported row count, which makes a stage processing billions of rows single-partitioned. That stage pushes job execution into the 200-second range.

Possible follow-ups:
- find the reason for the wrong statistics
- when a join is not collect left, we make it partitioned and run both stages at the same time. We should change this and make an educated guess to run one of them first, hoping its output can be folded into a broadcast

I will do code cleanup, merge this PR, and follow up with improvements, as it's definitely improving things.
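To make the two failure modes above concrete, here is a minimal Python sketch of a threshold check of this kind. It is an illustration only, not the DataFusion or Ballista implementation: `StageStats` and `can_broadcast` are hypothetical names, and the rule "prefer byte size, fall back to row count, otherwise do not broadcast" is an assumption about how such a check might behave.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StageStats:
    """Hypothetical statistics reported by a completed stage."""
    total_byte_size: Optional[int] = None  # None means "not reported"
    num_rows: Optional[int] = None         # None means "not reported"


def can_broadcast(
    stats: StageStats,
    byte_threshold: int = 1_048_576,  # mirrors hash_join_single_partition_threshold
    row_threshold: int = 100_000,     # mirrors hash_join_single_partition_threshold_rows
) -> bool:
    """Return True if the stage output looks small enough to collect into one partition.

    Assumed rule: use the byte size when it is reported, otherwise fall back
    to the row count, and refuse to broadcast when neither is known.
    """
    if stats.total_byte_size is not None:
        return 0 < stats.total_byte_size < byte_threshold
    if stats.num_rows is not None:
        return 0 < stats.num_rows < row_threshold
    return False


# Byte size reported and small: safe to broadcast.
print(can_broadcast(StageStats(total_byte_size=500_000)))

# Byte size missing: the decision rests entirely on the row count, so an
# under-reported row count (say 1_000 when the stage really holds billions
# of rows) would still be accepted here.
print(can_broadcast(StageStats(num_rows=1_000)))

# Nothing reported: the broadcast opportunity is missed.
print(can_broadcast(StageStats()))
```

Under these assumptions, a missing byte size either loses the broadcast entirely or pushes the decision onto the row count, and the row count is only as trustworthy as the statistics behind it.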
