milenkovicm commented on PR #1752:
URL: 
https://github.com/apache/datafusion-ballista/pull/1752#issuecomment-4597180155

   After spending the long weekend with this PR, here are the key takeaways.
   
   Setting
   
   ```
   datafusion.optimizer.hash_join_single_partition_threshold=1048576
   ```
   
   works, but not all stages report an actual byte size, so we miss some
   chances for further broadcast optimization.
   
   Setting 
   
   ```
   datafusion.optimizer.hash_join_single_partition_threshold_rows=100_000
   ```
   
   at some point gets tricked by a wrong reported row count, which makes a
   stage processing billions of rows single-partitioned. That stage pushes job
   execution time into the 200 s range.
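
   To illustrate why the two thresholds behave differently, here is a minimal
   sketch of a threshold check that prefers byte size and falls back to row
   count. `StageStats` and `fits_single_partition` are hypothetical names, and
   the `<=` comparison is an assumption; this is not the actual DataFusion or
   Ballista code.

   ```rust
   /// Hypothetical per-stage statistics; `None` means the value was not reported.
   #[derive(Debug, Clone, Copy)]
   struct StageStats {
       total_bytes: Option<u64>,
       num_rows: Option<u64>,
   }

   /// Returns true when the stage looks small enough to be collected into a
   /// single partition. Byte size is preferred; row count is only a fallback.
   /// With no statistics at all, the answer is conservatively `false`.
   fn fits_single_partition(stats: StageStats, max_bytes: u64, max_rows: u64) -> bool {
       match (stats.total_bytes, stats.num_rows) {
           (Some(bytes), _) => bytes <= max_bytes,
           (None, Some(rows)) => rows <= max_rows,
           (None, None) => false,
       }
   }
   ```

   Under these assumptions, a stage without a reported byte size is never
   broadcast by the byte threshold alone, while a stage that under-reports its
   row count (say 50_000 instead of billions) passes the row threshold and is
   wrongly collapsed into a single partition.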
   
   Possible follow-ups:
   
   - Find the reason for the wrong statistics.
   - When a join is not CollectLeft, we make it partitioned and run both input
     stages at the same time. We should change this and make an educated guess
     about which of them to run first, hoping its result is small enough for
     the join to be folded into a broadcast.
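
   The second follow-up could be sketched roughly as below. Everything here is
   hypothetical (`Side`, `JoinStrategy`, both functions, and the tie-breaking
   rule of preferring the left side); it only illustrates the idea of running
   one input first and deciding on broadcast from its actual output size.

   ```rust
   #[derive(Debug, PartialEq, Eq, Clone, Copy)]
   enum Side {
       Left,
       Right,
   }

   #[derive(Debug, PartialEq, Eq, Clone, Copy)]
   enum JoinStrategy {
       /// Broadcast the output of the given (already executed) side.
       Broadcast(Side),
       /// Keep the join hash-partitioned on both inputs.
       Partitioned,
   }

   /// Educated guess: run the input with the smaller estimated size first.
   /// If only one side has an estimate, run that side; otherwise default to left.
   fn stage_to_run_first(left_est: Option<u64>, right_est: Option<u64>) -> Side {
       match (left_est, right_est) {
           (Some(l), Some(r)) if r < l => Side::Right,
           (None, Some(_)) => Side::Right,
           _ => Side::Left,
       }
   }

   /// Once the first stage has finished, use its actual byte size to decide
   /// whether the join can be folded into a broadcast.
   fn strategy_after_first_stage(first: Side, actual_bytes: u64, max_bytes: u64) -> JoinStrategy {
       if actual_bytes <= max_bytes {
           JoinStrategy::Broadcast(first)
       } else {
           JoinStrategy::Partitioned
       }
   }
   ```

   The trade-off in this sketch is that the two stages no longer run
   concurrently, so a wrong guess costs latency without gaining a broadcast.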
   
   I will do some code cleanup, merge this PR, and follow up with improvements,
   as it is definitely improving things.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

