aokolnychyi commented on PR #7714: URL: https://github.com/apache/iceberg/pull/7714#issuecomment-1603503322
I gave it a bit of testing on a cluster. In some cases, I actually saw a noticeable degradation when the split size was adjusted to a larger value: the shuffle write time increased quite dramatically when I was processing entire records. I think this is related to the fact that Spark has to sort records by reducer ID during the map phase of a shuffle when the bypass merge (hash-style) shuffle writer cannot be used (> 200 reducers). There were cases where the larger split size helped, but it seems too risky to do by default. I will rework this approach to only pick a smaller split size, so that all cluster slots are utilized.
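For illustration, here is a minimal sketch of that shrink-only adjustment. The class name, method signature, and heuristic are assumptions for the sake of the example, not the actual code in this PR: the split size is reduced just enough to produce at least as many splits as there are cluster slots, and is never increased beyond the configured value.

```java
// Hypothetical sketch of a shrink-only split size heuristic (not the PR's code).
public class SplitSizeEstimator {

  /**
   * Returns a split size that yields at least `parallelism` splits for the
   * given scan size, but never exceeds the configured default split size.
   */
  static long adjustedSplitSize(long scanSizeBytes, int parallelism, long defaultSplitSize) {
    if (parallelism <= 0 || scanSizeBytes <= 0) {
      return defaultSplitSize;
    }

    // Split size needed to keep every cluster slot busy.
    long sizeToFillSlots = (long) Math.ceil((double) scanSizeBytes / parallelism);

    // Only shrink: increasing the split size risks the extra map-side sorting
    // cost described above, so never go above the configured value.
    return Math.min(defaultSplitSize, Math.max(sizeToFillSlots, 1L));
  }

  public static void main(String[] args) {
    long scanSize = 10L * 1024 * 1024 * 1024; // 10 GiB scan
    int slots = 400;                          // executor count * cores per executor
    long configured = 512L * 1024 * 1024;     // 512 MiB configured split size

    // 10 GiB / 512 MiB = 20 splits, far fewer than 400 slots, so the split
    // size is shrunk to roughly 25.6 MiB to produce ~400 splits.
    System.out.println(adjustedSplitSize(scanSize, slots, configured));
  }
}
```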
