aokolnychyi commented on PR #7714: URL: https://github.com/apache/iceberg/pull/7714#issuecomment-1603503322
I gave it a bit of testing on a cluster. In some cases, I actually saw a noticeable degradation when the split size was adjusted to a larger value: the shuffle write time increased quite dramatically when I was processing entire records. I think this is related to the fact that Spark has to sort records by reducer ID during the map phase of a shuffle when the bypass merge (hash-style) shuffle writer cannot be used (> 200 reducers). There were cases where the larger split size helped, but it seems too risky to do by default. I will rework this approach to only pick a smaller split size, so that all cluster slots are utilized.
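For illustration, here is a minimal sketch of that shrink-only adjustment. The class name, method signature, and heuristic are assumptions for the sake of the example, not the actual code in this PR: the split size is reduced just enough to produce at least as many splits as there are cluster slots, and is never increased beyond the configured value.

```java
// Hypothetical sketch of a shrink-only split size heuristic (not the PR's code).
public class SplitSizeEstimator {

  /**
   * Returns a split size that yields at least `parallelism` splits for the
   * given scan size, but never exceeds the configured default split size.
   */
  static long adjustedSplitSize(long scanSizeBytes, int parallelism, long defaultSplitSize) {
    if (parallelism <= 0 || scanSizeBytes <= 0) {
      return defaultSplitSize;
    }

    // Split size needed to keep every cluster slot busy.
    long sizeToFillSlots = (long) Math.ceil((double) scanSizeBytes / parallelism);

    // Only shrink: increasing the split size risks the extra map-side sorting
    // cost described above, so never go above the configured value.
    return Math.min(defaultSplitSize, Math.max(sizeToFillSlots, 1L));
  }

  public static void main(String[] args) {
    long scanSize = 10L * 1024 * 1024 * 1024; // 10 GiB scan
    int slots = 400;                          // executor count * cores per executor
    long configured = 512L * 1024 * 1024;     // 512 MiB configured split size

    // 10 GiB / 512 MiB = 20 splits, far fewer than 400 slots, so the split
    // size is shrunk to roughly 25.6 MiB to produce ~400 splits.
    System.out.println(adjustedSplitSize(scanSize, slots, configured));
  }
}
```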
