alexeykudinkin commented on PR #7723:
URL: https://github.com/apache/hudi/pull/7723#issuecomment-1404524518
To validate this, I ran a few benchmarks covering the following scenarios:
- (A) Properly sized input (each partition is ~100 MB)
- (B) Improperly sized input (each partition is <10 MB)
Both scenarios were run with the following configurations:
- With shuffle parallelism set to 200 (the current default)
- With shuffle parallelism NOT set (i.e., relying on the partitioning of the
input dataset, which is ~500 partitions for A and ~6500 for B)
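After this PR (shuffle parallelism not set, compared against the default of 200) we see:
- Roughly **20%** improvement in write performance for scenario A (19.3% by total time)
- Roughly 9% regression in write performance for scenario B (8.6% by total time)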
Conclusions: this PR
- Requires users with improperly sized inputs to configure shuffle
parallelism explicitly (they can simply set the value to 200 to get the
same behavior as before)
- However, it significantly improves out-of-the-box (!) performance for
workloads with reasonably partitioned payloads
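
For anyone who wants to keep the previous behavior, here is a minimal sketch of pinning the parallelism explicitly versus leaving it unset. The helper `hudi_write_options`, the table name `bench_table`, and the choice of `upsert` as the operation are hypothetical and only for illustration; the comment does not say which write operation was benchmarked. The `hoodie.*.shuffle.parallelism` keys are the existing Hudi write configs, and which one applies depends on the operation.

```python
# Hudi write-config keys for shuffle parallelism, keyed by write operation.
# Which operation the benchmark used is an assumption; adjust as needed.
PARALLELISM_KEYS = {
    "insert": "hoodie.insert.shuffle.parallelism",
    "upsert": "hoodie.upsert.shuffle.parallelism",
    "bulk_insert": "hoodie.bulkinsert.shuffle.parallelism",
}


def hudi_write_options(table_name, operation="upsert", shuffle_parallelism=None):
    """Build Hudi writer options (hypothetical helper, not part of Hudi).

    When shuffle_parallelism is None the key is omitted entirely, so the
    writer falls back to whatever default the Hudi version in use applies.
    """
    opts = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": operation,
    }
    if shuffle_parallelism is not None:
        opts[PARALLELISM_KEYS[operation]] = str(shuffle_parallelism)
    return opts


# Pin the old default explicitly (e.g. for small-partition inputs like scenario B).
pinned = hudi_write_options("bench_table", shuffle_parallelism=200)

# Leave it unset and rely on the input dataset's partitioning (scenario A case).
derived = hudi_write_options("bench_table")

# Usage with a Spark DataFrame `df` (not executed here):
# df.write.format("hudi").options(**pinned).mode("append").save(base_path)
```

Omitting the key rather than setting it to a sentinel value keeps the sketch independent of how a given Hudi version interprets special values.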
```
# Scenario: A
# Configs: w/ Defaults (200)
#
# REF:
https://p-2tmnlh3lk4kq6-shs.emrappui-prod.us-east-2.amazonaws.com/shs/history/application_1674681073325_0002/jobs/
==================================================
Total time taken by all rounds (hudi): 1996805
Per round: List(167506, 510834, 604900, 713565)
==================================================
# Scenario: A
# Configs: w/o Defaults
#
# REF:
https://p-2tmnlh3lk4kq6-shs.emrappui-prod.us-east-2.amazonaws.com/shs/history/application_1674681073325_0005/jobs/
==================================================
Total time taken by all rounds (hudi): 1611548
Per round: List(161764, 422203, 537490, 490091)
==================================================
# Scenario: B
# Configs: w/ Defaults (200)
#
# REF:
https://p-2tmnlh3lk4kq6-shs.emrappui-prod.us-east-2.amazonaws.com/shs/history/application_1674681073325_0003/jobs/
==================================================
Total time taken by all rounds (hudi): 1728415
Per round: List(125983, 454762, 542716, 604954)
==================================================
# Scenario: B
# Configs: w/o Defaults
#
# REF:
https://p-2tmnlh3lk4kq6-shs.emrappui-prod.us-east-2.amazonaws.com/shs/history/application_1674681073325_0004/jobs/
==================================================
Total time taken by all rounds (hudi): 1877736
Per round: List(159811, 479113, 597456, 641356)
==================================================
```
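
As a sanity check on the percentages quoted above, the relative change can be recomputed from the per-round timings in the output. This is only a sketch: the numbers are copied from the runs above, the time unit is not stated in the output, and the names `rounds` and `relative_change` are illustrative.

```python
# Per-round timings copied from the benchmark output above (unit not stated there).
rounds = {
    "A": {
        "default_200": [167506, 510834, 604900, 713565],
        "not_set": [161764, 422203, 537490, 490091],
    },
    "B": {
        "default_200": [125983, 454762, 542716, 604954],
        "not_set": [159811, 479113, 597456, 641356],
    },
}


def relative_change(baseline, candidate):
    """Percent change in total time; negative means the candidate was faster."""
    return (candidate - baseline) / baseline * 100.0


# Sum the rounds to get the totals, then compare "not set" against the default.
totals = {s: {cfg: sum(times) for cfg, times in cfgs.items()} for s, cfgs in rounds.items()}
changes = {s: relative_change(t["default_200"], t["not_set"]) for s, t in totals.items()}

for scenario, pct in changes.items():
    print(f"Scenario {scenario}: {pct:+.1f}% total time vs. parallelism=200")
```

The summed rounds match the reported totals, and the computed changes come out to about -19.3% for scenario A and +8.6% for scenario B.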