gauravkm commented on PR #3261:
URL: https://github.com/apache/celeborn/pull/3261#issuecomment-2947838219
> Have you tested this PR on a Celeborn cluster to check if there are any
> changes in performance and stability? Especially when there are many partition
> locations, the driver's memory can be stable.

Good question!
* Yes, we have been running thousands of applications at Stripe with a
similar implementation for the last 3 months
* Our default partition count is 5k, and we have apps with 100k-600k mappers
in a single stage (and multiple such stages) that have been running reliably
and performantly
* Internally at Stripe, we enable these checks by default, so every app
runs with them
* The additional memory footprint on the driver is independent of the number of
PartitionLocations and depends only on the number of partitions * number of
stages in the app
- Per-stage additional memory can be computed as number of partitions * 12
bytes (4 for crc32, 8 for byte count), so 5k partitions -> 60 KB
- Assuming even 100 such stages, the additional memory footprint for the
driver is 6 MB. We haven't observed any driver OOM issues with this
implementation
- Apart from this, if the app has skewed partitions, there is some additional
tracking of sub-partition information. Even assuming on the order of 1,000
sub-partitions and 100 additional bytes per sub-partition, we are looking at
about 100 KB of additional memory
* With respect to performance, our benchmarks show that for large apps shuffle
time dominates the task time anyway, and the overhead from these checks is
minimal. We didn't measure any meaningful increase in overall Spark app run
time due to these checks.
- If you look at the
[design](https://docs.google.com/document/d/1YqK0kua-5rMufJw57kEIrHHGbLnAF9iXM5GdDweMzzg/edit?tab=t.0#heading=h.n5ldma432qnd),
the first iteration was very different. We found performance issues with that
implementation in certain scenarios (specifically when the app has a lot of
empty partitions), and the v2/v3 design eliminated those concerns.
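
For anyone who wants to plug in their own numbers, the driver-side estimate above can be reproduced with a small sketch. The function and parameter names are made up for illustration and are not part of the PR; it counts only the raw field sizes quoted above (4 bytes crc32 + 8 bytes byte count per partition, plus an assumed 100 bytes per skewed sub-partition) and ignores any JVM object or collection overhead.

```python
# Back-of-the-envelope estimate of the extra driver memory used by the checks.
# All names here are hypothetical; only the per-entry sizes come from the comment.

CRC32_BYTES = 4        # one crc32 value per partition
BYTE_COUNT_BYTES = 8   # one byte counter per partition
PER_PARTITION_BYTES = CRC32_BYTES + BYTE_COUNT_BYTES  # 12 bytes


def estimate_driver_overhead_bytes(num_partitions, num_stages,
                                   skew_sub_partitions=0,
                                   bytes_per_sub_partition=100):
    """Return the estimated additional driver memory in bytes.

    The base cost scales with partitions * stages and does not depend on the
    number of PartitionLocations. Skew tracking is added as a flat
    sub-partition count times an assumed per-entry size.
    """
    base = num_partitions * PER_PARTITION_BYTES * num_stages
    skew = skew_sub_partitions * bytes_per_sub_partition
    return base + skew


if __name__ == "__main__":
    # One stage with the default 5k partitions.
    print(estimate_driver_overhead_bytes(5_000, 1))
    # 100 such stages.
    print(estimate_driver_overhead_bytes(5_000, 100))
    # 100 stages plus 1,000 skewed sub-partitions at 100 bytes each.
    print(estimate_driver_overhead_bytes(5_000, 100, skew_sub_partitions=1_000))
```

With the figures from this comment, the three calls correspond to the 60 KB per stage, the 6 MB for 100 stages, and the extra 100 KB for skew tracking.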