Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18388
So that is an issue. If users are running Spark 1.6 or Spark 2.1 on the
same cluster as the new one with this feature, you can't upgrade the shuffle
service until no one runs those versions anymore. We run multiple versions on a
cluster at the same time, and not everyone upgrades immediately. For instance,
when a new version like 2.2 comes out, it's a bit unstable initially, so
production jobs stay on older versions until the newer one stabilizes.
You need to make it backwards compatible or come up with a different
approach.
https://github.com/apache/spark/pull/18487 is the pull request for limiting
how much the reducer fetches at once. It still needs review. It hasn't been run
in production yet, and we haven't had issues with the NM crashing since we
changed openBlocks to be lazy, so we wouldn't know immediately how much it
helped. That approach definitely helps on the MapReduce/Tez side, but depending
on what is actually happening here it may or may not help.
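To make the idea concrete, here is a minimal, hypothetical sketch of capping how many fetches a reducer keeps in flight at once, using a plain `Semaphore`. This is illustrative only, not the code in that PR; all names (`BoundedFetcher`, `maxInFlight`) are made up:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: cap concurrent block fetches so a single reducer
// can't flood the shuffle service with hundreds of simultaneous requests.
public class BoundedFetcher {
    public static void main(String[] args) throws Exception {
        int maxInFlight = 3;
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(8);

        for (int i = 0; i < 20; i++) {
            permits.acquire();              // block until a fetch slot frees up
            pool.submit(() -> {
                int now = inFlight.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try { Thread.sleep(5); } catch (InterruptedException e) { }
                inFlight.decrementAndGet();
                permits.release();          // free the slot for the next fetch
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);

        System.out.println("peak in-flight fetches = " + peak.get());
        System.out.println("cap held: " + (peak.get() <= maxInFlight));
    }
}
```

The permit is acquired before a fetch is submitted and released only after it completes, so the number of concurrent fetches can never exceed the cap.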
The other, less nice approach is to just reject the connection (without
returning the failure message), but the client side wouldn't necessarily know
why, so you would have to make sure it still retried.
But I'm actually still wondering about the root cause here.
I'm wondering what is actually using the memory. You said it was the netty
chunks. Are you using SSL? I had thought that the netty calls we use go through
transferTo, which shouldn't pull the data into memory, unless of course you are
using SSL, which I don't think can use transferTo.
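For reference, this is the `FileChannel.transferTo` call in question. A tiny self-contained demo (the in-memory sink is just so it runs anywhere; the zero-copy win applies when the target is a socket, where the kernel can use sendfile):

```java
import java.io.ByteArrayOutputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TransferToDemo {
    public static void main(String[] args) throws Exception {
        Path src = Files.createTempFile("block", ".data");
        Files.write(src, "shuffle block bytes".getBytes());

        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             WritableByteChannel out = Channels.newChannel(sink)) {
            // For a socket target, transferTo lets the kernel copy file bytes
            // directly (sendfile-style), without staging them in JVM heap
            // buffers. SSL breaks this: bytes must be pulled into memory to
            // be encrypted, which is why SSL can't use the zero-copy path.
            long pos = 0, size = in.size();
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out);
            }
        }
        System.out.println(sink.toString());
        Files.delete(src);
    }
}
```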
Or are you seeing lots of chunks in memory from the same fetcher? I.e., if
you do an openBlocks of 500 blocks, is it opening all 500 file descriptors at
once? I didn't think we did this, but I want to double check. If we are doing
this, we should stop: only open a few and, when one finishes, open the next.
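The "only open a few at a time" idea can be sketched as a bounded window over the requested block IDs. This is a made-up illustration, not Spark's actual shuffle-service API; `BlockStream`, `openBlock`, and `serve` are all hypothetical names:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: serve an openBlocks request for many blocks while
// keeping at most `maxOpen` block streams (file descriptors) open at once.
public class LazyBlockOpener {
    interface BlockStream { void sendAndClose(); }

    static void serve(List<String> blockIds, int maxOpen,
                      Function<String, BlockStream> openBlock) {
        Iterator<String> ids = blockIds.iterator();
        Deque<BlockStream> inFlight = new ArrayDeque<>();
        while (ids.hasNext() || !inFlight.isEmpty()) {
            // Top up the window: never more than maxOpen streams open.
            while (inFlight.size() < maxOpen && ids.hasNext()) {
                inFlight.add(openBlock.apply(ids.next()));
            }
            // Finishing one stream frees its descriptor before the next opens.
            inFlight.poll().sendAndClose();
        }
    }

    public static void main(String[] args) {
        final int[] open = {0};
        final int[] peak = {0};
        serve(List.of("b1", "b2", "b3", "b4", "b5"), 2, id -> {
            open[0]++;                              // "file descriptor" opened
            peak[0] = Math.max(peak[0], open[0]);
            return () -> {
                open[0]--;                          // closed after sending
                System.out.println("sent " + id);
            };
        });
        System.out.println("peak open = " + peak[0]);
    }
}
```

With 5 blocks and a window of 2, all blocks get sent but no more than 2 are ever open simultaneously.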
Or is it just that you have hundreds of connections from different
fetchers, each with one chunk in memory?