Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/18388
  
    So that is an issue.  If users are running Spark 1.6 or Spark 2.1 on the 
same cluster as the new one with this feature, you can't upgrade the shuffle 
service until no one runs those versions anymore.  We run multiple versions on a 
cluster at the same time and not everyone immediately upgrades. For instance, 
when a new version like 2.2 comes out, it's a bit unstable initially, so 
production jobs stay on older versions until the newer one stabilizes.
    
    You need to make it backwards compatible or come up with a different 
approach.
    
    https://github.com/apache/spark/pull/18487 is the pull request for limiting 
how much the reducer fetches at once; it still needs review.  It hasn't been run 
in production yet, and since we changed openBlocks to be lazy we haven't had 
issues with the NM crashing, so we wouldn't know immediately how much it helped.  
That approach definitely helps on the MapReduce/Tez side, but depending on what 
is actually happening here it may or may not help.
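    
    For reference, a minimal sketch of the kind of throttling involved, using 
reducer-side configs that already exist today (the exact knob #18487 adds may 
differ, and the values here are purely illustrative):
    
    ```java
    // Sketch only: cap how much a reduce task pulls at once via existing configs.
    import org.apache.spark.SparkConf;
    
    public class FetchThrottleExample {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            // Total bytes of map output a reduce task fetches concurrently.
            .set("spark.reducer.maxSizeInFlight", "48m")
            // Number of fetch requests a reduce task keeps outstanding at once.
            .set("spark.reducer.maxReqsInFlight", "16");
        // ... build the SparkContext/SparkSession from this conf as usual.
      }
    }
    ```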
    
    The other approach, which is less nice, is to just reject the connection 
(without returning the failure message), but the client side wouldn't 
necessarily know why, so you would have to make sure it still retried.
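    
    To make that concrete, here is a rough, hypothetical Netty handler (not 
Spark code; the class name and limit are made up) showing what "reject without a 
message" looks like, and why the fetcher only sees a dropped connection and has 
to fall back on its normal retry logic (e.g. spark.shuffle.io.maxRetries):
    
    ```java
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;
    import java.util.concurrent.atomic.AtomicInteger;
    
    // Hypothetical sketch: drop new connections once the server is over a limit.
    public class ConnectionLimitHandler extends ChannelInboundHandlerAdapter {
      private static final AtomicInteger active = new AtomicInteger();
      private final int maxConnections;
    
      public ConnectionLimitHandler(int maxConnections) {
        this.maxConnections = maxConnections;
      }
    
      @Override
      public void channelActive(ChannelHandlerContext ctx) throws Exception {
        if (active.incrementAndGet() > maxConnections) {
          // Reject silently: the client only observes the channel closing.
          ctx.close();
        } else {
          super.channelActive(ctx);
        }
      }
    
      @Override
      public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        active.decrementAndGet();
        super.channelInactive(ctx);
      }
    }
    ```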
    
    But I'm actually still wondering about the root cause here.
    
    I'm wondering what is actually using the memory.  You said it was the netty 
chunks.  Are you using SSL?  I had thought the Netty calls we were using rely on 
transferTo, which shouldn't pull the data into memory; that is of course unless 
you are using SSL, which I don't think can use transferTo.
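    
    To spell out what I mean by transferTo vs. pulling chunk data into memory, 
here's a plain NIO illustration (not the actual shuffle-server code path; the 
class and method names are just for the example):
    
    ```java
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.channels.WritableByteChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    
    public class TransferExample {
      // Zero-copy path: the kernel moves file bytes to the socket, no JVM buffers.
      static void zeroCopy(Path file, WritableByteChannel socket) throws IOException {
        try (FileChannel in = FileChannel.open(file, StandardOpenOption.READ)) {
          long pos = 0, size = in.size();
          while (pos < size) {
            pos += in.transferTo(pos, size - pos, socket);
          }
        }
      }
    
      // Buffered path: each chunk is read into memory first, which is roughly
      // what shows up as "netty chunks" when zero-copy isn't available (e.g. SSL,
      // where the data must be encrypted in user space before hitting the socket).
      static void buffered(Path file, WritableByteChannel socket) throws IOException {
        try (FileChannel in = FileChannel.open(file, StandardOpenOption.READ)) {
          ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
          while (in.read(buf) != -1) {
            buf.flip();
            while (buf.hasRemaining()) {
              socket.write(buf);
            }
            buf.clear();
          }
        }
      }
    }
    ```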
    Or are you seeing lots of chunks in memory from the same fetcher?  I.e., if 
you do an openBlocks of 500 blocks, is it opening all 500 file descriptors at 
once?  I didn't think we did this but want to double check.  If we are doing 
this, we should instead only open a few and, when one finishes, open the next 
(a rough sketch of what I mean is below).
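    
    A hypothetical sketch of that lazy-open behaviour (nothing here is the 
actual shuffle-service code; the class is made up for illustration):
    
    ```java
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.InputStream;
    import java.util.Iterator;
    import java.util.List;
    
    // Given 500 requested blocks, only open a file descriptor when the block is
    // actually being served, instead of opening all 500 up front.
    public class LazyBlockStreams implements Iterator<InputStream> {
      private final Iterator<File> blockFiles;
    
      public LazyBlockStreams(List<File> blockFiles) {
        this.blockFiles = blockFiles.iterator();
      }
    
      @Override
      public boolean hasNext() {
        return blockFiles.hasNext();
      }
    
      @Override
      public InputStream next() {
        try {
          // The FD is opened only here, when this block is needed; the caller
          // closes it before asking for the next one.
          return new FileInputStream(blockFiles.next());
        } catch (FileNotFoundException e) {
          throw new RuntimeException(e);
        }
      }
    }
    ```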
    Or is it just that you have hundreds of connections from different fetchers 
and each one has one chunk in memory?
    


