Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/14162
  
    > To be fair if you have a bug in another part of the shuffle service that 
is not in the startup path, it still could take out your whole cluster. That 
can't be fixed until the NM runs aux services in separate processes.
    
    I'm not sure what you mean by this. Normally a bug in the shuffle service 
only affects Spark, since Spark is the only thing trying to access it. The 
other routines called by the NM, like initializeApplication and 
stopApplication, all catch exceptions as well. Looking at the NM code, though, 
that doesn't matter, because the NM catches them itself and just logs them. 
Obviously if the bug is bad enough to cause a segfault or a memory leak it can 
still take the NM out, but a normal exception while processing a request from 
a Spark application shouldn't.
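
    As a rough illustration of the isolation described above, here is a 
simplified sketch of a dispatcher that catches and logs whatever an auxiliary 
service callback throws. It is not the actual NodeManager code: the class 
AuxServiceDispatchSketch, the AuxService interface, and the in-memory log are 
all hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class AuxServiceDispatchSketch {
    /** Hypothetical stand-in for an auxiliary service callback. */
    public interface AuxService {
        void initializeApplication(String appId) throws Exception;
    }

    private final List<AuxService> services = new ArrayList<>();
    // Stand-in for a real logger so the example has no dependencies.
    private final List<String> log = new ArrayList<>();

    public void register(AuxService service) {
        services.add(service);
    }

    /** Calls every service; a failure in one is logged and does not stop the rest. */
    public void dispatchInit(String appId) {
        for (AuxService service : services) {
            try {
                service.initializeApplication(appId);
            } catch (Exception e) {
                // Only ordinary exceptions are contained here. A segfault or a
                // memory leak in the service would still affect the host process.
                log.add("aux service failed for " + appId + ": " + e.getMessage());
            }
        }
    }

    public List<String> getLog() {
        return log;
    }
}
```

    With this shape, one service throwing from initializeApplication leaves 
the dispatcher and the remaining services running, which matches the "catches 
and just logs" behavior described for normal exceptions.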
    
    Actually, now that you point it out, the try/catch that you removed should 
really be around the rest of the code in the init function as well. 
    
    But yes, I would like to see a config, because we will run it with the 
current behavior. If you want to run it without the catch, then we need a 
config so that it can run in both modes.
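
    A minimal sketch of what running in both modes could look like, assuming 
a boolean setting that defaults to the current catch-and-log behavior. The key 
name example.shuffle.stopOnFailure, the ShuffleInitSketch class, and the 
InitStep interface are hypothetical and are not taken from the PR; a plain Map 
stands in for the real configuration object.

```java
import java.util.Map;

public class ShuffleInitSketch {
    // Hypothetical config key, for illustration only.
    static final String STOP_ON_FAILURE_KEY = "example.shuffle.stopOnFailure";

    /** Hypothetical stand-in for the body of the init function. */
    public interface InitStep {
        void run() throws Exception;
    }

    private boolean started = false;

    public void serviceInit(Map<String, String> conf, InitStep initBody) throws Exception {
        // Default is false: keep the current behavior of catching and logging.
        boolean stopOnFailure =
            Boolean.parseBoolean(conf.getOrDefault(STOP_ON_FAILURE_KEY, "false"));
        try {
            // The try covers the whole init body, not just one part of it.
            initBody.run();
            started = true;
        } catch (Exception e) {
            if (stopOnFailure) {
                // New mode: let the failure propagate to the caller.
                throw e;
            }
            // Current mode: log and carry on without the service.
            System.err.println("Shuffle service init failed, continuing: " + e.getMessage());
        }
    }

    public boolean isStarted() {
        return started;
    }
}
```

    With an empty map, a failing init body is logged and serviceInit returns 
normally; with the key set to "true", the same failure is rethrown to the 
caller.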


