Hi all,

In HDInsight, we (Microsoft) use Livy as the Spark job submission service. We keep seeing customers run into problems when they submit many concurrent applications to the system, or when they recover Livy from a state with many concurrent applications.

By looking at the code and the customers' exception stacks, we narrowed the problem down to the application monitoring module, where a new thread is created for each application. To resolve the issue, we propose an actor-based design for the application monitoring module and share it here (as the new JIRA does not seem to be working yet):

https://docs.google.com/document/d/1yDl5_3wPuzyGyFmSOzxRp6P-nbTQTdDFXl2XQhXDiwA/edit?usp=sharing

We would be glad to hear feedback from the community and improve the design before we start implementing it!

Best,
Nan
