Here are my two pennies on both designs (actor-based design vs. single-thread polling design):
*Single-thread polling design*

We implemented a single-thread polling mechanism for YARN here at PayPal. Our solution is more involved because we added many new features to Livy that we had to consider when we refactored Livy's YARN interface. But we are willing to hammer our changes into shape so they suit the needs of the Livy community best :-)

*Actor-based design*

It seems to me that the proposed actor-based design
(https://docs.google.com/document/d/1yDl5_3wPuzyGyFmSOzxRp6P-nbTQTdDFXl2XQhXDiwA/edit)
needs a few more messages and actors. Here is why. Livy makes three (blocking) calls to YARN:

1. `yarnClient.getApplications`, which gives Livy the `ApplicationId`s
2. `yarnClient.getApplicationAttemptReport(ApplicationAttemptId)`, which gives Livy the AM container ID (via `getAMContainerId`)
3. `yarnClient.getContainerReport`, which gives Livy the tracking URLs

The result of each call is needed to make the next one, so the actor system has to be designed to handle this whole chain of blocking calls, not just a single poll (see the first sketch at the end of this message).

I do agree that an actor-based design is cleaner and more maintainable. But we had to discard it because it adds more dependencies to Livy. We faced too many dependency-version-mismatch problems with Livy interactive sessions (when applications depend on a different version of a library that Livy uses internally). If the Livy community prefers an actor-based design, we are willing to reimplement our changes with an actor system.

Finally, either design is only the first step in fixing this particular scalability problem, because the *"yarnAppMonitorThread" is not the only thread that Livy spawns per Spark application.* For batch jobs, Livy

1. calls spark-submit, which launches a new JVM (an operation that is far heavier than creating a thread and can easily drain the system), and
2. creates a thread that waits for the exit code of spark-submit. Even though this thread is short-lived, at peak times thousands of such threads are created within a few seconds (see the second sketch at the end of this message).

I created a PR with our modifications: https://github.com/apache/incubator-livy/pull/36

Thanks,
Meisam

On Wed, Aug 16, 2017 at 10:42 AM Marcelo Vanzin <[email protected]> wrote:

> Hello,
>
> On Wed, Aug 16, 2017 at 10:35 AM, Arijit Tarafdar
> <[email protected]> wrote:
> > 1. Additional copy of states in Livy which can be queried from YARN on
> request.
>
> Not sure I follow.
>
> > 2. The design is not event driven and may waste querying YARN
> unnecessarily when no actual user/external request is pending.
>
> You don't need to keep querying YARN if there are no apps to monitor.
>
> > 3. There will always be an issue with stale data and update latency
> between actual YARN state and Livy state map.
>
> That is also the case with a thread pool that has less threads than
> the number of apps being monitored, if making one request per app.
>
> > 4. Size and latency of the response in bulk querying YARN is unknown.
>
> That is also unknown when making multiple requests from multiple
> threads, unless you investigate the internal implementation of both
> YARN clients and servers.
>
> > 5. YARN bulk API needs to support filtering at the query level.
>
> Yes, I mentioned that in my original response and really was just
> expecting Nan to do a quick investigation of that implementation
> option.
>
> He finally did and it seems that the API only exists through the REST
> interface, so this all is moot.
>
> --
> Marcelo
>
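
P.S. For readers who don't have the YARN client API in their head, below is a minimal, self-contained sketch of the three-step call chain described above. It is not our patch and not Livy's actual code; the object name and the surrounding loop are made up for illustration. The point is that each step is a blocking RPC that needs the result of the previous step, so an actor design has to model the whole chain rather than a single "poll YARN" message.

```scala
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

import scala.collection.JavaConverters._

object YarnCallChainSketch {
  def main(args: Array[String]): Unit = {
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // 1. Blocking RPC: list the applications known to the ResourceManager.
    val apps = yarnClient.getApplications().asScala

    for (app <- apps if app.getCurrentApplicationAttemptId != null) {
      // 2. Blocking RPC: the attempt report is needed to find the AM container.
      val attemptReport =
        yarnClient.getApplicationAttemptReport(app.getCurrentApplicationAttemptId)
      val amContainerId = attemptReport.getAMContainerId

      // 3. Blocking RPC: the container report carries the AM container's URLs.
      val containerReport = yarnClient.getContainerReport(amContainerId)
      println(s"${app.getApplicationId} -> ${containerReport.getLogUrl}")
    }

    yarnClient.stop()
  }
}
```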
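
And here is an equally rough sketch of what happens per batch job: a fork of a whole spark-submit JVM plus a thread that blocks on its exit code. Again, this is not Livy's actual code (Livy goes through its own process-builder abstraction); the command line, class name, and jar path are placeholders. It only illustrates why thousands of submissions in a short window are expensive even before any YARN polling starts.

```scala
object SparkSubmitWaitSketch {
  def main(args: Array[String]): Unit = {
    // Launching spark-submit forks a new JVM for every batch job,
    // which is far heavier than creating a thread.
    val builder = new ProcessBuilder(
      "spark-submit",
      "--class", "com.example.MyJob",  // placeholder application class
      "/path/to/my-job.jar")           // placeholder application jar
    builder.redirectErrorStream(true)
    val process = builder.start()

    // A per-application thread sits in waitFor() until spark-submit exits.
    // Each thread is short-lived, but at peak time thousands of them
    // can be alive at once.
    val waiter = new Thread(new Runnable {
      override def run(): Unit = {
        val exitCode = process.waitFor()
        println(s"spark-submit exited with code $exitCode")
      }
    }, "spark-submit-exit-code-waiter")
    waiter.start()
    waiter.join()
  }
}
```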
