Thanks for the reply, Meisam. Looks good to meet our current scenarios.

The REST APIs look more powerful, but using them means replacing the current
YarnClient with a self-made RestClient.
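Roughly, what I mean by a self-made RestClient is something that hits the
ResourceManager's "Cluster Applications" endpoint with query-level filters.
A minimal sketch only (the RM address is a placeholder, and I am assuming
Spark apps carry the SPARK application type):

    import scala.io.Source

    object RestClientSketch {
      // Placeholder RM web address; in Livy this would come from the Hadoop config.
      val rmAddress = "http://resourcemanager:8088"

      // Fetch raw JSON for SPARK applications that finished after `finishedSince`
      // (epoch millis), using the filters of the RM "Cluster Applications" REST API.
      def recentSparkApps(finishedSince: Long): String = {
        val url = s"$rmAddress/ws/v1/cluster/apps" +
          s"?applicationTypes=SPARK&finishedTimeBegin=$finishedSince"
        val src = Source.fromURL(url)
        try src.mkString finally src.close()
      }
    }

The finishedTimeBegin filter would also cover the "recently finished" case
that the 2.7 YarnClient does not seem to expose.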
On Wed, Aug 16, 2017 at 2:23 PM, Meisam Fathi <[email protected]> wrote:

> Hi Nan,
>
> > In the highlighted line
> >
> > https://github.com/apache/incubator-livy/pull/36/files#diff-a3f879755cfe10a678cc08ddbe60a4d3R75
> >
> > I assume that it will get the reports of all applications in YARN, even
> > if they are finished?
>
> That's right. That line will return reports for all Spark applications,
> even applications that completed a long time ago. For us, YARN retains
> reports for a few thousand completed applications (not a big concern).
>
> Livy needs to get the reports for applications that finished recently, but
> I didn't find an API in YARN 2.7 to get only those reports.
>
> Thanks,
> Meisam
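(Side note: if I read the Hadoop 2.7 YarnClient API right, the bulk call can
at least be narrowed by application type and state on the RM side, just not
by finish time. A rough sketch, assuming Spark apps are tagged with the
SPARK application type:

    import java.util.EnumSet
    import scala.collection.JavaConverters._
    import org.apache.hadoop.yarn.api.records.YarnApplicationState
    import org.apache.hadoop.yarn.client.api.YarnClient
    import org.apache.hadoop.yarn.conf.YarnConfiguration

    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // Ask the RM only for SPARK applications that have not finished yet.
    val liveStates = EnumSet.of(
      YarnApplicationState.SUBMITTED,
      YarnApplicationState.ACCEPTED,
      YarnApplicationState.RUNNING)
    val reports = yarnClient.getApplications(Set("SPARK").asJava, liveStates).asScala

That still misses the recently finished applications mentioned above, which
seems to be exactly where the REST filters would help.)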
>
> > On Wed, Aug 16, 2017 at 12:25 PM, Meisam Fathi <[email protected]>
> > wrote:
> >
> > > Hi Nan,
> > >
> > > > my question related to the ongoing discussion is simply "have you
> > > > seen any performance issue in
> > > > https://github.com/apache/incubator-livy/pull/36/files#diff-a3f879755cfe10a678cc08ddbe60a4d3R75
> > > > ?"
> > >
> > > The short answer is yes. This PR fixes one part of the scalability
> > > problem, which is, it prevents Livy from creating many
> > > yarnAppMonitorThreads. But the two other parts are still there:
> > >
> > > 1. one call to spark-submit for each application
> > > 2. one thread that waits for the exit code of spark-submit.
> > >
> > > Out of these two problems, calling one spark-submit per application is
> > > the biggest problem, but it can be solved by adding more Livy servers.
> > > We modified Livy so that if an application status changes on one Livy
> > > instance, all other Livy instances get the updated information about
> > > the application. From the users' perspective, this is transparent
> > > because users just see the load balancer.
> > >
> > > So, refactoring the YARN poll mechanism + a load balancer and a grid of
> > > Livy servers fixed the scalability issue.
> > >
> > > On the performance of the code itself, we have not had an issue. The
> > > time-consuming parts in the code are the calls to YARN, not filtering
> > > and updating the data structures. On memory usage, this all needs less
> > > than 1 GB at peak time.
> > >
> > > I hope this answers your question.
> > >
> > > Thanks,
> > > Meisam
> > >
> > > > We have several scenarios where a large volume of applications is
> > > > submitted to YARN every day, and it easily accumulates a lot to be
> > > > fetched with this call.
> > > >
> > > > Best,
> > > >
> > > > Nan
> > > >
> > > > On Wed, Aug 16, 2017 at 11:16 AM, Meisam Fathi <[email protected]>
> > > > wrote:
> > > >
> > > > > Here are my two pennies on both designs (actor-based design vs.
> > > > > single-thread polling design).
> > > > >
> > > > > *Single-thread polling design*
> > > > > We implemented a single-thread polling mechanism for YARN here at
> > > > > PayPal. Our solution is more involved because we added many new
> > > > > features to Livy that we had to consider when we refactored Livy's
> > > > > YARN interface. But we are willing to hammer our changes so they
> > > > > suit the needs of the Livy community best :-)
> > > > >
> > > > > *Actor-based design*
> > > > > It seems to me that the proposed actor-based design (
> > > > > https://docs.google.com/document/d/1yDl5_3wPuzyGyFmSOzxRp6P-nbTQTdDFXl2XQhXDiwA/edit)
> > > > > needs a few more messages and actors. Here is why.
> > > > > Livy makes three (blocking) calls to YARN:
> > > > > 1. `yarnClient.getApplications`, which gives Livy `ApplicationId`s
> > > > > 2. `yarnClient.getApplicationAttemptReport(ApplicationId)`, which
> > > > > gives Livy `getAMContainerId`
> > > > > 3. `yarnClient.getContainerReport`, which gives Livy tracking URLs
> > > > >
> > > > > The result of the previous call is needed to make the next call.
> > > > > The proposed actor system needs to be designed to handle all these
> > > > > blocking calls.
> > > > >
> > > > > I do agree that an actor-based design is cleaner and more
> > > > > maintainable. But we had to discard it because it adds more
> > > > > dependencies to Livy. We faced too many dependency-version-mismatch
> > > > > problems with Livy interactive sessions (when applications depend
> > > > > on a different version of a library that is used internally by
> > > > > Livy). If the Livy community prefers an actor-based design, we are
> > > > > willing to reimplement our changes with an actor system.
> > > > >
> > > > > Finally, either design is only the first step in fixing this
> > > > > particular scalability problem. The reason is that the
> > > > > *"yarnAppMonitorThread" is not the only thread that Livy spawns per
> > > > > Spark application.* For batch jobs, Livy
> > > > > 1. calls spark-submit, which launches a new JVM (an operation that
> > > > > is far more heavyweight than creating a thread and can easily drain
> > > > > the system)
> > > > > 2. creates a thread that waits for the exit code of spark-submit.
> > > > > Even though this thread is "short-lived", at peak time thousands of
> > > > > such threads are created in a few seconds.
> > > > >
> > > > > I created a PR with our modifications to it:
> > > > >
> > > > > https://github.com/apache/incubator-livy/pull/36
> > > > >
> > > > > Thanks,
> > > > > Meisam
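(Again just for the archive, my reading of that three-step chain against the
Hadoop 2.7 YarnClient, as a sketch only, with no error handling; note that
getCurrentApplicationAttemptId can be null for apps that have not started an
attempt yet:

    import scala.collection.JavaConverters._
    import org.apache.hadoop.yarn.client.api.YarnClient
    import org.apache.hadoop.yarn.conf.YarnConfiguration

    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // 1. One bulk call that returns a report per application.
    for (app <- yarnClient.getApplications().asScala) {
      // 2. The attempt report for the current attempt carries the AM container id.
      val attempt = yarnClient.getApplicationAttemptReport(app.getCurrentApplicationAttemptId)
      // 3. The container report for the AM container carries the log/tracking URLs.
      val container = yarnClient.getContainerReport(attempt.getAMContainerId)
      println(s"${app.getApplicationId} -> ${container.getLogUrl}")
    }

    yarnClient.stop()

Each step blocks on the result of the previous one, which is the dependency
the actor design would have to model.)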
> > > > > On Wed, Aug 16, 2017 at 10:42 AM Marcelo Vanzin <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > On Wed, Aug 16, 2017 at 10:35 AM, Arijit Tarafdar
> > > > > > <[email protected]> wrote:
> > > > > > > 1. Additional copy of states in Livy which can be queried from
> > > > > > > YARN on request.
> > > > > >
> > > > > > Not sure I follow.
> > > > > >
> > > > > > > 2. The design is not event driven and may waste querying YARN
> > > > > > > unnecessarily when no actual user/external request is pending.
> > > > > >
> > > > > > You don't need to keep querying YARN if there are no apps to
> > > > > > monitor.
> > > > > >
> > > > > > > 3. There will always be an issue with stale data and update
> > > > > > > latency between the actual YARN state and the Livy state map.
> > > > > >
> > > > > > That is also the case with a thread pool that has fewer threads
> > > > > > than the number of apps being monitored, if making one request
> > > > > > per app.
> > > > > >
> > > > > > > 4. Size and latency of the response in bulk querying YARN is
> > > > > > > unknown.
> > > > > >
> > > > > > That is also unknown when making multiple requests from multiple
> > > > > > threads, unless you investigate the internal implementation of
> > > > > > both the YARN clients and servers.
> > > > > >
> > > > > > > 5. YARN bulk API needs to support filtering at the query level.
> > > > > >
> > > > > > Yes, I mentioned that in my original response and really was just
> > > > > > expecting Nan to do a quick investigation of that implementation
> > > > > > option.
> > > > > >
> > > > > > He finally did, and it seems that the API only exists through the
> > > > > > REST interface, so this is all moot.
> > > > > >
> > > > > > --
> > > > > > Marcelo
