Thanks for the reply, Meisam. Looks good to meet our current scenarios.

The REST APIs look more powerful, but using them means replacing the current
YarnClient with a self-made RestClient.
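Roughly, what I mean by a self-made RestClient is something that hits the
ResourceManager's "Cluster Applications" endpoint with query-level filters.
A minimal sketch only (the RM address is a placeholder, and I am assuming
Spark apps carry the SPARK application type):

    import scala.io.Source

    object RestClientSketch {
      // Placeholder RM web address; in Livy this would come from the Hadoop config.
      val rmAddress = "http://resourcemanager:8088"

      // Fetch raw JSON for SPARK applications that finished after `finishedSince`
      // (epoch millis), using the filters of the RM "Cluster Applications" REST API.
      def recentSparkApps(finishedSince: Long): String = {
        val url = s"$rmAddress/ws/v1/cluster/apps" +
          s"?applicationTypes=SPARK&finishedTimeBegin=$finishedSince"
        val src = Source.fromURL(url)
        try src.mkString finally src.close()
      }
    }

The finishedTimeBegin filter would also cover the "recently finished" case
that the 2.7 YarnClient does not seem to expose.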
On Wed, Aug 16, 2017 at 2:23 PM, Meisam Fathi <[email protected]> wrote:

> Hi Nan,
>
> > In the highlighted line
> >
> > https://github.com/apache/incubator-livy/pull/36/files#diff-a3f879755cfe10a678cc08ddbe60a4d3R75
> >
> > I assume that it will get the reports of all applications in YARN, even
> > if they are finished?
>
> That's right. That line will return reports for all Spark applications,
> even applications that completed a long time ago. For us, YARN retains
> reports for a few thousand completed applications (not a big concern).
>
> Livy needs to get the reports for applications that finished recently, but
> I didn't find an API in YARN 2.7 to get only those reports.
>
> Thanks,
> Meisam
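(Side note: if I read the Hadoop 2.7 YarnClient API right, the bulk call can
at least be narrowed by application type and state on the RM side, just not
by finish time. A rough sketch, assuming Spark apps are tagged with the
SPARK application type:

    import java.util.EnumSet
    import scala.collection.JavaConverters._
    import org.apache.hadoop.yarn.api.records.YarnApplicationState
    import org.apache.hadoop.yarn.client.api.YarnClient
    import org.apache.hadoop.yarn.conf.YarnConfiguration

    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // Ask the RM only for SPARK applications that have not finished yet.
    val liveStates = EnumSet.of(
      YarnApplicationState.SUBMITTED,
      YarnApplicationState.ACCEPTED,
      YarnApplicationState.RUNNING)
    val reports = yarnClient.getApplications(Set("SPARK").asJava, liveStates).asScala

That still misses the recently finished applications mentioned above, which
seems to be exactly where the REST filters would help.)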
>
> > On Wed, Aug 16, 2017 at 12:25 PM, Meisam Fathi <[email protected]>
> > wrote:
> >
> > > Hi Nan,
> > >
> > > > my question related to the ongoing discussion is simply "have you
> > > > seen any performance issue in
> > > > https://github.com/apache/incubator-livy/pull/36/files#diff-a3f879755cfe10a678cc08ddbe60a4d3R75
> > > > ?"
> > >
> > > The short answer is yes. This PR fixes one part of the scalability
> > > problem, which is, it prevents Livy from creating many
> > > yarnAppMonitorThreads. But the two other parts are still there:
> > >
> > > 1. one call to spark-submit for each application
> > > 2. one thread that waits for the exit code of spark-submit.
> > >
> > > Out of these two problems, calling one spark-submit per application is
> > > the biggest problem, but it can be solved by adding more Livy servers.
> > > We modified Livy so that if an application status changes on one Livy
> > > instance, all other Livy instances get the updated information about
> > > the application. From the users' perspective, this is transparent
> > > because users just see the load balancer.
> > >
> > > So, refactoring the YARN poll mechanism + a load balancer and a grid of
> > > Livy servers fixed the scalability issue.
> > >
> > > On the performance of the code itself, we have not had an issue. The
> > > time-consuming parts in the code are the calls to YARN, not filtering
> > > and updating the data structures. On memory usage, this all needs less
> > > than 1 GB at peak time.
> > >
> > > I hope this answers your question.
> > >
> > > Thanks,
> > > Meisam
> > >
> > > > We have several scenarios where a large volume of applications is
> > > > submitted to YARN every day, and it easily accumulates a lot to be
> > > > fetched with this call.
> > > >
> > > > Best,
> > > >
> > > > Nan
> > > >
> > > > On Wed, Aug 16, 2017 at 11:16 AM, Meisam Fathi <[email protected]>
> > > > wrote:
> > > >
> > > > > Here are my two pennies on both designs (actor-based design vs.
> > > > > single-thread polling design).
> > > > >
> > > > > *Single-thread polling design*
> > > > > We implemented a single-thread polling mechanism for YARN here at
> > > > > PayPal. Our solution is more involved because we added many new
> > > > > features to Livy that we had to consider when we refactored Livy's
> > > > > YARN interface. But we are willing to hammer our changes so they
> > > > > suit the needs of the Livy community best :-)
> > > > >
> > > > > *Actor-based design*
> > > > > It seems to me that the proposed actor-based design (
> > > > > https://docs.google.com/document/d/1yDl5_3wPuzyGyFmSOzxRp6P-nbTQTdDFXl2XQhXDiwA/edit)
> > > > > needs a few more messages and actors. Here is why.
> > > > > Livy makes three (blocking) calls to YARN:
> > > > > 1. `yarnClient.getApplications`, which gives Livy `ApplicationId`s
> > > > > 2. `yarnClient.getApplicationAttemptReport(ApplicationId)`, which
> > > > > gives Livy `getAMContainerId`
> > > > > 3. `yarnClient.getContainerReport`, which gives Livy tracking URLs
> > > > >
> > > > > The result of the previous call is needed to make the next call.
> > > > > The proposed actor system needs to be designed to handle all these
> > > > > blocking calls.
> > > > >
> > > > > I do agree that an actor-based design is cleaner and more
> > > > > maintainable. But we had to discard it because it adds more
> > > > > dependencies to Livy. We faced too many dependency-version-mismatch
> > > > > problems with Livy interactive sessions (when applications depend
> > > > > on a different version of a library that is used internally by
> > > > > Livy). If the Livy community prefers an actor-based design, we are
> > > > > willing to reimplement our changes with an actor system.
> > > > >
> > > > > Finally, either design is only the first step in fixing this
> > > > > particular scalability problem. The reason is that the
> > > > > *"yarnAppMonitorThread" is not the only thread that Livy spawns per
> > > > > Spark application.* For batch jobs, Livy
> > > > > 1. calls spark-submit, which launches a new JVM (an operation that
> > > > > is far more heavyweight than creating a thread and can easily drain
> > > > > the system)
> > > > > 2. creates a thread that waits for the exit code of spark-submit.
> > > > > Even though this thread is "short-lived", at peak time thousands of
> > > > > such threads are created in a few seconds.
> > > > >
> > > > > I created a PR with our modifications to it:
> > > > >
> > > > > https://github.com/apache/incubator-livy/pull/36
> > > > >
> > > > > Thanks,
> > > > > Meisam
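(Again just for the archive, my reading of that three-step chain against the
Hadoop 2.7 YarnClient, as a sketch only, with no error handling; note that
getCurrentApplicationAttemptId can be null for apps that have not started an
attempt yet:

    import scala.collection.JavaConverters._
    import org.apache.hadoop.yarn.client.api.YarnClient
    import org.apache.hadoop.yarn.conf.YarnConfiguration

    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // 1. One bulk call that returns a report per application.
    for (app <- yarnClient.getApplications().asScala) {
      // 2. The attempt report for the current attempt carries the AM container id.
      val attempt = yarnClient.getApplicationAttemptReport(app.getCurrentApplicationAttemptId)
      // 3. The container report for the AM container carries the log/tracking URLs.
      val container = yarnClient.getContainerReport(attempt.getAMContainerId)
      println(s"${app.getApplicationId} -> ${container.getLogUrl}")
    }

    yarnClient.stop()

Each step blocks on the result of the previous one, which is the dependency
the actor design would have to model.)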
> > > > > On Wed, Aug 16, 2017 at 10:42 AM Marcelo Vanzin <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > On Wed, Aug 16, 2017 at 10:35 AM, Arijit Tarafdar
> > > > > > <[email protected]> wrote:
> > > > > > > 1. Additional copy of states in Livy which can be queried from
> > > > > > > YARN on request.
> > > > > >
> > > > > > Not sure I follow.
> > > > > >
> > > > > > > 2. The design is not event driven and may waste querying YARN
> > > > > > > unnecessarily when no actual user/external request is pending.
> > > > > >
> > > > > > You don't need to keep querying YARN if there are no apps to
> > > > > > monitor.
> > > > > >
> > > > > > > 3. There will always be an issue with stale data and update
> > > > > > > latency between the actual YARN state and the Livy state map.
> > > > > >
> > > > > > That is also the case with a thread pool that has fewer threads
> > > > > > than the number of apps being monitored, if making one request
> > > > > > per app.
> > > > > >
> > > > > > > 4. Size and latency of the response in bulk querying YARN is
> > > > > > > unknown.
> > > > > >
> > > > > > That is also unknown when making multiple requests from multiple
> > > > > > threads, unless you investigate the internal implementation of
> > > > > > both the YARN clients and servers.
> > > > > >
> > > > > > > 5. YARN bulk API needs to support filtering at the query level.
> > > > > >
> > > > > > Yes, I mentioned that in my original response and really was just
> > > > > > expecting Nan to do a quick investigation of that implementation
> > > > > > option.
> > > > > >
> > > > > > He finally did, and it seems that the API only exists through the
> > > > > > REST interface, so this is all moot.
> > > > > >
> > > > > > --
> > > > > > Marcelo
