Here are my two pennies on both designs (actor-based design vs. single-thread polling design):
*Single-thread polling design*

We implemented a single-thread polling mechanism for YARN here at PayPal. Our solution is more involved because we added many new features to Livy that we had to consider when we refactored Livy's YARN interface. But we are willing to hammer our changes into shape so they suit the needs of the Livy community best :-)

*Actor-based design*

It seems to me that the proposed actor-based design
(https://docs.google.com/document/d/1yDl5_3wPuzyGyFmSOzxRp6P-nbTQTdDFXl2XQhXDiwA/edit)
needs a few more messages and actors. Here is why. Livy makes three (blocking) calls to YARN:

1. `yarnClient.getApplications`, which gives Livy the `ApplicationId`s
2. `yarnClient.getApplicationAttemptReport(ApplicationAttemptId)`, which gives Livy the AM container ID (via `getAMContainerId`)
3. `yarnClient.getContainerReport`, which gives Livy the tracking URLs

The result of each call is needed to make the next one, so the actor system has to be designed to handle this whole chain of blocking calls, not just a single poll (see the first sketch at the end of this message).

I do agree that an actor-based design is cleaner and more maintainable. But we had to discard it because it adds more dependencies to Livy. We faced too many dependency-version-mismatch problems with Livy interactive sessions (when applications depend on a different version of a library that Livy uses internally). If the Livy community prefers an actor-based design, we are willing to reimplement our changes with an actor system.

Finally, either design is only the first step in fixing this particular scalability problem, because the *"yarnAppMonitorThread" is not the only thread that Livy spawns per Spark application.* For batch jobs, Livy

1. calls spark-submit, which launches a new JVM (an operation that is far heavier than creating a thread and can easily drain the system), and
2. creates a thread that waits for the exit code of spark-submit. Even though this thread is short-lived, at peak times thousands of such threads are created within a few seconds (see the second sketch at the end of this message).

I created a PR with our modifications: https://github.com/apache/incubator-livy/pull/36

Thanks,
Meisam

On Wed, Aug 16, 2017 at 10:42 AM Marcelo Vanzin <[email protected]> wrote:

> Hello,
>
> On Wed, Aug 16, 2017 at 10:35 AM, Arijit Tarafdar
> <[email protected]> wrote:
> > 1. Additional copy of states in Livy which can be queried from YARN on
> request.
>
> Not sure I follow.
>
> > 2. The design is not event driven and may waste querying YARN
> unnecessarily when no actual user/external request is pending.
>
> You don't need to keep querying YARN if there are no apps to monitor.
>
> > 3. There will always be an issue with stale data and update latency
> between actual YARN state and Livy state map.
>
> That is also the case with a thread pool that has less threads than
> the number of apps being monitored, if making one request per app.
>
> > 4. Size and latency of the response in bulk querying YARN is unknown.
>
> That is also unknown when making multiple requests from multiple
> threads, unless you investigate the internal implementation of both
> YARN clients and servers.
>
> > 5. YARN bulk API needs to support filtering at the query level.
>
> Yes, I mentioned that in my original response and really was just
> expecting Nan to do a quick investigation of that implementation
> option.
>
> He finally did and it seems that the API only exists through the REST
> interface, so this all is moot.
>
> --
> Marcelo
>
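
P.S. For readers who don't have the YARN client API in their head, below is a minimal, self-contained sketch of the three-step call chain described above. It is not our patch and not Livy's actual code; the object name and the surrounding loop are made up for illustration. The point is that each step is a blocking RPC that needs the result of the previous step, so an actor design has to model the whole chain rather than a single "poll YARN" message.

```scala
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

import scala.collection.JavaConverters._

object YarnCallChainSketch {
  def main(args: Array[String]): Unit = {
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // 1. Blocking RPC: list the applications known to the ResourceManager.
    val apps = yarnClient.getApplications().asScala

    for (app <- apps if app.getCurrentApplicationAttemptId != null) {
      // 2. Blocking RPC: the attempt report is needed to find the AM container.
      val attemptReport =
        yarnClient.getApplicationAttemptReport(app.getCurrentApplicationAttemptId)
      val amContainerId = attemptReport.getAMContainerId

      // 3. Blocking RPC: the container report carries the AM container's URLs.
      val containerReport = yarnClient.getContainerReport(amContainerId)
      println(s"${app.getApplicationId} -> ${containerReport.getLogUrl}")
    }

    yarnClient.stop()
  }
}
```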
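
And here is an equally rough sketch of what happens per batch job: a fork of a whole spark-submit JVM plus a thread that blocks on its exit code. Again, this is not Livy's actual code (Livy goes through its own process-builder abstraction); the command line, class name, and jar path are placeholders. It only illustrates why thousands of submissions in a short window are expensive even before any YARN polling starts.

```scala
object SparkSubmitWaitSketch {
  def main(args: Array[String]): Unit = {
    // Launching spark-submit forks a new JVM for every batch job,
    // which is far heavier than creating a thread.
    val builder = new ProcessBuilder(
      "spark-submit",
      "--class", "com.example.MyJob",  // placeholder application class
      "/path/to/my-job.jar")           // placeholder application jar
    builder.redirectErrorStream(true)
    val process = builder.start()

    // A per-application thread sits in waitFor() until spark-submit exits.
    // Each thread is short-lived, but at peak time thousands of them
    // can be alive at once.
    val waiter = new Thread(new Runnable {
      override def run(): Unit = {
        val exitCode = process.waitFor()
        println(s"spark-submit exited with code $exitCode")
      }
    }, "spark-submit-exit-code-waiter")
    waiter.start()
    waiter.join()
  }
}
```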
