Hi Chester,

Thanks for the feedback. A few of those are great candidates for
improvements to the launcher library.

On Wed, May 13, 2015 at 5:44 AM, Chester At Work <ches...@alpinenow.com>
wrote:

>      1) the client should not be private (unless an alternative is
> provided), so we can call it directly.
>

Patrick already touched on this subject, but I believe Client should be
kept private. If we want to expose functionality for code that launches
Spark apps, Spark should provide an interface for that, so that other
cluster managers can benefit. It also keeps the API more consistent
(everybody uses the same API regardless of the underlying cluster manager).


>      2) we need a way to stop the running YARN app programmatically (the
> PR is already submitted)
>

My first reaction to this was "with the app id, you can talk to YARN
directly and do that". But given what I wrote above, I guess it would make
sense for something like this to be exposed through the library too.
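To make the "talk to YARN directly" option concrete, here is a minimal sketch. It assumes the `yarn` CLI is available on the machine; `buildKillCommand` is a hypothetical helper written for this example, not part of any Spark or Hadoop API.

```java
import java.util.Arrays;
import java.util.List;

public class YarnKill {
    // Hypothetical helper: builds the YARN CLI invocation that stops a
    // running application by its id ("yarn application -kill <appId>").
    static List<String> buildKillCommand(String appId) {
        return Arrays.asList("yarn", "application", "-kill", appId);
    }

    public static void main(String[] args) throws Exception {
        // Example app id in YARN's usual application_<timestamp>_<seq> form.
        List<String> cmd = buildKillCommand("application_1431521234567_0042");
        System.out.println(String.join(" ", cmd));
        // To actually run it on a configured YARN client machine:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

The same thing can be done against the YARN Java client API instead of shelling out, which is what a library-level "stop" call would wrap.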


>      3) before we start the Spark job, we should have a callback to the
> application that provides the YARN container capacity (number of cores and
> max memory), so the Spark program will not set values beyond the maximums
> (PR submitted)
>

I'm not sure exactly what you mean here, but it feels like we're starting
to get into "wrapping the YARN API" territory. Someone who really cares
about that information can easily fetch it from YARN, the same way Spark
would.


>      4) callbacks could take the form of YARN app listeners, invoked on
> YARN status changes (start, in progress, failure, complete, etc.); the
> application can react to these events (in PR)
>

Exposing some sort of status for the running application does sound useful.


>      5) the YARN client passes arguments to the Spark program as main
> program arguments; we have experienced problems when passing a very large
> argument due to the length limit. For example, we serialize the arguments
> to JSON and encode them, then parse them as an argument. For datasets with
> wide columns, we run into the limit. Therefore, an alternative way of
> passing larger arguments is needed. We are experimenting with passing the
> args via an established Akka messaging channel.
>

I believe you're talking about command line arguments to your application
here? I'm not sure what Spark can do to alleviate this. With YARN, if you
need to pass a lot of information to your application, I'd recommend
writing it to a file and adding that file to "--files" when calling
spark-submit. It's not optimal, since the file will be distributed to all
executors (not just the AM), but it should help if you're really running
into problems, and it's simpler than using Akka.
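A rough sketch of that file-based approach (the file name and JSON payload here are made up for illustration): write the oversized payload to a local file, ship it with "--files", and have the application read the file instead of parsing a huge command line argument.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LargeArgsViaFile {
    // Write the oversized payload (e.g. a JSON-serialized argument set)
    // to a local file before calling spark-submit.
    static Path writeArgsFile(String json) throws IOException {
        Path path = Paths.get("app-args.json");  // name is arbitrary
        Files.write(path, json.getBytes(StandardCharsets.UTF_8));
        return path;
    }

    // On the application side, read the file back. When shipped with
    // --files, the file shows up in the container's working directory
    // under its own name.
    static String readArgsFile(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String json = "{\"columns\": [\"c1\", \"c2\", \"c3\"]}";
        Path path = writeArgsFile(json);
        // Then: spark-submit --files app-args.json ... <other options>
        System.out.println(readArgsFile(path).equals(json));  // true
        Files.deleteIfExists(path);
    }
}
```

This sidesteps the OS argument-length limit entirely, since only the file name travels on the command line.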


>     6) the Spark YARN client in yarn-cluster mode right now is essentially
> a batch job with no communication once it is launched. We need to
> establish a communication channel so that logs, errors, status updates,
> progress bars, execution stages, etc. can be displayed on the application
> side. We added an Akka communication channel for this (working on a PR).
>

This is another thing I'm a little unsure about. It seems that if you
really need this, you can implement your own SparkListener to collect all
that data and then transfer it to wherever you need it to be. Exposing
something like this in Spark would mean having to support some new RPC
mechanism as a public API, which is kind of burdensome. You mention you've
already built something yourself, which is an indication that this can be
done with existing Spark APIs, and that makes it not a great candidate for
a core Spark feature. I guess you can't get logs with that mechanism, but
you can use a custom log4j configuration if you really want to pipe logs
to a different location.
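The listener pattern can be sketched as below. Note this is a simplified stand-in, not the real org.apache.spark.scheduler.SparkListener interface (which has many more callbacks, e.g. for stages and tasks); the point is just the shape: register a callback for scheduler events and forward them wherever you need.

```java
import java.util.ArrayList;
import java.util.List;

public class ListenerSketch {
    // Simplified stand-in for Spark's SparkListener interface.
    interface JobListener {
        void onJobStart(int jobId);
        void onJobEnd(int jobId, boolean succeeded);
    }

    // A listener that forwards events to some external channel; here it
    // just records them, but it could push them over a socket, Akka, etc.
    static class ForwardingListener implements JobListener {
        final List<String> events = new ArrayList<>();
        public void onJobStart(int jobId) { events.add("start:" + jobId); }
        public void onJobEnd(int jobId, boolean ok) {
            events.add((ok ? "success:" : "failure:") + jobId);
        }
    }

    public static void main(String[] args) {
        ForwardingListener listener = new ForwardingListener();
        // In real code Spark fires these as the job runs (after
        // sparkContext.addSparkListener(listener)); simulated here.
        listener.onJobStart(0);
        listener.onJobEnd(0, true);
        System.out.println(listener.events);  // [start:0, success:0]
    }
}
```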

But those are a good starting point - knowing what people need is the first
step in adding a new feature. :-)

-- 
Marcelo