GitHub user chesterxgchen opened a pull request:
https://github.com/apache/spark/pull/2786
[SPARK-3913] Spark Yarn Client API change to expose Yarn Resource Capacity,
Yarn Application Listener and KillApplication APIs
Spark Yarn Client API change to expose Yarn Resource Capacity, Yarn
Application Listener and KillApplication APIs
When working with Spark in YARN cluster mode, we have the following issues:
1) We don't know YARN's maximum capacity (memory and cores) before we specify
the number of executors and the memory for the Spark driver and executors. If
we set too large a number, the job can exceed the limit and get killed. It
would be better to let the application know the YARN resource capacity ahead
of time so the Spark config can be adjusted dynamically.
2) Once the job has started, we would like some feedback from the YARN
application. Currently, the Spark client essentially blocks the call and only
returns when the job has finished, failed, or been killed. If the job runs for
a few hours, we have no idea how far it has gone: its progress, resource
usage, tracking URL, etc. This pull request will not completely solve issue
#2, but it exposes the YARN application status, such as when the job is
started, killed, or finished, the tracking URL, and some limited progress
reporting (on CDH5 we found the progress only reports 0, 10 and 100%).
I will open another pull request to address the YARN application / Spark job
communication issue; that is not covered here.
3) If we decide to stop the Spark job, the Spark YARN Client exposes a stop
method, but in many cases that stop method does not stop the YARN application.
So we need to expose the YARN client's killApplication() API through the
Spark client.
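For context, here is a minimal sketch of what such a kill hook could look
like on top of Hadoop's YarnClient; the wrapper object and where the method
lives are assumptions for illustration, not the PR's actual API:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.yarn.api.records.ApplicationId
    import org.apache.hadoop.yarn.client.api.YarnClient

    // Hypothetical wrapper: the PR's real method name and location may differ.
    object SparkYarnKill {
      def killApplication(appId: ApplicationId, conf: Configuration): Unit = {
        val yarnClient = YarnClient.createYarnClient()
        yarnClient.init(conf)
        yarnClient.start()
        try {
          // Ask the ResourceManager to kill the running application.
          yarnClient.killApplication(appId)
        } finally {
          yarnClient.stop()
        }
      }
    }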
The proposed change is to modify the Client constructor: the first argument
changes from ClientArguments to a function
    YarnResourceCapacity => ClientArguments
where YarnResourceCapacity contains YARN's max memory and virtual cores as
well as the overheads.
This allows the application to adjust its memory and core settings
accordingly. Existing applications that want to ignore the
YarnResourceCapacity can simply pass a function such as
    def toArgs(capacity: YarnResourceCapacity) = new ClientArguments(...)
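As a rough sketch of how a caller might supply that function (the
YarnResourceCapacity field names and the ClientArguments constructor used
below are assumptions for illustration, not the PR's confirmed API):

    import org.apache.spark.SparkConf
    import org.apache.spark.deploy.yarn.ClientArguments

    // Hypothetical field names; the PR may expose the capacity differently.
    case class YarnResourceCapacity(maxMemoryMB: Int,
                                    maxVirtualCores: Int,
                                    memoryOverheadMB: Int)

    // Build ClientArguments only after seeing the cluster capacity, clamping
    // the executor memory request to what YARN can actually grant.
    def toArgs(capacity: YarnResourceCapacity): ClientArguments = {
      val requestedExecutorMemMB = 8192
      val executorMemMB = math.min(requestedExecutorMemMB,
        capacity.maxMemoryMB - capacity.memoryOverheadMB)
      val args = Array(
        "--executor-memory", s"${executorMemMB}m",
        "--executor-cores", math.min(4, capacity.maxVirtualCores).toString)
      new ClientArguments(args, new SparkConf())  // constructor signature assumed
    }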
We also define a YarnApplicationListener interface that exposes some of the
information from the YarnApplicationReport.
    Client.addYarnApplicationListener(listener)
allows callers to receive callbacks at different states of the application,
so they can react accordingly.
For example, the onApplicationInit() callback is invoked when the AppId is
available but the application has not yet started. One can use this AppId to
kill the application if the run is no longer desired.
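A minimal sketch of how such a listener might look and be registered; only
onApplicationInit() is named above, so the other callback names and
signatures here are assumptions about the proposed interface:

    import org.apache.hadoop.yarn.api.records.ApplicationId

    // Hypothetical callback set; the PR's actual interface may differ.
    trait YarnApplicationListener {
      def onApplicationInit(appId: ApplicationId): Unit
      def onApplicationStart(appId: ApplicationId, trackingUrl: String): Unit
      def onApplicationProgress(appId: ApplicationId, progress: Float): Unit
      def onApplicationEnd(appId: ApplicationId, finalState: String): Unit
    }

    // Example listener: remember the AppId as soon as it is known (so the job
    // can be killed later) and log the tracking URL and progress as they arrive.
    class LoggingListener extends YarnApplicationListener {
      @volatile var appId: Option[ApplicationId] = None

      override def onApplicationInit(id: ApplicationId): Unit = {
        appId = Some(id)  // AppId is available before the application starts
      }
      override def onApplicationStart(id: ApplicationId, trackingUrl: String): Unit =
        println(s"Application $id started, tracking URL: $trackingUrl")
      override def onApplicationProgress(id: ApplicationId, progress: Float): Unit =
        println(s"Application $id progress: ${progress * 100}%")
      override def onApplicationEnd(id: ApplicationId, finalState: String): Unit =
        println(s"Application $id finished in state $finalState")
    }

    // client.addYarnApplicationListener(new LoggingListener())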
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/AlpineNow/spark SPARK-3913
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2786.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2786
----
commit fd66c16a34af149e16b2af8de742044ea32dd332
Author: chesterxgchen <[email protected]>
Date: 2014-10-12T03:33:05Z
[SPARK-3913]
Spark Yarn Client API change to expose Yarn Resource Capacity, Yarn
Application Listener and KillApplication APIs
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]