Github user chesterxgchen commented on the pull request:
https://github.com/apache/spark/pull/2786#issuecomment-62060857
Tom,
thanks for reviewing.
I am still working on the second PR, which I haven't submitted yet. The
code is currently used in our application; I am pulling it out of our
codebase to make a PR for it. The current code only uses Akka for the
communication, and I would like to add Netty support as well before I submit
the pull request; that's why I haven't submitted it yet.
The following are the use cases in our application, which show how the
new APIs are used. I assume other applications will have similar use cases.
Our application doesn't use the spark-submit command line to run Spark. We
submit both Hadoop and Spark jobs directly from our servlet application
(jetty). We are deploying in Yarn cluster mode and invoke the Spark Client (in
the yarn module) directly. Client can't call System.exit, which would shut down
the jetty JVM.
Our application submits and stops Spark jobs, monitors Spark job
progress, and gets state back from the Spark jobs (for example, bad-data
counters), along with logging and exceptions. So far the communication is
one-way after the job is submitted; we will move to two-way communication soon
(for example, a long-running Spark context with different short Spark actions:
distinct counts, sampling, filters, transformations, etc.; a pipeline of
actions, but with feedback needed on each action (visualization etc.)).
In this particular pull request we only address very limited requirements;
the next PR will address the rest of the communication mentioned above.
1) Get Yarn container capacities before submitting Yarn applications
In our Spark job, we can use this callback in two ways:
A) We cap the requested memory if it is too large. For
example, if the spark.executor.memory supplied by the client is larger than the
Yarn container max memory, we reset spark.executor.memory to the Yarn max
container memory minus overhead and send a message to the application (a
UI message) telling the user that we reset the memory. Or we could simply throw
an exception without submitting the job.
Users might also use the information about virtual cores for other
validation.
B) We can dynamically estimate the executor memory based on data size (if
you have that information from previous processing steps) and the max memory
available, rather than directly using a fixed memory size and potentially
getting killed if it is too large.
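The capping logic in (A) can be sketched as follows. This is only an illustration, not the PR's actual API: the object and method names are hypothetical, and the Yarn max container memory and overhead would really come from the proposed capacity callback rather than being passed in by hand.

```scala
// Hypothetical sketch: cap the requested executor memory against the
// Yarn max container memory, as described in use case (A) above.
// All names here are illustrative assumptions, not the PR's API.
object MemoryCap {
  // Returns the (possibly capped) executor memory in MB and a flag
  // indicating whether a cap was applied, so the caller can surface
  // a UI message (or choose to throw instead of submitting).
  def capExecutorMemory(requestedMB: Int,
                        yarnMaxContainerMemoryMB: Int,
                        overheadMB: Int): (Int, Boolean) = {
    val maxUsableMB = yarnMaxContainerMemoryMB - overheadMB
    if (requestedMB > maxUsableMB) (maxUsableMB, true)
    else (requestedMB, false)
  }
}
```

For example, with a Yarn max of 8192 MB and 384 MB overhead, a 10240 MB request would be capped to 7808 MB and the caller notified.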
2) Add callbacks via a listener to monitor Yarn application progress
We are using the listener callbacks to show progress (limited
information):
1) We have a tracking URL from Yarn that takes us directly to the
Hadoop cluster application management page.
As soon as the Yarn container is created and the job is submitted, we
have the tracking URL from Yarn (we need to watch out for invalid URLs); at
this point you can put the URL in the UI, even though the Spark job has not
started yet.
2) We display a progress bar in the UI with the callback.
For CDH5 we only got 0%, 10% and 100%, which is not very useful, but still
gives the customer some early feedback.
3) We get the Yarn application ID when the job is started, which can be
used for tracking progress or killing the app.
(With the next PR, we will be able to use the tracking URL directly to open the
Spark UI page, show Spark job iterations and Spark-specific progress, etc.;
currently all of the above is implemented in our application.)
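The three callbacks above could be shaped roughly like the sketch below. To be clear, the trait and method names are assumptions for illustration only; they are not the actual listener interface of this PR.

```scala
// Hypothetical sketch of a progress listener for use cases (1)-(3) above:
// tracking URL, progress percentage, and application ID callbacks.
// Names are illustrative assumptions, not the PR's API.
trait YarnAppListener {
  def onTrackingUrl(url: String): Unit     // (1) link to the Yarn app page
  def onProgress(percent: Int): Unit       // (2) drive a UI progress bar
  def onApplicationId(appId: String): Unit // (3) for tracking or kill
}

// A simple implementation that records events, e.g. to feed a UI.
class RecordingListener extends YarnAppListener {
  val events = scala.collection.mutable.ArrayBuffer[String]()
  def onTrackingUrl(url: String): Unit = events += s"url:$url"
  def onProgress(percent: Int): Unit = events += s"progress:$percent"
  def onApplicationId(appId: String): Unit = events += s"appId:$appId"
}
```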
3) Expose the Yarn kill-application API
Yes, you can invoke it directly from the command line with
yarn application -kill appid.
But since we need to call it from our application, we need an API to do
this. In our application, if the client starts a job and then decides to stop it
(running too long, changed parameters, etc.), we have to use the kill API to
kill it, as the stop API doesn't stop it.
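The stop-then-kill behavior described here can be sketched like this. The `SparkJobHandle` trait and its methods are hypothetical stand-ins for whatever handle the application holds; in reality the kill path would go through Yarn's kill-application API.

```scala
// Hypothetical sketch of falling back to kill when stop doesn't work,
// as described above. SparkJobHandle and its methods are illustrative
// assumptions, not the PR's API.
trait SparkJobHandle {
  def stop(): Unit      // polite stop; may have no effect
  def kill(): Unit      // would go through Yarn's kill-application API
  def isRunning: Boolean
}

// Try the stop API first; if the application is still running, kill it.
def stopOrKill(job: SparkJobHandle): String = {
  job.stop()
  if (job.isRunning) { job.kill(); "killed" } else "stopped"
}
```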
Hope this gives you a better picture of why this PR is important to
us.
I will move faster on the next PR mentioned.
Thanks
Chester