GitHub user Sherry302 opened a pull request:
https://github.com/apache/spark/pull/14312
[SPARK-15857] Add caller context in Spark: invoke YARN/HDFS API to set…
## What changes were proposed in this pull request?
1. Pass 'jobId' to Task.
2. Add a new function 'setCallerContext' in Utils. 'setCallerContext' calls
the APIs of 'org.apache.hadoop.ipc.CallerContext' to set up the Spark caller
context, which is then written into the HDFS hdfs-audit.log and the Yarn
resource manager log (see the sketch after this list).
3. 'setCallerContext' is called in the Yarn Client, ApplicationMaster, and
Task classes.
The Spark caller context written into the HDFS log will be
"JobId_StageID_stageAttemptId_taskID_attemptNumber on Spark", and the Spark
caller context written into the Yarn log will be "{spark.app.name} running on Spark".
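A minimal sketch of what such a helper could look like follows. This is an
illustration, not the patch itself: the reflection-based shape is an
assumption, chosen so the code still compiles and runs on Hadoop versions
that predate org.apache.hadoop.ipc.CallerContext (added in Hadoop 2.8):

```scala
import scala.util.Try

object CallerContextSketch {
  /** Hedged sketch of the 'setCallerContext' helper described above.
   *  Reflection is an assumption here, so that Spark still runs on Hadoop
   *  versions that do not ship org.apache.hadoop.ipc.CallerContext.
   *  Returns true if the caller context was set. */
  def setCallerContext(context: String): Boolean = {
    Try {
      val callerContextClass = Class.forName("org.apache.hadoop.ipc.CallerContext")
      val builderClass = Class.forName("org.apache.hadoop.ipc.CallerContext$Builder")
      // Equivalent to: new CallerContext.Builder(context).build()
      val builder = builderClass.getConstructor(classOf[String]).newInstance(context)
      val callerContext = builderClass.getMethod("build").invoke(builder)
      // Equivalent to: CallerContext.setCurrent(callerContext); the method
      // is static, hence the null receiver.
      callerContextClass.getMethod("setCurrent", callerContextClass).invoke(null, callerContext)
    }.isSuccess
  }
}
```

On an older Hadoop the Try fails at Class.forName and the helper silently
does nothing, which matches the fallback behavior described at the end of
this section.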
## How was this patch tested?
Manual tests against Spark applications in Yarn client mode and cluster
mode, checking that the Spark caller contexts were successfully written into
the HDFS hdfs-audit.log and the Yarn resource manager log.
For example, running SparkKMeans on Spark:
In the Yarn resource manager log, there is a record with the Spark caller
context:
...
2016-07-21 13:36:26,318 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger:
  USER=wyang  IP=127.0.0.1  OPERATION=Submit Application Request
  TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1469125587135_0004
  CALLERCONTEXT=SparkKMeans running on Spark
...
In the HDFS hdfs-audit.log, there are records with Spark caller contexts:
...
2016-07-21 13:38:30,799 INFO FSNamesystem.audit: allowed=true  ugi=wyang (auth:SIMPLE)
  ip=/127.0.0.1  cmd=getfileinfo  src=/lr_big.txt/_spark_metadata  dst=null
  perm=null  proto=rpc  callerContext=SparkKMeans running on Spark
...
2016-07-21 13:39:35,584 INFO FSNamesystem.audit: allowed=true  ugi=wyang (auth:SIMPLE)
  ip=/127.0.0.1  cmd=open  src=/lr_big.txt  dst=null  perm=null  proto=rpc
  callerContext=JobId_0_StageID_0_stageAttemptId_0_taskID_1_attemptNumber_0 on Spark
...
If the Hadoop version on which Spark runs does not provide the CallerContext
API, no Spark caller context will appear in those logs.
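For illustration only, a task-side call might assemble the context string as
below; the local values are hypothetical stand-ins for the fields a Task
actually carries:

```scala
// Hypothetical stand-ins for the fields a Task would actually carry.
val jobId = 0
val stageId = 0
val stageAttemptId = 0
val taskId = 1L
val attemptNumber = 0

// Yields the string seen in hdfs-audit.log above:
// "JobId_0_StageID_0_stageAttemptId_0_taskID_1_attemptNumber_0 on Spark"
val context = s"JobId_${jobId}_StageID_${stageId}_stageAttemptId_${stageAttemptId}" +
  s"_taskID_${taskId}_attemptNumber_${attemptNumber} on Spark"
CallerContextSketch.setCallerContext(context)
```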
… up caller context
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Sherry302/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14312.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14312
----
commit 38c4f58dbf30d541260ee1b0381993a9bec393f8
Author: Weiqing Yang <[email protected]>
Date: 2016-07-22T01:21:03Z
[SPARK-15857]Add caller context in Spark: invoke YARN/HDFS API to set up
caller context
----