Re: welcome a new batch of committers

2018-10-04 Thread Weiqing Yang
Congratulations everyone!

On Wed, Oct 3, 2018 at 11:14 PM, Driesprong, Fokko 
wrote:

> Congratulations all!
>
> Op wo 3 okt. 2018 om 23:03 schreef Bryan Cutler :
>
>> Congratulations everyone! Very well deserved!!
>>
>> On Wed, Oct 3, 2018, 1:59 AM Reynold Xin  wrote:
>>
>>> Hi all,
>>>
>>> The Apache Spark PMC has recently voted to add several new committers to
>>> the project, for their contributions:
>>>
>>> - Shane Knapp (contributor to infra)
>>> - Dongjoon Hyun (contributor to ORC support and other parts of Spark)
>>> - Kazuaki Ishizaki (contributor to Spark SQL)
>>> - Xingbo Jiang (contributor to Spark Core and SQL)
>>> - Yinan Li (contributor to Spark on Kubernetes)
>>> - Takeshi Yamamuro (contributor to Spark SQL)
>>>
>>> Please join me in welcoming them!
>>>
>>>


Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-28 Thread Weiqing Yang
Congratulations, Jerry!

On Mon, Aug 28, 2017 at 6:44 PM, Yanbo Liang  wrote:

> Congratulations, Jerry.
>
> On Tue, Aug 29, 2017 at 9:42 AM, John Deng  wrote:
>
>>
>> Congratulations, Jerry !
>>
>> On 8/29/2017 09:28, Matei Zaharia wrote:
>>
>> Hi everyone,
>>
>> The PMC recently voted to add Saisai (Jerry) Shao as a
>> committer. Saisai has been contributing to many areas of the
>> project for a long time, so it’s great to see him join.
>> Join me in thanking and congratulating him!
>>
>> Matei
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers

2017-08-08 Thread Weiqing Yang
Congratulations, Hyukjin and Sameer!

On Mon, Aug 7, 2017 at 8:56 AM, Dongjoon Hyun 
wrote:

> Great!
>
> Congratulations, Hyukjin and Sameer!
>
> Dongjoon.
>
> On Mon, Aug 7, 2017 at 8:55 AM, Bai, Dave  wrote:
>
>> Congrats, leveled up!=)
>>
>> On 8/7/17, 10:53 AM, "Matei Zaharia"  wrote:
>>
>> >Hi everyone,
>> >
>> >The Spark PMC recently voted to add Hyukjin Kwon and Sameer Agarwal as
>> >committers. Join me in congratulating both of them and thanking them for
>> >their contributions to the project!
>> >
>> >Matei
>> >-
>> >To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: Spark Hbase Connector

2017-06-29 Thread Weiqing Yang
https://github.com/hortonworks-spark/shc/releases (v1.x.x-2.1 for Spark 2.1)
https://github.com/hortonworks-spark/shc/tree/branch-2.1 (for Spark 2.1)
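
A minimal read sketch, based on the catalog-driven usage shown in the SHC
README (the table name, column family, and schema below are hypothetical,
and the exact catalog format may differ between SHC releases), assuming the
shc-core jar is on the driver and executor classpath:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Hypothetical catalog: maps HBase table "table1" (column family "cf1")
// to a DataFrame with a string row key and one integer column.
val catalog =
  s"""{
     |  "table":{"namespace":"default", "name":"table1"},
     |  "rowkey":"key",
     |  "columns":{
     |    "col0":{"cf":"rowkey", "col":"key", "type":"string"},
     |    "col1":{"cf":"cf1", "col":"col1", "type":"int"}
     |  }
     |}""".stripMargin

val spark = SparkSession.builder().appName("shc-read-sketch").getOrCreate()

// Read the HBase table through the SHC data source as a DataFrame.
val df = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

df.show()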

On Thu, Jun 29, 2017 at 4:36 PM, Ted Yu  wrote:

> Please take a look at HBASE-16179 (work in progress).
>
> On Thu, Jun 29, 2017 at 4:30 PM, Raj, Deepu 
> wrote:
>
>> Hi Team,
>>
>>
>>
>> Is there a stable Spark HBase connector for Spark 2.0?
>>
>>
>>
>> Thanks,
>>
>> Deepu Raj
>>
>>
>>
>
>


Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-08 Thread Weiqing Yang
+1 (non-binding)


Environment: CentOS Linux release 7.0.1406 (Core) / openjdk version
"1.8.0_111"



./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
-Dpyspark -Dsparkr -DskipTests clean package

./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
-Dpyspark -Dsparkr test



On Tue, Nov 8, 2016 at 7:38 PM, Liwei Lin  wrote:

> +1 (non-binding)
>
> Cheers,
> Liwei
>
> On Tue, Nov 8, 2016 at 9:50 PM, Ricardo Almeida <
> ricardo.alme...@actnowib.com> wrote:
>
>> +1 (non-binding)
>>
>> over Ubuntu 16.10, Java 8 (OpenJDK 1.8.0_111) built with Hadoop 2.7.3,
>> YARN, Hive
>>
>>
>> On 8 November 2016 at 12:38, Herman van Hövell tot Westerflier <
>> hvanhov...@databricks.com> wrote:
>>
>>> +1
>>>
>>> On Tue, Nov 8, 2016 at 7:09 AM, Reynold Xin  wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and
 passes if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 2.0.2
 [ ] -1 Do not release this package because ...


 The tag to be voted on is v2.0.2-rc3 (584354eaac02531c9584188b143367ba694b0c34)

 This release candidate resolves 84 issues:
 https://s.apache.org/spark-2.0.2-jira

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1214/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/


 Q: How can I help test this release?
 A: If you are a Spark user, you can help us test this release by taking
 an existing Spark workload and running on this release candidate, then
 reporting any regressions from 2.0.1.

 Q: What justifies a -1 vote for this release?
 A: This is a maintenance release in the 2.0.x series. Bugs already
 present in 2.0.1, missing features, or bugs related to new features will
 not necessarily block this release.

 Q: What fix version should I use for patches merging into branch-2.0
 from now on?
 A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
 (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.

>>>
>>>
>>
>


Re: [VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-04 Thread Weiqing Yang
+1 (non-binding)

Built and tested on CentOS Linux release 7.0.1406 / openjdk version
"1.8.0_111".

On Fri, Nov 4, 2016 at 9:06 AM, Ricardo Almeida <
ricardo.alme...@actnowib.com> wrote:

> +1 (non-binding)
>
> tested over Ubuntu / OpenJDK 1.8.0_111
>
> On 4 November 2016 at 10:00, Sean Owen  wrote:
>
>> Likewise, ran my usual tests on Ubuntu with 
>> yarn/hive/hive-thriftserver/hadoop-2.6
>> on JDK 8 and all passed. Sigs and licenses are OK. +1
>>
>>
>> On Thu, Nov 3, 2016 at 7:57 PM Herman van Hövell tot Westerflier <
>> hvanhov...@databricks.com> wrote:
>>
>>> +1
>>>
>>> On Thu, Nov 3, 2016 at 6:58 PM, Michael Armbrust wrote:
>>>
>>> +1
>>>
>>> On Wed, Nov 2, 2016 at 5:40 PM, Reynold Xin  wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.6.3. The vote is open until Sat, Nov 5, 2016 at 18:00 PDT and passes if a
>>> majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.6.3
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> The tag to be voted on is v1.6.3-rc2 (1e860747458d74a4ccbd081103a0542a2367b14b)
>>>
>>> This release candidate addresses 52 JIRA tickets:
>>> https://s.apache.org/spark-1.6.3-jira
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1212/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-docs/
>>>
>>>
>>> ===
>>> == How can I help test this release?
>>> ===
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions from 1.6.2.
>>>
>>> ===
>>> == What justifies a -1 vote for this release?
>>> ===
>>> This is a maintenance release in the 1.6.x series.  Bugs already present
>>> in 1.6.2, missing features, or bugs related to new features will not
>>> necessarily block this release.
>>>
>>>
>>>
>>>
>


Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-28 Thread Weiqing Yang
+1 (non-binding)



Environment: CentOS Linux release 7.0.1406 / openjdk version "1.8.0_111"/ R
version 3.3.1


./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
-Dpyspark -Dsparkr -DskipTests clean package

./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
-Dpyspark -Dsparkr test


Best,

Weiqing

On Fri, Oct 28, 2016 at 10:06 AM, Ryan Blue 
wrote:

> +1 (non-binding)
>
> Checksums and build are fine. The tarball matches the release tag except
> that .gitignore is missing. It would be nice if the tarball were created
> using git archive so that the commit ref is present, but otherwise
> everything looks fine.
>
>
> On Thu, Oct 27, 2016 at 12:18 AM, Reynold Xin  wrote:
>
>> Greetings from Spark Summit Europe at Brussels.
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes if
>> a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.2
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.2-rc1 (1c2908eeb8890fdc91413a3f5bad2bb3d114db6c)
>>
>> This release candidate resolves 75 issues: https://s.apache.org/spark-2.0.2-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1208/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/
>>
>>
>> Q: How can I help test this release?
>> A: If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions from 2.0.1.
>>
>> Q: What justifies a -1 vote for this release?
>> A: This is a maintenance release in the 2.0.x series. Bugs already
>> present in 2.0.1, missing features, or bugs related to new features will
>> not necessarily block this release.
>>
>> Q: What fix version should I use for patches merging into branch-2.0 from
>> now on?
>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>> (i.e. RC2) is cut, I will change the fix version of those patches to 2.0.2.
>>
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: welcoming Xiao Li as a committer

2016-10-04 Thread Weiqing Yang
Congrats Xiao!

On Tue, Oct 4, 2016 at 6:40 AM, Kevin  wrote:

> Congratulations Xiao!!
>
> Sent from my iPhone
>
> On Oct 4, 2016, at 3:59 AM, Tarun Kumar  wrote:
>
> Congrats Xiao.
>
> Thanks
> Tarun
> On Tue, 4 Oct 2016 at 12:57 PM, Cheng Lian  wrote:
>
>> Congratulations!!!
>>
>>
>> Cheng
>>
>> On Tue, Oct 4, 2016 at 1:46 PM, Reynold Xin  wrote:
>>
>> Hi all,
>>
>> Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark
>> committer. Xiao has been a super active contributor to Spark SQL. Congrats
>> and welcome, Xiao!
>>
>> - Reynold
>>
>>
>>


Re: [discuss] Spark 2.x release cadence

2016-09-29 Thread Weiqing Yang
Sorry. I think I just replied to the wrong thread. :(


WQ

On Thu, Sep 29, 2016 at 10:58 AM, Weiqing Yang <yangweiqing...@gmail.com>
wrote:

> +1 (non-binding)
>
>
>
> RC4 is compiled and tested on the system: CentOS Linux release
> 7.0.1406 / openjdk 1.8.0_102 / R 3.3.1
>
>  All tests passed.
>
>
>
> ./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
> -Dpyspark -Dsparkr -DskipTests clean package
>
> ./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
> -Dpyspark -Dsparkr test
>
>
>
>
>
> Best,
>
> Weiqing
>
> On Thu, Sep 29, 2016 at 8:02 AM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> Regarding documentation debt, is there a reason not to deploy
>> documentation updates more frequently than releases?  I recall this
>> used to be the case.
>>
>> On Wed, Sep 28, 2016 at 3:35 PM, Joseph Bradley <jos...@databricks.com>
>> wrote:
>> > +1 for 4 months.  With QA taking about a month, that's very reasonable.
>> >
>> > My main ask (especially for MLlib) is for contributors and committers to
>> > take extra care not to delay on updating the Programming Guide for new
>> APIs.
>> > Documentation debt often collects and has to be paid off during QA, and
>> a
>> > longer cycle will exacerbate this problem.
>> >
>> > On Wed, Sep 28, 2016 at 7:30 AM, Tom Graves
>> <tgraves...@yahoo.com.invalid>
>> > wrote:
>> >>
>> >> +1 to 4 months.
>> >>
>> >> Tom
>> >>
>> >>
>> >> On Tuesday, September 27, 2016 2:07 PM, Reynold Xin <
>> r...@databricks.com>
>> >> wrote:
>> >>
>> >>
>> >> We are 2 months past releasing Spark 2.0.0, an important milestone for
>> the
>> >> project. Spark 2.0.0 deviated (took 6 months) from the regular release
>> cadence
>> >> we had for the 1.x line, and we never explicitly discussed what the
>> release
>> >> cadence should look like for 2.x. Thus this email.
>> >>
>> >> During Spark 1.x, roughly every three months we make a new 1.x feature
>> >> release (e.g. 1.5.0 comes out three months after 1.4.0). Development
>> >> happened primarily in the first two months, and then a release branch
>> was
>> >> cut at the end of month 2, and the last month was reserved for QA and
>> >> release preparation.
>> >>
>> >> During 2.0.0 development, I really enjoyed the longer release cycle
>> >> because there was a lot of major changes happening and the longer time
>> was
>> >> critical for thinking through architectural changes as well as API
>> design.
>> >> While I don't expect the same degree of drastic changes in a 2.x
>> feature
>> >> release, I do think it'd make sense to increase the length of release
>> cycle
>> >> so we can make better designs.
>> >>
>> >> My strawman proposal is to maintain a regular release cadence, as we
>> did
>> >> in Spark 1.x, and increase the cycle from 3 months to 4 months. This
>> >> effectively gives us ~50% more time to develop (in reality it'd be
>> slightly
>> >> less than 50% since longer dev time also means longer QA time). As for
>> >> maintenance releases, I think those should still be cut on-demand,
>> similar
>> >> to Spark 1.x, but more aggressively.
>> >>
>> >> To put this into perspective, 4-month cycle means we will release Spark
>> >> 2.1.0 at the end of Nov or early Dec (and branch cut / code freeze at
>> the
>> >> end of Oct).
>> >>
>> >> I am curious what others think.
>> >>
>> >>
>> >>
>> >>
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: [discuss] Spark 2.x release cadence

2016-09-29 Thread Weiqing Yang
+1 (non-binding)



RC4 is compiled and tested on the system: CentOS Linux release
7.0.1406 / openjdk 1.8.0_102 / R 3.3.1

 All tests passed.



./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
-Dpyspark -Dsparkr -DskipTests clean package

./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
-Dpyspark -Dsparkr test





Best,

Weiqing

On Thu, Sep 29, 2016 at 8:02 AM, Cody Koeninger  wrote:

> Regarding documentation debt, is there a reason not to deploy
> documentation updates more frequently than releases?  I recall this
> used to be the case.
>
> On Wed, Sep 28, 2016 at 3:35 PM, Joseph Bradley 
> wrote:
> > +1 for 4 months.  With QA taking about a month, that's very reasonable.
> >
> > My main ask (especially for MLlib) is for contributors and committers to
> > take extra care not to delay on updating the Programming Guide for new
> APIs.
> > Documentation debt often collects and has to be paid off during QA, and a
> > longer cycle will exacerbate this problem.
> >
> >> On Wed, Sep 28, 2016 at 7:30 AM, Tom Graves wrote:
> >>
> >> +1 to 4 months.
> >>
> >> Tom
> >>
> >>
> >> On Tuesday, September 27, 2016 2:07 PM, Reynold Xin <
> r...@databricks.com>
> >> wrote:
> >>
> >>
> >> We are 2 months past releasing Spark 2.0.0, an important milestone for
> the
> >> project. Spark 2.0.0 deviated (took 6 months) from the regular release
> cadence
> >> we had for the 1.x line, and we never explicitly discussed what the
> release
> >> cadence should look like for 2.x. Thus this email.
> >>
> >> During Spark 1.x, roughly every three months we make a new 1.x feature
> >> release (e.g. 1.5.0 comes out three months after 1.4.0). Development
> >> happened primarily in the first two months, and then a release branch
> was
> >> cut at the end of month 2, and the last month was reserved for QA and
> >> release preparation.
> >>
> >> During 2.0.0 development, I really enjoyed the longer release cycle
> >> because there was a lot of major changes happening and the longer time
> was
> >> critical for thinking through architectural changes as well as API
> design.
> >> While I don't expect the same degree of drastic changes in a 2.x feature
> >> release, I do think it'd make sense to increase the length of release
> cycle
> >> so we can make better designs.
> >>
> >> My strawman proposal is to maintain a regular release cadence, as we did
> >> in Spark 1.x, and increase the cycle from 3 months to 4 months. This
> >> effectively gives us ~50% more time to develop (in reality it'd be
> slightly
> >> less than 50% since longer dev time also means longer QA time). As for
> >> maintenance releases, I think those should still be cut on-demand,
> similar
> >> to Spark 1.x, but more aggressively.
> >>
> >> To put this into perspective, 4-month cycle means we will release Spark
> >> 2.1.0 at the end of Nov or early Dec (and branch cut / code freeze at
> the
> >> end of Oct).
> >>
> >> I am curious what others think.
> >>
> >>
> >>
> >>
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Add Caller Context in Spark

2016-06-09 Thread Weiqing Yang
Yes, it is a string. Jira SPARK-15857
<https://issues.apache.org/jira/browse/SPARK-15857> is created.

Thanks,
WQ

On Thu, Jun 9, 2016 at 4:40 PM, Reynold Xin <r...@databricks.com> wrote:

> Is this just to set some string? That makes sense. One thing you would
> need to make sure of is that Spark still works outside of Hadoop, and
> also with older versions of Hadoop.
>
> On Thu, Jun 9, 2016 at 4:37 PM, Weiqing Yang <yangweiqing...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Hadoop has implemented a feature of log tracing – caller context (Jira:
>> HDFS-9184 <https://issues.apache.org/jira/browse/HDFS-9184> and YARN-4349
>> <https://issues.apache.org/jira/browse/YARN-4349>). The motivation is to
>> better diagnose and understand how specific applications impact parts of
>> the Hadoop system and potential problems they may be creating (e.g.
>> overloading NN). As HDFS mentioned in HDFS-9184
>> <https://issues.apache.org/jira/browse/HDFS-9184>, for a given HDFS
>> operation, it's very helpful to track which upper level job issues it. The
>> upper level callers may be specific Oozie tasks, MR jobs, hive queries,
>> Spark jobs.
>>
>> Hadoop ecosystems like MapReduce, Tez (TEZ-2851
>> <https://issues.apache.org/jira/browse/TEZ-2851>), Hive (HIVE-12249
>> <https://issues.apache.org/jira/browse/HIVE-12249>, HIVE-12254
>> <https://issues.apache.org/jira/browse/HIVE-12254>) and Pig (PIG-4714
>> <https://issues.apache.org/jira/browse/PIG-4714>) have implemented their
>> caller contexts. Those systems invoke the HDFS client API and the YARN
>> client API to set up the caller context, and also expose an API for passing
>> a caller context into them.
>>
>> Lots of Spark applications run on YARN/HDFS. Spark can likewise implement
>> its caller context by invoking the HDFS/YARN APIs, and can also expose an
>> API so its upstream applications can set up their own caller contexts. In
>> the end, the Spark caller context written into the YARN/HDFS logs can be
>> associated with the task id, stage id, job id, and app id. That also makes
>> it much easier for Spark users to identify tasks, especially if Spark
>> supports a multi-tenant environment in the future.
>>
>> e.g.  Run SparkKmeans on Spark.
>>
>> In HDFS log:
>> …
>> 2016-05-25 15:36:23,748 INFO FSNamesystem.audit: allowed=true
>> ugi=yang(auth:SIMPLE)ip=/127.0.0.1cmd=getfileinfo
>> src=/data/mllib/kmeans_data.txt/_spark_metadata   dst=null
>> perm=null  proto=rpc callerContext=SparkKMeans
>> application_1464728991691_0009 running on Spark
>>
>>  2016-05-25 15:36:27,893 INFO FSNamesystem.audit: allowed=true
>> ugi=yang (auth:SIMPLE)ip=/127.0.0.1cmd=open
>> src=/data/mllib/kmeans_data.txt   dst=null   perm=null
>> proto=rpc
>> callerContext=JobID_0_stageID_0_stageAttemptId_0_taskID_0_attemptNumber_0 on
>> Spark
>> …
>>
>> “application_1464728991691_0009” is the application id.
>>
>> I do have code that works with spark master branch. I am going to create
>> a Jira. Please feel free to let me know if you have any concern or
>> comments.
>>
>> Thanks,
>> Qing
>>
>
>


Add Caller Context in Spark

2016-06-09 Thread Weiqing Yang
Hi,

Hadoop has implemented a feature of log tracing – caller context (Jira:
HDFS-9184 <https://issues.apache.org/jira/browse/HDFS-9184> and YARN-4349
<https://issues.apache.org/jira/browse/YARN-4349>). The motivation is to
better diagnose and understand how specific applications impact parts of
the Hadoop system and potential problems they may be creating (e.g.
overloading NN). As HDFS mentioned in HDFS-9184
<https://issues.apache.org/jira/browse/HDFS-9184>, for a given HDFS
operation, it's very helpful to track which upper level job issues it. The
upper level callers may be specific Oozie tasks, MR jobs, hive queries,
Spark jobs.

Hadoop ecosystems like MapReduce, Tez (TEZ-2851
<https://issues.apache.org/jira/browse/TEZ-2851>), Hive (HIVE-12249
<https://issues.apache.org/jira/browse/HIVE-12249>, HIVE-12254
<https://issues.apache.org/jira/browse/HIVE-12254>) and Pig (PIG-4714
<https://issues.apache.org/jira/browse/PIG-4714>) have implemented their
caller contexts. Those systems invoke the HDFS client API and the YARN
client API to set up the caller context, and also expose an API for passing
a caller context into them.

Lots of Spark applications run on YARN/HDFS. Spark can likewise implement
its caller context by invoking the HDFS/YARN APIs, and can also expose an
API so its upstream applications can set up their own caller contexts. In
the end, the Spark caller context written into the YARN/HDFS logs can be
associated with the task id, stage id, job id, and app id. That also makes
it much easier for Spark users to identify tasks, especially if Spark
supports a multi-tenant environment in the future.
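
As a rough sketch of the approach (not the actual patch; it assumes Hadoop's
CallerContext API from HDFS-9184 is available at runtime, and uses reflection
so that Spark still builds and runs against older Hadoop versions that do not
have it):

import scala.util.Try

// Sketch: set the Hadoop caller context for the current thread via
// reflection; silently becomes a no-op on Hadoop versions without HDFS-9184.
def setCallerContext(context: String): Boolean = Try {
  // org.apache.hadoop.ipc.CallerContext was introduced by HDFS-9184.
  val callerContextClass = Class.forName("org.apache.hadoop.ipc.CallerContext")
  val builderClass = Class.forName("org.apache.hadoop.ipc.CallerContext$Builder")
  val builder = builderClass.getConstructor(classOf[String]).newInstance(context)
  val callerContext = builderClass.getMethod("build").invoke(builder)
  callerContextClass
    .getMethod("setCurrent", callerContextClass)
    .invoke(null, callerContext)
}.isSuccess

// Hypothetical context string, matching the HDFS audit log example below:
// setCallerContext("SparkKMeans application_1464728991691_0009 running on Spark")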

e.g.  Run SparkKmeans on Spark.

In HDFS log:
…
2016-05-25 15:36:23,748 INFO FSNamesystem.audit: allowed=true
ugi=yang(auth:SIMPLE)ip=/127.0.0.1cmd=getfileinfo
src=/data/mllib/kmeans_data.txt/_spark_metadata   dst=null
perm=null  proto=rpc callerContext=SparkKMeans
application_1464728991691_0009 running on Spark

 2016-05-25 15:36:27,893 INFO FSNamesystem.audit: allowed=true
ugi=yang (auth:SIMPLE)ip=/127.0.0.1cmd=open
src=/data/mllib/kmeans_data.txt   dst=null   perm=null
proto=rpc
callerContext=JobID_0_stageID_0_stageAttemptId_0_taskID_0_attemptNumber_0 on
Spark
…

“application_1464728991691_0009” is the application id.

I do have code that works with spark master branch. I am going to create a
Jira. Please feel free to let me know if you have any concern or comments.

Thanks,
Qing