[
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976902#comment-15976902
]
Michael Schmeißer edited comment on SPARK-650 at 4/20/17 3:31 PM:
------------------------------------------------------------------
In a nutshell, we have our own class `MySerializer`, which is derived from
`org.apache.spark.serializer.JavaSerializer` and performs our custom
initialization in `MySerializer#newInstance` before calling the super method,
`org.apache.spark.serializer.JavaSerializer#newInstance`.
Then, when building the SparkConf for initialization of the SparkContext, we
add `pSparkConf.set("spark.closure.serializer",
MySerializer.class.getCanonicalName());`.
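The initialization-hook pattern described above can be sketched without any Spark dependencies as follows. This is a minimal illustration, not the actual GfK code: `BaseSerializer` stands in for `org.apache.spark.serializer.JavaSerializer`, and the body of `initOnce` is a placeholder for whatever executor-side setup is needed.

```java
// Stand-in for org.apache.spark.serializer.JavaSerializer: a base class
// whose newInstance() factory method we want to intercept.
class BaseSerializer {
    public Object newInstance() {
        return new Object(); // stands in for a real SerializerInstance
    }
}

// The subclass runs one-time custom initialization before delegating to
// the parent's factory method, exactly as the comment describes.
class MySerializer extends BaseSerializer {
    static volatile boolean initialized = false;

    private static synchronized void initOnce() {
        if (!initialized) {
            // custom executor-side initialization would go here
            initialized = true;
        }
    }

    @Override
    public Object newInstance() {
        initOnce();                 // run our hook first...
        return super.newInstance(); // ...then defer to the parent
    }
}
```

In the real setup, this class is then registered on the SparkConf via `pSparkConf.set("spark.closure.serializer", MySerializer.class.getCanonicalName());` as shown above, so the hook runs the first time the closure serializer is instantiated on each executor.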
We package this with our application JAR and it works, so I think you have to
look at your classpath configuration, [~mboes]. In our case, the JAR which
contains the closure serializer is listed in the following properties:
* driver.extraClassPath
* executor.extraClassPath
* yarn.secondary.jars
* spark.yarn.secondary.jars
* spark.driver.extraClassPath
* spark.executor.extraClassPath
If I recall correctly, the variants without the "spark." prefix are produced
by us: we prefix all of our properties with "spark." to transfer them via
Oozie and unmask them again later, so you should only need the properties
with the "spark." prefix.
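The prefix-and-unmask step can be illustrated with a small sketch. The class and method names here are hypothetical, not the actual GfK implementation: the idea is simply that properties carry a "spark." prefix through Oozie and have it stripped again on the other side.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating the "unmask" step described above:
// keys that were prefixed with "spark." for the Oozie hand-off get their
// original names restored; already-unprefixed keys pass through unchanged.
class PropertyMasker {
    static final String PREFIX = "spark.";

    static Map<String, String> unmask(Map<String, String> masked) {
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, String> e : masked.entrySet()) {
            String key = e.getKey();
            String original = key.startsWith(PREFIX)
                    ? key.substring(PREFIX.length())
                    : key;
            out.put(original, e.getValue());
        }
        return out;
    }
}
```

This would explain why both `driver.extraClassPath` and `spark.driver.extraClassPath` appear in the property list above: one is the masked transport form, the other the unmasked result.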
Regarding [~riteshtijoriwala]'s questions: 1) Please see the related issue
SPARK-1107. 2) You can add a TaskCompletionListener with
`org.apache.spark.TaskContext#addTaskCompletionListener(org.apache.spark.util.TaskCompletionListener)`.
To get the current TaskContext on an executor, just use
`org.apache.spark.TaskContext#get`. We have some functionality that logs the
progress of a function at fixed intervals (e.g. every 1,000 records); to do
this, you can use mapPartitions with a custom iterator.
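The custom-iterator idea can be sketched without Spark dependencies as follows. The class name is made up; inside Spark, an instance of such a wrapper would be returned from the function passed to `mapPartitions`, wrapping the partition's own iterator.

```java
import java.util.Iterator;

// Wraps a partition's iterator and emits a progress message every
// `interval` records while passing each record through unchanged.
class ProgressIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private final long interval;
    private long count = 0;

    ProgressIterator(Iterator<T> delegate, long interval) {
        this.delegate = delegate;
        this.interval = interval;
    }

    @Override
    public boolean hasNext() {
        return delegate.hasNext();
    }

    @Override
    public T next() {
        T record = delegate.next();
        if (++count % interval == 0) {
            System.out.println("processed " + count + " records");
        }
        return record;
    }

    long processed() {
        return count;
    }
}
```

With Spark this would be used roughly as `rdd.mapPartitions(it -> new ProgressIterator<>(it, 1000))`, so the logging happens lazily as the partition is consumed, without materializing it.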
> Add a "setup hook" API for running initialization code on each executor
> -----------------------------------------------------------------------
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Matei Zaharia
> Priority: Minor
>
> Would be useful to configure things like reporting libraries
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]