[https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963304#comment-15963304]
Ryan Williams edited comment on SPARK-650 at 4/10/17 6:42 PM:
--------------------------------------------------------------
Both workarounds suggested here are either insufficient or actively harmful,
afaict, and the use case is real and valid.
The ADAM project struggled for >2 years with this problem:
- [a 3rd-party {{OutputFormat}} required this field to be
set|https://github.com/HadoopGenomics/Hadoop-BAM/blob/eb688fb90c60e8c956f9d1e4793fea01e3164056/src/main/java/org/seqdoop/hadoop_bam/KeyIgnoringAnySAMOutputFormat.java#L93]
- the value of the field is computed on the driver and somehow needs to be
sent to, and set in, each executor JVM (the shape of the problem is sketched
below).
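A minimal sketch of that shape, with hypothetical names standing in for
Hadoop-BAM's actual API: the {{OutputFormat}} is instantiated by Hadoop on the
executor, so a driver-computed value can only reach it through JVM-global
state.
{code:scala}
// Hypothetical stand-in for the linked Hadoop-BAM field, not its real API:
// a JVM-global slot that the OutputFormat reads when it runs on an executor.
object HeaderHolder {
  @volatile var header: String = null  // computed on the driver

  // Mirrors the linked check: fail fast if this executor JVM was never set up.
  def get: String =
    Option(header).getOrElse(sys.error("header not set in this JVM"))
}
{code}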
h3. {{mapPartitions}} hack
[Some attempts to set the field via a dummy {{mapPartitions}}
job|https://github.com/hammerlab/adam/blob/b87bfb72c7411b5ea088b12334aa1b548102eb4b/adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDFunctions.scala#L134-L146]
actually added [pernicious, non-deterministic
bugs|https://github.com/bigdatagenomics/adam/issues/676#issuecomment-219347677].
In general, Spark seems to provide no guarantee that at least one task will be
scheduled on each executor in such a situation:
- in the above, node locality resulted in some executors being missed entirely
- with dynamic allocation, executors can also come online after the dummy job
has run and never be initialized
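For concreteness, a hedged sketch of what such a dummy job looks like (not
ADAM's actual code; {{sc}} is a live {{SparkContext}},
{{driverComputedHeader}} stands in for the driver-side value, and
{{HeaderHolder}} is the placeholder object sketched above):
{code:scala}
// Hedged sketch of the "dummy mapPartitions" hack, not ADAM's actual code:
// run a throwaway job with many small partitions and hope that every
// executor picks up at least one task, setting the JVM-global field as a
// side effect. As noted above, Spark makes no such scheduling guarantee.
val headerBroadcast = sc.broadcast(driverComputedHeader)
val numSlots = sc.defaultParallelism * 4  // oversubscribe to improve the odds
sc.parallelize(1 to numSlots, numSlots)
  .mapPartitions { iter =>
    HeaderHolder.header = headerBroadcast.value  // runs once per task
    iter
  }
  .count()  // force evaluation
{code}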
h3. object/singleton initialization
How can one use singleton initialization to pass an object from the driver to
each executor? Maybe I've missed this in the discussion above.
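For context, here's why plain singleton initialization doesn't seem to help:
the initializer runs independently in each executor JVM and has no handle on
values computed on the driver at runtime (sketch, with illustrative names):
{code:scala}
// A lazy singleton initializes on first use in each JVM, but its initializer
// only sees compile-time constants and the local environment; there's no
// channel here for a value computed on the driver at runtime.
object LazyInit {
  lazy val header: String = ???  // can't reach the driver-computed value
}
{code}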
In the end, ADAM decided to write the object to a file and route that file's
path to the {{OutputFormat}} via a Hadoop configuration value, which is pretty
inelegant (sketched below).
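A hedged sketch of that file-based workaround (illustrative key and helper
names, not ADAM's actual code):
{code:scala}
import java.nio.charset.StandardCharsets.UTF_8
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Illustrative configuration key, not ADAM's actual constant.
val HEADER_PATH_KEY = "myjob.header.path"

// Driver side: persist the driver-computed value somewhere every executor
// can read, then route its path through the Hadoop configuration.
def stashHeader(header: String, dir: Path, conf: Configuration): Unit = {
  val path = new Path(dir, "header")
  val out = path.getFileSystem(conf).create(path)
  out.write(header.getBytes(UTF_8))
  out.close()
  conf.set(HEADER_PATH_KEY, path.toString)
}

// Executor side (e.g. inside the OutputFormat): read it back on demand.
def loadHeader(conf: Configuration): String = {
  val path = new Path(conf.get(HEADER_PATH_KEY))
  val in = path.getFileSystem(conf).open(path)
  try scala.io.Source.fromInputStream(in, "UTF-8").mkString
  finally in.close()
}
{code}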
h3. Another use case
I have another need for this at the moment, where regular lazy object
initialization is also insufficient: [due to a rough edge in Scala programs'
classloader configuration, {{FileSystemProvider}}s in user JARs are not loaded
properly|https://github.com/scala/bug/issues/10247].
[A workaround discussed in the 1st post on that issue fixes the
problem|https://github.com/hammerlab/spark-commands/blob/1.0.3/src/main/scala/org/hammerlab/commands/FileSystems.scala#L8-L20],
but it needs to run before {{FileSystemProvider.installedProviders}} is first
called in the JVM, which any number of {{java.nio.file}} operations can
trigger.
I don't see a clear way to inject code that will always lazily call my
{{FileSystems.load}} function on each executor, let alone ensure that it
happens before any code in the JAR calls e.g.
{{Paths.get}}.
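For illustration, a hedged sketch of the shape of that workaround (not the
exact linked code): run {{FileSystemProvider}}'s one-time provider scan while
the thread's context classloader can see the user JAR.
{code:scala}
import java.nio.file.spi.FileSystemProvider

// Hedged sketch of the linked workaround, not the exact code: temporarily
// point the context classloader at one that can see the user JAR, then
// trigger FileSystemProvider's one-time ServiceLoader scan.
object FileSystems {
  def load(): Unit = {
    val thread = Thread.currentThread()
    val previous = thread.getContextClassLoader
    thread.setContextClassLoader(getClass.getClassLoader)
    try
      FileSystemProvider.installedProviders()  // scan runs once per JVM
    finally
      thread.setContextClassLoader(previous)
  }
}
{code}
The hard part, per the above, is guaranteeing this runs on every executor
before anything touches {{java.nio.file}}.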
> Add a "setup hook" API for running initialization code on each executor
> -----------------------------------------------------------------------
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Matei Zaharia
> Priority: Minor
>
> Would be useful to configure things like reporting libraries