[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963304#comment-15963304
 ] 

Ryan Williams commented on SPARK-650:
-------------------------------------

As far as I can tell, both workarounds suggested here are inadequate, broken, 
or actively harmful, and the use case is real and valid.

The ADAM project struggled for >2 years with this problem:

- [a 3rd-party {{OutputFormat}} required this field to be 
set|https://github.com/HadoopGenomics/Hadoop-BAM/blob/eb688fb90c60e8c956f9d1e4793fea01e3164056/src/main/java/org/seqdoop/hadoop_bam/KeyIgnoringAnySAMOutputFormat.java#L93]
- the value of the field is computed on the driver, and needs to somehow be 
sent to and set in each executor JVM.

h3. {{mapPartitions}} hack

[Some attempts to set the field via a dummy {{mapPartitions}} 
job|https://github.com/hammerlab/adam/blob/b87bfb72c7411b5ea088b12334aa1b548102eb4b/adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDFunctions.scala#L134-L146]
 actually added [pernicious, non-deterministic 
bugs|https://github.com/bigdatagenomics/adam/issues/676#issuecomment-219347677].

In general, Spark seems to provide no guarantee that at least one task will be 
scheduled on each executor in such a situation:

- in the case above, node locality caused some executors to be missed
- with dynamic allocation, executors can also come online later and never be 
initialized
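
To make the two failure modes concrete, here is a toy model in plain Java. It is not Spark's scheduler: the {{Executor}} class, the host names, and the placement rule in {{runDummyJob}} are all made up for illustration. It only shows how a one-off "setup" job can finish while leaving some executors untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DummyJobCoverageSketch {
    // Hypothetical stand-in for an executor JVM whose field must be set.
    static final class Executor {
        final String host;
        boolean initialized = false;
        Executor(String host) { this.host = host; }
    }

    // Toy locality-aware placement (an assumption, not Spark's algorithm):
    // each task runs on the first executor on its preferred host,
    // falling back to the first executor in the list.
    static void runDummyJob(List<Executor> executors, List<String> preferredHosts) {
        for (String host : preferredHosts) {
            Executor target = executors.get(0);
            for (Executor e : executors) {
                if (e.host.equals(host)) { target = e; break; }
            }
            target.initialized = true; // the side effect the dummy task exists for
        }
    }

    // Hosts of executors that the dummy job never reached.
    static List<String> uninitialized(List<Executor> executors) {
        List<String> missed = new ArrayList<>();
        for (Executor e : executors) {
            if (!e.initialized) missed.add(e.host);
        }
        return missed;
    }

    public static void main(String[] args) {
        List<Executor> executors = new ArrayList<>();
        executors.add(new Executor("host-a"));
        executors.add(new Executor("host-b"));
        executors.add(new Executor("host-c"));

        // Four partitions, but all of the data lives on host-a and host-b.
        runDummyJob(executors, Arrays.asList("host-a", "host-b", "host-a", "host-b"));

        // Dynamic allocation: an executor that joins after the dummy job ran.
        executors.add(new Executor("host-d"));

        System.out.println(uninitialized(executors)); // prints [host-c, host-d]
    }
}
```

Adding more partitions to the dummy job does not help in this model, since placement follows locality rather than executor count, and nothing reruns the job when a new executor appears.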

h3. object/singleton initialization

How can one use singleton initialization to pass an object from the driver to 
each executor? Maybe I've missed this in the discussion above.

In the end, ADAM decided to write the object to a file and route that file's 
path to the {{OutputFormat}} via a Hadoop configuration value, which is pretty 
inelegant.
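
The shape of that workaround, as a rough sketch and not ADAM's actual code: {{java.util.Properties}} stands in for the Hadoop configuration, a local temp file stands in for a file on storage that every executor can read, and the key name and both method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

public class HeaderViaConfigSketch {
    // Hypothetical key; not a real Hadoop, Hadoop-BAM, or ADAM property name.
    static final String HEADER_PATH_KEY = "example.output.header.path";

    // Driver side: persist the driver-computed value and put only its path
    // into the configuration that travels with the job.
    static void publishHeader(String header, Path file, Properties conf) throws IOException {
        Files.write(file, header.getBytes(StandardCharsets.UTF_8));
        conf.setProperty(HEADER_PATH_KEY, file.toString());
    }

    // Executor side: the output format reads the value back when it needs it,
    // so no per-executor setup step has to have run beforehand.
    static String loadHeader(Properties conf) throws IOException {
        String path = conf.getProperty(HEADER_PATH_KEY);
        if (path == null) {
            throw new IllegalStateException("missing " + HEADER_PATH_KEY);
        }
        return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("header", ".txt");
        Properties conf = new Properties();
        publishHeader("@HD\tVN:1.5", file, conf);
        System.out.println(loadHeader(conf));
        Files.delete(file);
    }
}
```

This works because the configuration already reaches every task, including tasks on executors that start late. The cost is an extra file to write, read, and clean up for what is conceptually a single field assignment.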

> Add a "setup hook" API for running initialization code on each executor
> -----------------------------------------------------------------------
>
>                 Key: SPARK-650
>                 URL: https://issues.apache.org/jira/browse/SPARK-650
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Matei Zaharia
>            Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
