[https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963304#comment-15963304]
Ryan Williams edited comment on SPARK-650 at 4/10/17 6:42 PM:
--------------------------------------------------------------
Both workarounds suggested here are either insufficient or actively harmful,
afaict, and the use case is real and valid.
The ADAM project struggled for >2 years with this problem:
- [a 3rd-party {{OutputFormat}} required this field to be
set|https://github.com/HadoopGenomics/Hadoop-BAM/blob/eb688fb90c60e8c956f9d1e4793fea01e3164056/src/main/java/org/seqdoop/hadoop_bam/KeyIgnoringAnySAMOutputFormat.java#L93]
- the value of the field is computed on the driver and somehow needs to be
sent to, and set in, each executor JVM (the shape of the problem is sketched
below).
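A minimal sketch of that shape, with hypothetical names standing in for
Hadoop-BAM's actual API: the {{OutputFormat}} is instantiated by Hadoop on the
executor, so a driver-computed value can only reach it through JVM-global
state.
{code:scala}
// Hypothetical stand-in for the linked Hadoop-BAM field, not its real API:
// a JVM-global slot that the OutputFormat reads when it runs on an executor.
object HeaderHolder {
  @volatile var header: String = null  // computed on the driver

  // Mirrors the linked check: fail fast if this executor JVM was never set up.
  def get: String =
    Option(header).getOrElse(sys.error("header not set in this JVM"))
}
{code}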
h3. {{mapPartitions}} hack
[Some attempts to set the field via a dummy {{mapPartitions}}
job|https://github.com/hammerlab/adam/blob/b87bfb72c7411b5ea088b12334aa1b548102eb4b/adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDFunctions.scala#L134-L146]
actually added [pernicious, non-deterministic
bugs|https://github.com/bigdatagenomics/adam/issues/676#issuecomment-219347677].
In general, Spark seems to provide no guarantee that at least one task will be
scheduled on each executor in such a situation:
- in the above, node locality resulted in some executors being missed entirely
- with dynamic allocation, executors can also come online after the dummy job
has run and never be initialized
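For concreteness, a hedged sketch of what such a dummy job looks like (not
ADAM's actual code; {{sc}} is a live {{SparkContext}},
{{driverComputedHeader}} stands in for the driver-side value, and
{{HeaderHolder}} is the placeholder object sketched above):
{code:scala}
// Hedged sketch of the "dummy mapPartitions" hack, not ADAM's actual code:
// run a throwaway job with many small partitions and hope that every
// executor picks up at least one task, setting the JVM-global field as a
// side effect. As noted above, Spark makes no such scheduling guarantee.
val headerBroadcast = sc.broadcast(driverComputedHeader)
val numSlots = sc.defaultParallelism * 4  // oversubscribe to improve the odds
sc.parallelize(1 to numSlots, numSlots)
  .mapPartitions { iter =>
    HeaderHolder.header = headerBroadcast.value  // runs once per task
    iter
  }
  .count()  // force evaluation
{code}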
h3. object/singleton initialization
How can one use singleton initialization to pass an object from the driver to
each executor? Maybe I've missed this in the discussion above.
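For context, here's why plain singleton initialization doesn't seem to help:
the initializer runs independently in each executor JVM and has no handle on
values computed on the driver at runtime (sketch, with illustrative names):
{code:scala}
// A lazy singleton initializes on first use in each JVM, but its initializer
// only sees compile-time constants and the local environment; there's no
// channel here for a value computed on the driver at runtime.
object LazyInit {
  lazy val header: String = ???  // can't reach the driver-computed value
}
{code}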
In the end, ADAM decided to write the object to a file and route that file's
path to the {{OutputFormat}} via a Hadoop configuration value, which is pretty
inelegant (sketched below).
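A hedged sketch of that file-based workaround (illustrative key and helper
names, not ADAM's actual code):
{code:scala}
import java.nio.charset.StandardCharsets.UTF_8
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Illustrative configuration key, not ADAM's actual constant.
val HEADER_PATH_KEY = "myjob.header.path"

// Driver side: persist the driver-computed value somewhere every executor
// can read, then route its path through the Hadoop configuration.
def stashHeader(header: String, dir: Path, conf: Configuration): Unit = {
  val path = new Path(dir, "header")
  val out = path.getFileSystem(conf).create(path)
  out.write(header.getBytes(UTF_8))
  out.close()
  conf.set(HEADER_PATH_KEY, path.toString)
}

// Executor side (e.g. inside the OutputFormat): read it back on demand.
def loadHeader(conf: Configuration): String = {
  val path = new Path(conf.get(HEADER_PATH_KEY))
  val in = path.getFileSystem(conf).open(path)
  try scala.io.Source.fromInputStream(in, "UTF-8").mkString
  finally in.close()
}
{code}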
h3. Another use case
I have another need for this at the moment, where regular lazy object
initialization is also insufficient: [due to a rough edge in Scala programs'
classloader configuration, {{FileSystemProvider}}s in user JARs are not loaded
properly|https://github.com/scala/bug/issues/10247].
[A workaround discussed in the 1st post on that issue fixes the
problem|https://github.com/hammerlab/spark-commands/blob/1.0.3/src/main/scala/org/hammerlab/commands/FileSystems.scala#L8-L20],
but it needs to run before {{FileSystemProvider.installedProviders}} is first
called in the JVM, which any number of {{java.nio.file}} operations can
trigger.
I don't see a clear way to inject code that will always lazily call my
{{FileSystems.load}} function on each executor, let alone ensure that it
happens before any code in the JAR calls e.g.
{{Paths.get}}.
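For illustration, a hedged sketch of the shape of that workaround (not the
exact linked code): run {{FileSystemProvider}}'s one-time provider scan while
the thread's context classloader can see the user JAR.
{code:scala}
import java.nio.file.spi.FileSystemProvider

// Hedged sketch of the linked workaround, not the exact code: temporarily
// point the context classloader at one that can see the user JAR, then
// trigger FileSystemProvider's one-time ServiceLoader scan.
object FileSystems {
  def load(): Unit = {
    val thread = Thread.currentThread()
    val previous = thread.getContextClassLoader
    thread.setContextClassLoader(getClass.getClassLoader)
    try
      FileSystemProvider.installedProviders()  // scan runs once per JVM
    finally
      thread.setContextClassLoader(previous)
  }
}
{code}
The hard part, per the above, is guaranteeing this runs on every executor
before anything touches {{java.nio.file}}.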
> Add a "setup hook" API for running initialization code on each executor
> -----------------------------------------------------------------------
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Matei Zaharia
> Priority: Minor
>
> Would be useful to configure things like reporting libraries