Sean Owen commented on SPARK-650:

Reopening doesn't do anything by itself, or cause anyone to consider this. If 
this just sits for another year, it will have been a tiny part of a larger 
problem. I would ask those who want to keep this open to advance the discussion; 
otherwise I think you'd agree it should eventually be closed. (Here I'm really 
speaking about the hundreds of issues like this, not so much this one in 
particular.)

Part of the problem is that I don't think the details of this feature request 
were ever elaborated. If you dig into what it would mean, I think you'd find 
that a) it's tricky to define and then implement all the right semantics, and 
b) in my experience, almost any use case along these lines is resolved as I 
suggest, with simple per-JVM initialization. If the response lately is "well, 
we're not quite sure how that works," then we need to get to the bottom of 
that, not just insist that the issue stay open.
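
To make the per-JVM initialization suggestion concrete, here is a minimal 
sketch (the `Reporting` object and its methods are made up for illustration, 
not a Spark API): a Scala object's body runs at most once per 
JVM/classloader, so on Spark the first task on each executor triggers the 
setup and later tasks see it already done.

```scala
// Hypothetical per-JVM initializer. A Scala `lazy val` is evaluated at most
// once per classloader, so on Spark each executor JVM initializes once.
object Reporting {
  @volatile private var initCount = 0

  // Stand-in for configuring a reporting library; runs once, thread-safely,
  // because `lazy val` initialization is synchronized.
  private lazy val initialized: Boolean = {
    initCount += 1
    true
  }

  // Task code calls this; it's a no-op after the first call in this JVM.
  def ensureInitialized(): Unit = initialized

  def timesInitialized: Int = initCount
}

object PerJvmInitDemo {
  def main(args: Array[String]): Unit = {
    // Simulate many tasks hitting the same JVM: init happens exactly once.
    (1 to 100).foreach(_ => Reporting.ensureInitialized())
    println(Reporting.timesInitialized)  // 1
  }
}
```

In a real job you'd call `Reporting.ensureInitialized()` at the top of the 
function passed to `mapPartitions` (or similar), so each executor initializes 
lazily on its first task.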

To your points:

- The executor loads user code into one classloader, so we effectively have 
executor = JVM = classloader.
- You can fail as fast as you like by invoking this init as early as you like 
in your app.
- It's clear where things execute; we have to assume app developers understand 
this, or all bets are off. The driver program executes things in the driver 
unless they're part of a distributed map() etc. operation, which clearly 
executes on the executors.

These IMHO aren't reasons to design a new, different, bespoke mechanism. That 
has a cost too, if you're positing that it's hard to understand when things run.

The one catch I see is that, by design, we don't control which tasks run on 
which executors, so we can't guarantee init code runs on all executors this 
way. But is it meaningful to initialize an executor that never sees an app's 
tasks? It can't be. Lazy init is a good thing and compatible with the Spark 
model. If startup time is an issue (and I'm still not clear on the latency 
problem mentioned above), it gets a little more complicated, but that's also 
more niche: just run a dummy mapPartitions at the outset over the same data 
the first job would touch, even asynchronously alongside other driver activity 
if you like. No need to wait; it just gives the init a head start on the 
executors that will need it right away.
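
The asynchronous warm-up can be sketched locally without a real cluster (the 
`warmUp` helper, the partition count, and the counter are all invented for 
illustration): touch each partition once, doing no real work, so the lazy init 
gets a head start while the driver continues with other setup.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object WarmUpDemo {
  @volatile var warmedPartitions = 0

  // Stand-in for the per-executor lazy init a task would trigger.
  def ensureInitialized(): Unit = synchronized { warmedPartitions += 1 }

  // Local analog of a dummy pass over every partition: each "partition"
  // triggers the init once and otherwise does nothing.
  def warmUp(numPartitions: Int): Future[Unit] = Future {
    (1 to numPartitions).foreach(_ => ensureInitialized())
  }

  def main(args: Array[String]): Unit = {
    // Kick off the warm-up asynchronously; the driver keeps working.
    val f = warmUp(numPartitions = 4)
    // ... other driver-side setup would happen here ...
    Await.result(f, 10.seconds)
    println(warmedPartitions)  // 4
  }
}
```

On a real SparkContext the warm-up would be roughly 
`rdd.foreachPartition(_ => ensureInitialized())`, wrapped in a `Future` if you 
want it to run alongside other driver work.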

That's just my opinion, of course, but I think those are the questions that 
would need to be answered to argue that something should happen here.

> Add a "setup hook" API for running initialization code on each executor
> -----------------------------------------------------------------------
>                 Key: SPARK-650
>                 URL: https://issues.apache.org/jira/browse/SPARK-650
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Matei Zaharia
>            Priority: Minor
> Would be useful to configure things like reporting libraries

This message was sent by Atlassian JIRA
