[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15579928#comment-15579928
 ] 

Sean Owen commented on SPARK-650:
---------------------------------

Yeah that's a decent use case, because latency is an issue (streaming) and you 
potentially have time to set up before latency matters. 

You can still use this approach, because empty RDDs arrive when no data has, 
and empty RDDs can still be repartitioned. Here's a way to do something once 
per partition, one time only, if the first RDD has no data. That ought to 
amount to at least once per executor:

{code}
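// Driver-side flag: the foreachRDD body runs on the driver once per batch.
var first = true
lines.foreachRDD { rdd =>
  if (first) {
    if (rdd.isEmpty) {
      // Spread the empty RDD over defaultParallelism partitions and run the
      // setup in each one.
      rdd.repartition(sc.defaultParallelism).foreachPartition(_ => Thing.initOnce())
    }
    first = false
  }
}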
{code}

"Ought", because there isn't actually a guarantee that it will put the empty 
partitions on different executors. In practice it seems to; it did when I just 
tried it.
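
Since several of those partitions can end up on the same executor, 
Thing.initOnce() should tolerate being called more than once in the same JVM. 
A minimal sketch of such a guard, assuming Thing is a plain Scala object 
(a placeholder name here, not a Spark API), with a counter standing in for the 
real setup work:

```scala
// Hypothetical stand-in for the `Thing` referenced above; not part of Spark.
object Thing {
  private var initialized = false
  private var runs = 0

  // How many times the setup body actually ran (for illustration only).
  def setupRuns: Int = synchronized { runs }

  // Safe to call from many tasks in one JVM: the lock makes later callers
  // wait until the first call has finished, then they return without redoing it.
  def initOnce(): Unit = synchronized {
    if (!initialized) {
      runs += 1 // placeholder for the real setup work
      initialized = true
    }
  }
}
```

With this, calling Thing.initOnce() from, say, four partitions that share one 
executor runs the setup body a single time in that JVM.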

That's a partial solution, but it's an optimization anyway, and maybe it helps 
you right now. I am still not sure this justifies a whole mechanism, if this 
is the only type of use case. Maybe there are others.

> Add a "setup hook" API for running initialization code on each executor
> -----------------------------------------------------------------------
>
>                 Key: SPARK-650
>                 URL: https://issues.apache.org/jira/browse/SPARK-650
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Matei Zaharia
>            Priority: Minor
>
> Would be useful to configure things like reporting libraries



