Github user mmalohlava commented on the pull request:

    https://github.com/apache/spark/pull/2691#issuecomment-61567145

Sorry for the delayed answer. I was trying to come up with a better solution that does not require modifying Spark.

However, regarding Sean's questions:

* In our case we need to collect the actual distributed state of the cluster (the approximate number of executors) to properly initialize services on all available executors. The big picture of our use case: the proposed solution starts a defined service on each executor, the service exchanges info with the master (collecting the number of available executors plus the executor ids), and based on that we reconfigure the services in the cluster (they require the number of available Spark executors).
* I do not see a major security problem in class loading, since Spark already loads classes in the executor from the class path specified via the `--jars` and `--files` parameters. The proposed solution uses the same mechanism.

Nevertheless, in the meantime I was experimenting with a solution based on Patrick's idea. It works in the following way:

* create a dummy RDD with a lot of partitions (i.e., trying to force the scheduler to plan execution on all available executors)
* run a `map` op on the RDD to collect the unique executor ids and the approximate number of executors
* run another `map` which starts our service only on the collected executors

*The advantage of this solution:*

* it does not need any modification of the Spark infrastructure

*The major disadvantages of this solution:*

* it depends directly on task scheduling; in the worst case the scheduler will plan execution of the initialization on only 1 of the available executors
* it is a hidden solution which does not expose the running services, and it collects only an approximation of the cluster state
* it has the overhead of creating a dummy RDD with many partitions and running two map operations

From my point of view, it would be much cleaner and more beneficial to have a solution which explicitly allows for interception of the executor lifecycle.
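For reference, the dummy-RDD workaround described above could look roughly like the following. This is only a minimal sketch, not the code from the pull request: `ServiceHolder`, `startOnce`, and the partition count of 1000 are hypothetical names and values chosen for illustration, and the second pass uses `foreachPartition` (an action) in place of the second `map`, so that no extra action is needed to trigger it. It assumes `SparkEnv.get.executorId` is available inside tasks.

```scala
import java.util.concurrent.atomic.AtomicBoolean

import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

// Hypothetical per-JVM holder: a Scala object is initialized once per
// executor JVM, so the flag guards against starting the service twice.
object ServiceHolder {
  private val started = new AtomicBoolean(false)

  def startOnce(executorId: String, executorCount: Int): Unit = {
    if (started.compareAndSet(false, true)) {
      // Placeholder for the real service start-up (assumption).
      println(s"Starting service on executor $executorId of $executorCount")
    }
  }
}

object ExecutorDiscoverySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("executor-discovery-sketch")
    val sc = new SparkContext(conf)

    // Dummy RDD with many more partitions than expected executors, hoping
    // the scheduler places at least one task on each of them.
    val numPartitions = 1000 // assumed value, tune for the cluster
    val dummy = sc.parallelize(0 until numPartitions, numPartitions)

    // Pass 1: collect the ids of the executors that actually ran a task.
    // This is only an approximation of the cluster state.
    val executorIds: Set[String] =
      dummy.map(_ => SparkEnv.get.executorId).distinct().collect().toSet
    val executorCount = executorIds.size

    // Pass 2: start the service on the executors seen in pass 1.
    // Scheduling may differ between passes, so some executors can be missed.
    dummy.foreachPartition { _ =>
      val id = SparkEnv.get.executorId
      if (executorIds.contains(id)) {
        ServiceHolder.startOnce(id, executorCount)
      }
    }

    sc.stop()
  }
}
```

Nothing in this sketch guarantees that both passes reach the same executors, which is exactly the scheduling dependency listed among the disadvantages.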