We are developing an RDD (and later a DataSource on top of it) to access
distributed data in our Spark cluster, and we want to achieve co-location
of the tasks working on the data with their source data partitions.

Overriding RDD.getPreferredLocations should be the way to achieve that, so
that each RDD partition can indicate on which server it should be
processed. Unfortunately, there seems to be no clearly defined way in which
Spark identifies the server on which an Executor is running: with the Mesos
cluster manager, Spark tracks Executors by fully qualified host name; with
the standalone cluster manager, it used the IP address in Spark 1.5, and in
Spark 1.6 this seems to have changed to a host name.
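
For concreteness, here is a minimal sketch of the kind of override we have
in mind (MyPartition, its host field, and the empty compute are placeholders
for our own partition metadata and data access):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type carrying the host that stores its data.
case class MyPartition(index: Int, host: String) extends Partition

class MyRDD(sc: SparkContext, hosts: Seq[String])
    extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    hosts.zipWithIndex.map { case (h, i) => MyPartition(i, h) }.toArray

  // Tell the scheduler where each partition's data lives. The open
  // question is what string format to return here.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[MyPartition].host)

  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    Iterator.empty // actual data access elided
}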

The code in the Spark task scheduler that matches preferred locations to
Executors does a map lookup, which requires that the string specified as a
preferred location exactly matches the host name provided by the Executor.
If the formats don't match, task locality handling does not work at all,
yet there seems to be no "standard" format for the location. So how can one
write a custom RDD overriding getPreferredLocations that works without
depending on the specifics of a concrete Spark setup?
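
The only workaround we can think of is to return several plausible
spellings of the location and hope that one of them matches what the
scheduler tracks internally; a sketch, assuming the data node's name
resolves via DNS:

import java.net.InetAddress

// Return the configured name, the canonical host name, and the IP
// address for a data node, so that at least one entry hopefully matches
// whatever string the cluster manager reports. A workaround, not a fix.
def candidateLocations(host: String): Seq[String] = {
  val addr = InetAddress.getByName(host)
  Seq(host, addr.getCanonicalHostName, addr.getHostAddress).distinct
}

Returning several strings for one physical host should be harmless, since
getPreferredLocations may return multiple locations anyway (HDFS-backed
RDDs do so for replicas), but it still fails if the scheduler tracks the
host under yet another spelling.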

There also seems to be no way for user code to access the internal
scheduler state that tracks executors by host. Even the host name reported
to a SparkListener via the ExecutorAdded event is apparently not reliably
the same value that the scheduler uses internally for the lookup.
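
The best we have managed for diagnosis is to log what each executor
registers with, e.g. via a listener like the following sketch; but as said,
the host string seen here does not reliably match the scheduler's internal
key:

import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded}

// Logs the host string each executor registers with. Useful to see
// whether a given cluster manager reports an IP or a host name, but not
// guaranteed to be the value the scheduler keys its lookup map on.
class ExecutorHostLogger extends SparkListener {
  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit =
    println(s"Executor ${event.executorId} on host ${event.executorInfo.executorHost}")
}

// registered via sc.addSparkListener(new ExecutorHostLogger())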



