Github user wulei-bj-cn commented on the pull request:
https://github.com/apache/spark/pull/8533#issuecomment-136570529
Dear Owen, thanks for checking my updates. I'm not saying that the problem of
the locality level always being ANY is caused by your code. It actually comes
from this code in org.apache.spark.scheduler.TaskSetManager:
    // Check for node-local tasks
    if (TaskLocality.isAllowed(locality, TaskLocality.NODE_LOCAL)) {
      for (index <- speculatableTasks if canRunOnHost(index)) {
        val locations = tasks(index).preferredLocations.map(_.host)
        if (locations.contains(host)) {
          speculatableTasks -= index
          return Some((index, TaskLocality.NODE_LOCAL))
        }
      }
    }
The variable "locations" holds the hostnames of the HDFS splits, which come
from InetAddress.getHostName.
The variable "host" holds the IP address of an executor, which comes from
InetAddress.getHostAddress.
The value of "host" is read from org.apache.spark.deploy.worker.WorkerArguments,
where var host = Utils.localHostName()
So the trail leads to Utils.scala. I'm not saying we have to change
Utils.scala to make things work; maybe we could change code somewhere else to
make this ANY go away too. I just thought that a small change within
Utils.scala would be an easier way to do it. Or maybe I'm wrong :)
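
To make the mismatch concrete, here is a minimal sketch of the comparison that the TaskSetManager snippet performs. The hostnames and the IP address are made-up values, and isNodeLocal is a hypothetical helper that only mirrors the locations.contains(host) test; it is not Spark code.

```scala
object LocalityMismatch {
  // Hypothetical split locations, reported as hostnames.
  val locations: Seq[String] =
    Seq("node1.example.com", "node2.example.com", "node3.example.com")

  // Mirrors the `locations.contains(host)` check: a plain string comparison.
  def isNodeLocal(locations: Seq[String], host: String): Boolean =
    locations.contains(host)

  def main(args: Array[String]): Unit = {
    val hostAsIp = "10.0.0.2"             // executor identified by IP address (assumed value)
    val hostAsName = "node2.example.com"  // the same machine identified by hostname
    println(isNodeLocal(locations, hostAsIp))   // false: an IP never equals a hostname string
    println(isNodeLocal(locations, hostAsName)) // true
  }
}
```

Because the check is a string comparison, a task whose preferred locations are hostnames can never match an executor registered under an IP address, so the NODE_LOCAL branch is skipped.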
Your solution of giving the end user a "SPARK_LOCAL_HOSTNAME" option works
fine, based on the tests I ran with and without it on a 4-node Spark cluster.
No offense, but this kind of setting is not typical in popular distributed
computing systems. For deployment and maintenance, the configuration files (in
our case, the files under $SPARK_HOME/conf) should be identical on all cluster
nodes. "SPARK_LOCAL_HOSTNAME", however, necessarily differs from node to node.
That is why I'd like to introduce a new setting, "SPARK_USE_HOSTNAME", whose
value can be the same on all cluster nodes: either "true" or "false".
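
A rough sketch of how such a cluster-wide flag could behave is shown below. HostIdentity, localIdentity and useHostnameFrom are hypothetical names used only for illustration; this is not the code in the pull request or in Utils.scala, and treating an unset variable as "false" is an assumption.

```scala
import java.net.InetAddress

object HostIdentity {
  // Hypothetical helper: choose how a node identifies itself,
  // by hostname when the flag is set and by IP address otherwise.
  def localIdentity(useHostname: Boolean, address: InetAddress): String =
    if (useHostname) address.getHostName else address.getHostAddress

  // Read the proposed SPARK_USE_HOSTNAME flag from an environment map.
  // Assumption: anything other than "true" (case-insensitive) means false.
  def useHostnameFrom(env: Map[String, String]): Boolean =
    env.get("SPARK_USE_HOSTNAME").exists(_.equalsIgnoreCase("true"))

  def main(args: Array[String]): Unit = {
    val address = InetAddress.getLocalHost
    // The same setting on every node yields a node-specific identity.
    println(localIdentity(useHostnameFrom(sys.env), address))
  }
}
```

The point of the design is that the configuration value itself is identical everywhere, while the resulting identity is still resolved locally on each node.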
About the multiple NICs you mentioned, I think that is a concern the OS
should handle rather than Spark.