[
https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133935#comment-14133935
]
Nicholas Chammas commented on SPARK-3526:
-----------------------------------------
FYI: Looks like the valid localities are [enumerated
here|https://github.com/apache/spark/blob/cc14644460872efb344e8d895859d70213a40840/core/src/main/scala/org/apache/spark/scheduler/TaskLocality.scala#L25].
> Docs section on data locality
> -----------------------------
>
> Key: SPARK-3526
> URL: https://issues.apache.org/jira/browse/SPARK-3526
> Project: Spark
> Issue Type: Documentation
> Components: Documentation
> Affects Versions: 1.0.2
> Reporter: Andrew Ash
> Assignee: Andrew Ash
>
> Several threads on the mailing list have been about data locality and how to
> interpret PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, etc. Let's get some more
> details in the docs on this concept so we can point future questions there.
> A couple of people appreciated the description of locality below, so it could
> be a good starting point:
> {quote}
> The locality is how close the data is to the code that's processing it.
> PROCESS_LOCAL means data is in the same JVM as the code that's running, so
> it's really fast. NODE_LOCAL might mean that the data is in HDFS on the same
> node, or in another executor on the same node, so it is a little slower
> because the data has to travel across an IPC connection. RACK_LOCAL is even
> slower -- data is on a different server in the same rack, so it needs to be
> sent over the network.
> Spark switches to lower locality levels when there's no unprocessed data on a
> node that has idle CPUs. In that situation you have two options: wait until
> the busy CPUs free up so you can start another task that uses data on that
> server, or start a new task on a farther-away server that needs to bring data
> from that remote place. What Spark typically does is wait a bit in the hopes
> that a busy CPU frees up. Once that timeout expires, it starts moving the
> data from far away to the free CPU.
> {quote}
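
To make the wait-then-fall-back behaviour described above concrete, here is a small toy model in plain Python. It is only a sketch, not Spark's scheduler code: the names Place, locality_of, allowed_level and should_launch are made up for illustration, executor ids are assumed to be unique across the cluster, and the 3-second wait per level is an assumption based on what I believe is the default for spark.locality.wait. The real scheduler also tracks its timers per task set and has more cases than this.

```python
from dataclasses import dataclass
from enum import IntEnum


class Locality(IntEnum):
    """Lower value means the data is closer to the code."""
    PROCESS_LOCAL = 0
    NODE_LOCAL = 1
    RACK_LOCAL = 2
    ANY = 3


@dataclass(frozen=True)
class Place:
    """Hypothetical description of where data or a free CPU slot lives."""
    executor: str  # assumed unique across the cluster
    host: str
    rack: str


def locality_of(data: Place, slot: Place) -> Locality:
    """Classify how close the data is to a free slot."""
    if data.executor == slot.executor:
        return Locality.PROCESS_LOCAL
    if data.host == slot.host:
        return Locality.NODE_LOCAL
    if data.rack == slot.rack:
        return Locality.RACK_LOCAL
    return Locality.ANY


def allowed_level(waited_s: float, wait_per_level_s: float = 3.0) -> Locality:
    """Loosest level accepted after waiting; one level is given up per timeout."""
    steps = int(waited_s // wait_per_level_s)
    return Locality(min(steps, int(Locality.ANY)))


def should_launch(data: Place, slot: Place, waited_s: float,
                  wait_per_level_s: float = 3.0) -> bool:
    """Launch on this slot only if it is at least as local as currently allowed."""
    return locality_of(data, slot) <= allowed_level(waited_s, wait_per_level_s)


if __name__ == "__main__":
    data = Place("exec-1", "host-a", "rack-1")
    other_rack_slot = Place("exec-4", "host-c", "rack-2")
    for waited in (0.0, 3.0, 6.0, 9.0):
        print(waited, allowed_level(waited).name,
              should_launch(data, other_rack_slot, waited))
```

In this model a free CPU on another rack is turned down until three timeouts have passed, which is the "wait a bit in the hopes that a busy CPU frees up" step. Shortening wait_per_level_s trades locality for less idle time.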