Andrew Ash created SPARK-3526:
---------------------------------

             Summary: Section on data locality
                 Key: SPARK-3526
                 URL: https://issues.apache.org/jira/browse/SPARK-3526
             Project: Spark
          Issue Type: Documentation
          Components: Documentation
    Affects Versions: 1.0.2
            Reporter: Andrew Ash


Several threads on the mailing list have been about data locality and how to 
interpret PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, etc.  Let's get some more 
details in the docs on this concept so we can point future questions there.

A couple of people appreciated the description of locality below, so it could 
be a good starting point:

{quote}
Locality is how close the data is to the code that processes it.  
PROCESS_LOCAL means the data is in the same JVM as the running code, so it's 
really fast.  NODE_LOCAL might mean that the data is in HDFS on the same node, 
or in another executor on the same node, so it is a little slower because the 
data has to travel across an IPC connection.  RACK_LOCAL is even slower: the 
data is on a different server in the same rack, so it needs to be sent over 
the network.

Spark switches to lower locality levels when there's no unprocessed data on a 
node that has idle CPUs.  In that situation you have two options: wait until 
a busy CPU frees up on the server that holds the data, so that a task can 
start there, or start the task right away on the idle, farther-away server, 
which then has to fetch the data remotely.  What Spark typically does is wait 
a bit in the hope that a busy CPU frees up.  Once that timeout expires, it 
starts moving the data from far away to the free CPU.
{quote}
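
To make the wait-then-fall-back behaviour described above concrete, here is a 
minimal sketch in Python. It is a toy model, not Spark's scheduler code: the 
names Locality, allowed_level and should_launch are hypothetical, and it 
assumes one fixed wait per level. The 3-second value is meant to mirror the 
default of the spark.locality.wait setting as I understand it; ANY is the 
level beyond RACK_LOCAL.

```python
from enum import IntEnum


class Locality(IntEnum):
    # Ordered from closest to farthest; a lower value means better locality.
    PROCESS_LOCAL = 0
    NODE_LOCAL = 1
    RACK_LOCAL = 2
    ANY = 3


def allowed_level(waited_s, wait_per_level_s=3.0):
    """Farthest level this toy scheduler accepts after waiting waited_s seconds.

    Assumption: every level uses the same timeout (wait_per_level_s).
    """
    if wait_per_level_s <= 0:
        # No waiting configured: accept any placement immediately.
        return Locality.ANY
    steps = int(waited_s // wait_per_level_s)
    return Locality(min(steps, int(Locality.ANY)))


def should_launch(offer_level, waited_s, wait_per_level_s=3.0):
    """Launch only if the offered slot is at least as local as currently allowed."""
    return offer_level <= allowed_level(waited_s, wait_per_level_s)


if __name__ == "__main__":
    # Show how the accepted level relaxes as the wait grows.
    for waited in (0.0, 3.5, 7.0, 10.0):
        print(waited, allowed_level(waited).name)
```

With these assumptions, a NODE_LOCAL slot offered right away is declined 
because the scheduler is still hoping for a PROCESS_LOCAL one, but the same 
offer is accepted once the first timeout has passed.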



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

