Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-15 Thread Andrew Ash
Hi Brad and Nick,

Thanks for the comments!  I opened a ticket to get a more thorough
explanation of data locality into the docs here:
https://issues.apache.org/jira/browse/SPARK-3526

If you could put any other unanswered questions you have about data
locality on that ticket I'll try to incorporate answers to them in the
final addition I send in.

Andrew

On Sun, Sep 14, 2014 at 6:47 PM, Brad Miller bmill...@eecs.berkeley.edu
wrote:

 Hi Andrew,

 I agree with Nicholas.  That was a nice, concise summary of the
 meaning of the locality customization options, indicators and default
 Spark behaviors.  I haven't combed through the documentation
 end-to-end in a while, but I'm also not sure that information is
 presently represented anywhere, and it would be great to persist it
 somewhere besides the mailing list.

 best,
 -Brad

 On Fri, Sep 12, 2014 at 12:12 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
  Andrew,
 
  This email was pretty helpful. I feel like this stuff should be summarized
  in the docs somewhere, or perhaps in a blog post.
 
  Do you know if it is?
 
  Nick
 
 
  On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash and...@andrewash.com wrote:
 
  The locality is how close the data is to the code that's processing it.
  PROCESS_LOCAL means the data is in the same JVM as the code that's running,
  so it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the
  same node, or in another executor on the same node, so it's a little slower
  because the data has to travel across an IPC connection.  RACK_LOCAL is even
  slower -- the data is on a different server, so it needs to be sent over the
  network.
 
  Spark switches to lower locality levels when there's no unprocessed data
  on a node that has idle CPUs.  In that situation you have two options: wait
  until the busy CPUs free up so you can start another task that uses data on
  that server, or start a new task on a farther away server that needs to
  bring data from that remote place.  What Spark typically does is wait a bit
  in the hopes that a busy CPU frees up.  Once that timeout expires, it starts
  moving the data from far away to the free CPU.
 
  The main tunable option is how long the scheduler waits before
  starting to move data rather than code.  Those are the spark.locality.*
  settings here: http://spark.apache.org/docs/latest/configuration.html
 
  If you want to prevent this from happening entirely, you can set the
  values to ridiculously high numbers.  The documentation also mentions that
  0 has special meaning, so you can try that as well.
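
  For concreteness, here is a minimal, untested sketch of how those settings
  could be applied programmatically (the config keys are real; the app name
  and values are arbitrary, and Spark 1.x reads the values as milliseconds):

      import org.apache.spark.{SparkConf, SparkContext}

      // Illustrative values only: very long waits effectively pin tasks
      // to their preferred locality level instead of falling back.
      val conf = new SparkConf()
        .setAppName("locality-tuning-sketch")
        .set("spark.locality.wait", "30000")          // base wait, all levels
        .set("spark.locality.wait.process", "30000")  // PROCESS_LOCAL -> NODE_LOCAL
        .set("spark.locality.wait.node", "30000")     // NODE_LOCAL -> RACK_LOCAL
        .set("spark.locality.wait.rack", "30000")     // RACK_LOCAL -> ANY
      val sc = new SparkContext(conf)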
 
  Good luck!
  Andrew
 
 
  On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung coded...@cs.stanford.edu
  wrote:
 
  I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd
  assume that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.
 
  When this happens, things get extremely slow.
 
  Does this mean that the executor got terminated and restarted?
 
  Is there a way to prevent this from happening (barring the machine
  actually going down, I'd rather stick with the same process)?
 
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-14 Thread Brad Miller
Hi Andrew,

I agree with Nicholas.  That was a nice, concise summary of the
meaning of the locality customization options, indicators and default
Spark behaviors.  I haven't combed through the documentation
end-to-end in a while, but I'm also not sure that information is
presently represented anywhere, and it would be great to persist it
somewhere besides the mailing list.

best,
-Brad

On Fri, Sep 12, 2014 at 12:12 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
 Andrew,

 This email was pretty helpful. I feel like this stuff should be summarized
 in the docs somewhere, or perhaps in a blog post.

 Do you know if it is?

 Nick


 On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash and...@andrewash.com wrote:

 The locality is how close the data is to the code that's processing it.
 PROCESS_LOCAL means the data is in the same JVM as the code that's running,
 so it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the
 same node, or in another executor on the same node, so it's a little slower
 because the data has to travel across an IPC connection.  RACK_LOCAL is even
 slower -- the data is on a different server, so it needs to be sent over the
 network.

 Spark switches to lower locality levels when there's no unprocessed data
 on a node that has idle CPUs.  In that situation you have two options: wait
 until the busy CPUs free up so you can start another task that uses data on
 that server, or start a new task on a farther away server that needs to
 bring data from that remote place.  What Spark typically does is wait a bit
 in the hopes that a busy CPU frees up.  Once that timeout expires, it starts
 moving the data from far away to the free CPU.

 The main tunable option is how long the scheduler waits before
 starting to move data rather than code.  Those are the spark.locality.*
 settings here: http://spark.apache.org/docs/latest/configuration.html

 If you want to prevent this from happening entirely, you can set the
 values to ridiculously high numbers.  The documentation also mentions that
 0 has special meaning, so you can try that as well.

 Good luck!
 Andrew


 On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung coded...@cs.stanford.edu
 wrote:

 I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd
 assume that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.

 When this happens, things get extremely slow.

 Does this mean that the executor got terminated and restarted?

 Is there a way to prevent this from happening (barring the machine
 actually going down, I'd rather stick with the same process)?




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-12 Thread Nicholas Chammas
Andrew,

This email was pretty helpful. I feel like this stuff should be summarized
in the docs somewhere, or perhaps in a blog post.

Do you know if it is?

Nick


On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash and...@andrewash.com wrote:

 The locality is how close the data is to the code that's processing it.
 PROCESS_LOCAL means the data is in the same JVM as the code that's running,
 so it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the
 same node, or in another executor on the same node, so it's a little slower
 because the data has to travel across an IPC connection.  RACK_LOCAL is
 even slower -- the data is on a different server, so it needs to be sent
 over the network.

 Spark switches to lower locality levels when there's no unprocessed data
 on a node that has idle CPUs.  In that situation you have two options: wait
 until the busy CPUs free up so you can start another task that uses data on
 that server, or start a new task on a farther away server that needs to
 bring data from that remote place.  What Spark typically does is wait a bit
 in the hopes that a busy CPU frees up.  Once that timeout expires, it
 starts moving the data from far away to the free CPU.

 The main tunable option is how long the scheduler waits before
 starting to move data rather than code.  Those are the spark.locality.*
 settings here: http://spark.apache.org/docs/latest/configuration.html

 If you want to prevent this from happening entirely, you can set the
 values to ridiculously high numbers.  The documentation also mentions that
 0 has special meaning, so you can try that as well.

 Good luck!
 Andrew


 On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung coded...@cs.stanford.edu
 wrote:

 I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd
 assume that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.

 When this happens, things get extremely slow.

 Does this mean that the executor got terminated and restarted?

 Is there a way to prevent this from happening (barring the machine
 actually going down, I'd rather stick with the same process)?





Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-12 Thread Tsai Li Ming
Another observation I had was when reading from the local filesystem with
"file://": the tasks were stated as PROCESS_LOCAL, which was confusing.
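
For example, a plain local-file read along these lines (the path is made up,
and an existing SparkContext sc is assumed) showed its tasks as PROCESS_LOCAL
in the web UI even though nothing was cached:

    // Illustrative sketch: an input read straight off the local filesystem.
    // In my runs the resulting tasks were labeled PROCESS_LOCAL.
    val lines = sc.textFile("file:///tmp/input.txt")
    println(lines.count())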

Regards,
Liming

On 13 Sep, 2014, at 3:12 am, Nicholas Chammas nicholas.cham...@gmail.com 
wrote:

 Andrew,
 
 This email was pretty helpful. I feel like this stuff should be summarized in 
 the docs somewhere, or perhaps in a blog post.
 
 Do you know if it is?
 
 Nick
 
 
 On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash and...@andrewash.com wrote:
 The locality is how close the data is to the code that's processing it.
 PROCESS_LOCAL means the data is in the same JVM as the code that's running,
 so it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the
 same node, or in another executor on the same node, so it's a little slower
 because the data has to travel across an IPC connection.  RACK_LOCAL is even
 slower -- the data is on a different server, so it needs to be sent over the
 network.
 
 Spark switches to lower locality levels when there's no unprocessed data on a 
 node that has idle CPUs.  In that situation you have two options: wait until 
 the busy CPUs free up so you can start another task that uses data on that 
 server, or start a new task on a farther away server that needs to bring data 
 from that remote place.  What Spark typically does is wait a bit in the hopes 
 that a busy CPU frees up.  Once that timeout expires, it starts moving the 
 data from far away to the free CPU.
 
 The main tunable option is how long the scheduler waits before starting
 to move data rather than code.  Those are the spark.locality.* settings here: 
 http://spark.apache.org/docs/latest/configuration.html
 
 If you want to prevent this from happening entirely, you can set the values 
 to ridiculously high numbers.  The documentation also mentions that 0 has 
 special meaning, so you can try that as well.
 
 Good luck!
 Andrew
 
 
 On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung coded...@cs.stanford.edu 
 wrote:
 I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume 
 that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.
 
 When this happens, things get extremely slow.
 
 Does this mean that the executor got terminated and restarted?
 
 Is there a way to prevent this from happening (barring the machine actually 
 going down, I'd rather stick with the same process)?
 
 



When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Sung Hwan Chung
I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume
that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.

When this happens, things get extremely slow.

Does this mean that the executor got terminated and restarted?

Is there a way to prevent this from happening (barring the machine actually
going down, I'd rather stick with the same process)?
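
For reference, by "fully cached" I mean explicit caching along these lines
(the path is illustrative, and an existing SparkContext sc is assumed):

    import org.apache.spark.storage.StorageLevel

    // Cache the dataset in executor memory so that later stages can read
    // it locally instead of going back to HDFS.
    val data = sc.textFile("hdfs:///some/path")
    data.persist(StorageLevel.MEMORY_ONLY)
    data.count()  // materializes the cache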


Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Sung Hwan Chung
On a related note, I'd also like to minimize any kind of executor movement.
I.e., once an executor is spawned and data is cached in it, I want that
executor to live all the way until the job is finished or the machine fails
in a fatal manner.

What would be the best way to ensure that this is the case?


On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung coded...@cs.stanford.edu
wrote:

 I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume
 that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.

 When this happens, things get extremely slow.

 Does this mean that the executor got terminated and restarted?

 Is there a way to prevent this from happening (barring the machine
 actually going down, I'd rather stick with the same process)?



RE: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Liu, Raymond
If a task has no locality preference, it will also show up as PROCESS_LOCAL;
I think we should probably name it NO_PREFER to make that clearer. Not sure
if this is your case.
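
For example (a minimal sketch, assuming an existing SparkContext sc), an RDD
built with parallelize carries no preferred locations, so its tasks show up
as PROCESS_LOCAL in the UI:

    // Sketch: a parallelized collection has no locality preference,
    // yet the scheduler reports its tasks as PROCESS_LOCAL.
    val noPref = sc.parallelize(1 to 1000000)
    noPref.map(_ * 2).count()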

Best Regards,
Raymond Liu

From: coded...@gmail.com [mailto:coded...@gmail.com] On Behalf Of Sung Hwan 
Chung
Sent: Friday, June 06, 2014 6:53 AM
To: user@spark.apache.org
Subject: Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or 
RACK_LOCAL?

Additionally, I've encountered a confusing situation where the locality
level for a task showed up as 'PROCESS_LOCAL' even though I didn't cache the
data. I wonder if some implicit caching happens even without the user
specifying it.

On Thu, Jun 5, 2014 at 3:50 PM, Sung Hwan Chung coded...@cs.stanford.edu
wrote:
Thanks Andrew,

Is there a chance that, even with full caching, modes other than
PROCESS_LOCAL will be used? E.g., an executor tries to perform tasks even
though the data is cached on a different executor.

What I'd like to do is to prevent such a scenario entirely.

I'd like to know if setting 'spark.locality.wait' to a very high value would 
guarantee that the mode will always be 'PROCESS_LOCAL'.
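
(A minimal sketch of that experiment; the value is arbitrary, and Spark 1.x
reads it as milliseconds:)

    import org.apache.spark.SparkConf

    // Make the fallback timeout effectively infinite so the scheduler
    // keeps waiting for a PROCESS_LOCAL slot instead of downgrading.
    val conf = new SparkConf().set("spark.locality.wait", "100000000")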

On Thu, Jun 5, 2014 at 3:36 PM, Andrew Ash and...@andrewash.com wrote:
The locality is how close the data is to the code that's processing it.
PROCESS_LOCAL means the data is in the same JVM as the code that's running,
so it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the
same node, or in another executor on the same node, so it's a little slower
because the data has to travel across an IPC connection.  RACK_LOCAL is even
slower -- the data is on a different server, so it needs to be sent over the
network.

Spark switches to lower locality levels when there's no unprocessed data on a 
node that has idle CPUs.  In that situation you have two options: wait until 
the busy CPUs free up so you can start another task that uses data on that 
server, or start a new task on a farther away server that needs to bring data 
from that remote place.  What Spark typically does is wait a bit in the hopes 
that a busy CPU frees up.  Once that timeout expires, it starts moving the data 
from far away to the free CPU.

The main tunable option is how long the scheduler waits before starting to
move data rather than code.  Those are the spark.locality.* settings here: 
http://spark.apache.org/docs/latest/configuration.html

If you want to prevent this from happening entirely, you can set the values to 
ridiculously high numbers.  The documentation also mentions that 0 has 
special meaning, so you can try that as well.

Good luck!
Andrew

On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung coded...@cs.stanford.edu
wrote:
I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume that 
this means fully cached) to NODE_LOCAL or even RACK_LOCAL.

When this happens, things get extremely slow.

Does this mean that the executor got terminated and restarted?

Is there a way to prevent this from happening (barring the machine actually 
going down, I'd rather stick with the same process)?