Re: Caching and Actions

2015-04-09 Thread Sameer Farooqui
Your point #1 is a bit misleading.

 (1) The mappers are not executed in parallel when independently
processing the same RDD.

To clarify, I'd say: within one stage of execution, where pipelining
occurs, the mappers are not executed in parallel when independently
processing the same RDD partition.
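
A minimal sketch of what I mean (assuming an existing SparkContext named
sc; the data is made up for illustration):

    val data = sc.parallelize(Seq((1, 2, 3), (4, 5, 6)))
    val d1 = data.map { case (x, y, z) => (x, y) }
    val d2 = d1.map { case (x, y) => (y, x) }
    // Both maps are narrow transformations, so they are pipelined into a
    // single stage: within each partition, the d1 function and then the
    // d2 function are applied to one record at a time. The two mappers
    // never run as separate parallel passes over the same partition.
    d2.collect()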

On Thu, Apr 9, 2015 at 11:19 AM, spark_user_2015 li...@adobe.com wrote:

 That was helpful!

 The conclusion:
 (1) The mappers are not executed in parallel when independently
 processing the same RDD.
 (2) If enough memory is available and an action is later applied to both
 d1 and d2, the best way seems to be:
    val d1 = data.map { case (x, y, z) => (x, y) }.cache()
    val d2 = d1.map { case (x, y) => (y, x) }
  -> This avoids pipelining the d1 mapper and the d2 mapper when
 computing d2.
 This is important for writing efficient code; toDebugString helps a lot.







Re: Caching and Actions

2015-04-09 Thread Bojan Kostic
You can use toDebugString to see all the steps in the job.
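
For example (a small sketch, assuming an existing SparkContext named sc
and a made-up dataset):

    val data = sc.parallelize(Seq((1, 2, 3), (4, 5, 6)))
    val d2 = data.map { case (x, y, z) => (x, y) }.map(_.swap)
    // Prints the RDD's lineage as an indented tree, one line per RDD,
    // with stage boundaries where shuffles occur; cached RDDs show up
    // with a CachedPartitions entry once they have been materialized.
    println(d2.toDebugString)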

Best
Bojan






Caching and Actions

2015-04-07 Thread spark_user_2015
I understand that RDDs are not computed until an action is called. Is it a
correct conclusion that it doesn't matter whether .cache is used anywhere in
the program if I only have one action, and it is called only once?

Related to this question, consider this situation:
   val d1 = data.map { case (x, y, z) => (x, y) }
   val d2 = data.map { case (x, y, z) => (y, x) }

I'm wondering whether Spark optimizes the execution so that the mappers
for d1 and d2 run in parallel and the data RDD is traversed only once.

If that is not the case, would it make a difference to cache the data RDD,
like this:
   data.cache()
   val d1 = data.map { case (x, y, z) => (x, y) }
   val d2 = data.map { case (x, y, z) => (y, x) }
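
One way to check the difference empirically (a rough sketch with made-up
data; accumulator updates inside transformations are not guaranteed
exactly-once, so the count is only indicative):

   val traversals = sc.accumulator(0L)  // Spark 1.x accumulator API
   val data = sc.parallelize(Seq((1, 2, 3), (4, 5, 6)))
     .map { t => traversals += 1L; t }  // count every record traversal
   val d1 = data.map { case (x, y, z) => (x, y) }
   val d2 = data.map { case (x, y, z) => (y, x) }
   d1.count()
   d2.count()
   // Without cache(), each count recomputes data, so traversals.value
   // ends up around twice the record count; with data.cache() before
   // the first action, the second count reads cached partitions and
   // the source is traversed only once.
   println(traversals.value)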

Furthermore, consider:
   val d3 = d1.map { case (x, y) => (y, x) }

d2 and d3 are equivalent. Which implementation should be preferred?

Thx.



