Re: Caching and Actions
Your point #1 is a bit misleading:

(1) The mappers are not executed in parallel when independently processing the same RDD.

To clarify, I'd say: in one stage of execution, when pipelining occurs, the mappers are not executed in parallel when independently processing the same RDD partition.

On Thu, Apr 9, 2015 at 11:19 AM, spark_user_2015 li...@adobe.com wrote:

That was helpful! The conclusions:

(1) The mappers are not executed in parallel when independently processing the same RDD.

(2) If enough memory is available, and an action is later applied to both d1 and d2, the best approach seems to be:

val d1 = data.map { case (x, y, z) => (x, y) }.cache
val d2 = d1.map { case (x, y) => (y, x) }

This avoids pipelining the d1 mapper and the d2 mapper when computing d2.

This is important for writing efficient code, and toDebugString helps a lot.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Caching-and-Actions-tp22418p22444.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
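The recomputation the conclusion above is guarding against can be illustrated without a cluster. The sketch below is not Spark: it uses a plain Scala lazy view with a counter as a stand-in for an uncached RDD lineage, and a strict collection as the analogue of a cached one. The names d1 and d2 mirror the thread; everything else is an illustrative assumption.

```scala
object CacheSketch {
  def main(args: Array[String]): Unit = {
    var d1Runs = 0
    val data = List((1, 2, 3), (4, 5, 6))

    // Lazy view: the analogue of an uncached lineage. The d1 mapper is
    // pipelined into every downstream traversal.
    val d1Lazy = data.view.map { case (x, y, z) => d1Runs += 1; (x, y) }
    val d2Lazy = d1Lazy.map { case (x, y) => (y, x) }

    d2Lazy.toList // first "action": runs the d1 mapper over all elements
    d2Lazy.toList // second "action": runs the d1 mapper again
    println(d1Runs) // 4 = 2 elements x 2 traversals

    d1Runs = 0
    // Strict collection: the analogue of d1.cache followed by an action.
    // The d1 mapper runs once; d2 reads the materialized result.
    val d1Cached = data.map { case (x, y, z) => d1Runs += 1; (x, y) }
    val d2 = d1Cached.map { case (x, y) => (y, x) }
    println(d1Runs) // 2 = the d1 mapper ran once per element
  }
}
```

In real Spark the same effect is what `.cache` plus a first action buys you: later actions on d2 no longer re-execute the d1 mapper.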
Re: Caching and Actions
You can use toDebugString to see all the steps in the job.

Best,
Bojan

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Caching-and-Actions-tp22418p22433.html
Caching and Actions
I understand that RDDs are not created until an action is called. Is it correct to conclude that it doesn't matter whether .cache is used anywhere in the program if I have only one action, and it is called only once?

Related to this question, consider this situation:

val d1 = data.map { case (x, y, z) => (x, y) }
val d2 = data.map { case (x, y, z) => (y, x) }

I'm wondering whether Spark optimizes the execution so that the mappers for d1 and d2 run in parallel and the data RDD is traversed only once. If that is not the case, would it make a difference to cache the data RDD, like this:

data.cache()
val d1 = data.map { case (x, y, z) => (x, y) }
val d2 = data.map { case (x, y, z) => (y, x) }

Furthermore, consider:

val d3 = d1.map { case (x, y) => (y, x) }

d2 and d3 are equivalent. Which implementation should be preferred?

Thx.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Caching-and-Actions-tp22418.html
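The two-consumer situation in this question can be mimicked without a cluster. The sketch below is not Spark: it uses a plain Scala lazy view over a counter as a stand-in for an uncached RDD, with the names data, d1, and d2 taken from the question and everything else assumed for illustration. It shows the behavior being asked about: each derived collection triggers its own full traversal of the source.

```scala
object TwoConsumers {
  def main(args: Array[String]): Unit = {
    var traversals = 0
    val source = List((1, 2, 3), (4, 5, 6))

    // Uncached analogue: "data" is lazy, so each consumer that forces it
    // walks the underlying source again.
    val data = source.view.map { t => traversals += 1; t }
    val d1 = data.map { case (x, y, z) => (x, y) }.toList // one full pass
    val d2 = data.map { case (x, y, z) => (y, x) }.toList // another full pass
    println(traversals) // 4: the 2-element source was walked twice
  }
}
```

This mirrors what happens to an uncached RDD consumed by two separate actions: the source is recomputed once per consumer, which is exactly what data.cache() is meant to avoid.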