RE: What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce?

2015-06-29 Thread prajod.vettiyattil
Hi,

Any {fan-out -> process in parallel -> fan-in -> aggregate} pattern of data
flow can conceptually be treated as Map-Reduce (MR, as it is done in Hadoop).

Beyond the much larger set of operators (map, reduce, sort, filter, pipe, join,
combine, ...), which are often far more efficient and more productive for
developers, what really differs is how Spark executes them.
Ex: RDDs keep data highly available with only a single copy, because lost
partitions can be recomputed from their recorded lineage. HDFS needs
multiple replicas of every block, which costs a lot of I/O.
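
Something like this (untested sketch in the spark-shell, where sc is the
SparkContext; the HDFS path is made up) shows the lineage Spark records
and can replay to rebuild a lost partition:

// A short lineage chain; nothing is materialized until an action runs.
val counts = sc.textFile("hdfs:///logs/events.txt")   // hypothetical path
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// If an executor dies, Spark recomputes just the lost partitions from
// this graph instead of reading a replica, so one copy can be enough.
println(counts.toDebugString)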

Regards,
Prajod

 Date: Sun, 28 Jun 2015 09:13:18 -0700
 From: jonrgr...@gmail.com
 To: user@spark.apache.org
 Subject: What does "Spark is not just MapReduce" mean? Isn't every Spark job 
 a form of MapReduce?

 I've heard "Spark is not just MapReduce" mentioned during Spark talks, but it
 seems like every method that Spark has is really doing something like (Map
 -> Reduce) or (Map -> Map -> Map -> Reduce) etc. behind the scenes, with the
 performance benefit of keeping RDDs in memory between stages.

 Am I wrong about that? Is Spark doing anything more efficiently than a
 series of Maps followed by a Reduce in memory? What methods does Spark have
 that can't easily be mapped (with somewhat similar efficiency) to Map and
 Reduce in memory?





Re: What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Koert Kuipers
Spark is partitioner-aware, so it can exploit a situation where two datasets
are partitioned the same way (for example, by doing a map-side join on
them). Map-reduce does not expose this.
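
rough sketch of what i mean (untested, toy data, 8 partitions arbitrary):

import org.apache.spark.HashPartitioner

// pre-partition both datasets the same way
val part = new HashPartitioner(8)
val users  = sc.parallelize(Seq((1, "ann"), (2, "bob"))).partitionBy(part)
val orders = sc.parallelize(Seq((1, 9.99), (2, 4.50))).partitionBy(part)

// spark sees the matching partitioners, so matching keys are already
// co-located and the join needs no extra shuffle
val joined = users.join(orders)
joined.collect().foreach(println)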


Re: What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Stephen Boesch
Vanilla map/reduce does not expose it, but Hive on top of map/reduce has
partitioning (and bucketing) support superior to Spark's.



RE: What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Ashic Mahtab
Spark comes with quite a few components. At its core is... surprise... Spark
Core. This provides the core things required to run Spark jobs. Spark provides
a lot of operators out of the box... take a look at
https://spark.apache.org/docs/latest/programming-guide.html#transformations
https://spark.apache.org/docs/latest/programming-guide.html#actions
While all of them can be implemented with variations of rdd.map().reduce(),
there are optimisations to be gained in terms of data locality, etc., and the
additional operators simply make life simpler.
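
For instance (toy data, untested)... both lines below compute per-key sums,
but the built-in operator also combines values on each partition before
anything is shuffled:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Same result either way, but reduceByKey does a map-side combine first,
// so far less data crosses the network.
val manual  = pairs.groupByKey().mapValues(_.sum)
val builtIn = pairs.reduceByKey(_ + _)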
In addition to the core stuff, Spark also brings things like Spark Streaming,
Spark SQL and DataFrames, MLlib, GraphX, etc. Spark Streaming gives you
micro-batches of RDDs at periodic intervals. Think "give me the last 15
seconds of events every 5 seconds". You can then program against the small
collection, and the job will run in a fault-tolerant manner on your cluster.
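
That's just a sliding window. Something like (untested; the socket source
and port are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))        // 5-second batches
val events = ssc.socketTextStream("localhost", 9999)  // placeholder source

// Every 5 seconds, a micro-batch covering the last 15 seconds of events.
val lastFifteen = events.window(Seconds(15), Seconds(5))
lastFifteen.count().print()

ssc.start()
ssc.awaitTermination()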
Spark SQL provides Hive-like functionality that works nicely with various
data sources and RDDs. MLlib provides a lot of out-of-the-box machine
learning algorithms, and the new Spark ML project provides a nice, elegant
pipeline API to take care of a lot of common machine learning tasks. GraphX
allows you to represent data in graphs and run graph algorithms on it. E.g.,
you can represent your data as RDDs of vertices and edges, and run PageRank
on a distributed cluster.
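
A tiny GraphX sketch (untested; the vertex and edge data are made up):

import org.apache.spark.graphx.{Edge, Graph}

// Toy graph: three named vertices and a cycle of directed edges.
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph    = Graph(vertices, edges)

// Runs PageRank to convergence, distributed across the cluster.
graph.pageRank(0.0001).vertices.collect().foreach(println)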
And there's more. So yeah... Spark is definitely not just MapReduce. :)


RE: What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Michael Malak
I would also add that, from a data-locality standpoint, mapPartitions()
provides for node-local computation that plain old map-reduce does not.
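
Rough sketch (untested; 8 partitions arbitrary) of why that matters: the
function runs once per partition, on the node that holds it, so
per-partition work is amortized instead of paid per record.

// Each partition is summed locally where it lives; only 8 partial sums
// (one per partition) ever travel back to the driver.
val partials = sc.parallelize(1L to 1000000L, 8).mapPartitions { iter =>
  Iterator.single(iter.sum)
}
println(partials.collect().sum)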


