Re: Comparative study

2014-07-09 Thread Sean Owen
On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons ke...@pulse.io wrote: Impala is *not* built on map/reduce, though it was built to replace Hive, which is map/reduce based. It has its own distributed query engine, though it does load data from HDFS, and is part of the Hadoop ecosystem. Impala

Re: Comparative study

2014-07-09 Thread Keith Simmons
Good point. Shows how personal use cases color how we interpret products. On Wed, Jul 9, 2014 at 1:08 AM, Sean Owen so...@cloudera.com wrote: On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons ke...@pulse.io wrote: Impala is *not* built on map/reduce, though it was built to replace Hive, which

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
In addition to Scalding and Scrunch, there is Scoobi. Unlike the others, it is only Scala (it doesn't wrap a Java framework). All three have fairly similar APIs and aren't too different from Spark. For example, instead of RDD you have DList (distributed list) or PCollection (parallel collection) -
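To make the similarity concrete, here is a minimal word count in Spark's RDD API, with comments noting the analogous types in the other libraries. (A sketch only: the paths and app name are placeholders, not from this thread.)

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Local master for illustration; a real job would set this via spark-submit.
        val sc = new SparkContext(new SparkConf().setAppName("wordcount").setMaster("local[*]"))
        val counts = sc.textFile("input.txt")   // RDD[String]; Scoobi: DList[String], Scrunch: PCollection[String]
          .flatMap(_.split("\\s+"))             // split lines into words
          .map(word => (word, 1))               // RDD[(String, Int)]
          .reduceByKey(_ + _)                   // shuffle and sum counts per word
        counts.saveAsTextFile("output")         // nothing runs until this action
        sc.stop()
      }
    }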

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
I don't have those numbers off-hand, though the shuffle spill to disk was coming to several gigabytes per node, if I recall correctly. The MapReduce pipeline takes about 2-3 hours, I think, for the full 60-day data set. Spark chugs along fine for a while and then hangs. We restructured the flow a

Re: Comparative study

2014-07-08 Thread Kevin Markey
When you say "large data sets", how large? Thanks On 07/07/2014 01:39 PM, Daniel Siegmann wrote: From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained; Spark's

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
I'll respond for Dan. Our test dataset was a total of 10 GB of input data (the full production dataset for this particular dataflow would be roughly 60 GB). I'm not sure what the size of the final output data was, but I think it was on the order of 20 GB for the given 10 GB of input data. Also, I

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
I believe our full 60 days of data contains over ten million unique entities. Across 10 days I'm not sure, but it should be in the millions. I haven't verified that myself though. So that's the scale of the RDD we're writing to disk (each entry is entityId -> profile). I think it's hard to know

Re: Comparative study

2014-07-08 Thread Kevin Markey
It seems to me that you're not taking full advantage of the lazy evaluation, especially persisting to disk only. While it might be true that the cumulative size of the RDDs looks like it's 300 GB, only a small portion of that should be resident at any one time. We've
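For readers following along, a minimal sketch of the distinction (assumes a SparkContext sc is in scope; paths and field layout are hypothetical):

    import org.apache.spark.storage.StorageLevel

    // Transformations only build a lineage graph; no partitions are resident yet.
    val pairs = sc.textFile("hdfs:///input")
      .map { line => val f = line.split("\t"); (f(0), f(1)) }   // lazy
    val grouped = pairs.groupByKey()                            // still lazy

    // DISK_ONLY writes every materialized partition to disk, using no cache memory:
    grouped.persist(StorageLevel.DISK_ONLY)

    // MEMORY_AND_DISK would instead keep partitions in memory and spill only under
    // pressure, so cumulative RDD size can far exceed what is resident at any time:
    // grouped.persist(StorageLevel.MEMORY_AND_DISK)

    grouped.saveAsTextFile("hdfs:///output")   // the action that triggers the work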

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
To clarify, we are not persisting to disk. That was just one of the experiments we did because of some issues we had along the way. At this time, we are NOT using persist but cannot get the flow to complete in Standalone Cluster mode. We do not have a YARN-capable cluster at this time. We agree

Re: Comparative study

2014-07-08 Thread Kevin Markey
Nothing particularly custom. We've tested with small (4 node) development clusters, single-node pseudoclusters, and bigger, using plain-vanilla Hadoop 2.2 or 2.3 or CDH5 (beta and beyond), in Spark master, Spark local, Spark Yarn (client and cluster) modes, with total

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
How wide are the rows of data, either the raw input data or any generated intermediate data? We are at a loss as to why our flow doesn't complete. We banged our heads against it for a few weeks. -Suren On Tue, Jul 8, 2014 at 2:12 PM, Kevin Markey kevin.mar...@oracle.com wrote: Nothing

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
Also, our exact same flow, but with 1 GB of input data, completed fine. -Suren On Tue, Jul 8, 2014 at 2:16 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: How wide are the rows of data, either the raw input data or any generated intermediate data? We are at a loss as to why our flow

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
We kind of hijacked Santosh's original thread, so apologies for that, and let me try to get back to Santosh's original question on Map/Reduce versus Spark. I would say it's worth migrating from M/R, with the following thoughts. Just my opinion, but I would summarize the latest emails in this thread as

Re: Comparative study

2014-07-08 Thread Sean Owen
On Tue, Jul 8, 2014 at 8:32 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: Libraries like Scoobi, Scrunch and Scalding (and their associated Java versions) provide a Spark-like wrapper around Map/Reduce but my guess is that, since they are limited to Map/Reduce under the covers, they

Re: Comparative study

2014-07-08 Thread Reynold Xin
Not sure exactly what is happening, but perhaps there are ways to restructure your program for it to work better. Spark is definitely able to handle much, much larger workloads. I've personally run a workload that shuffled 300 TB of data. I've also run something that shuffled 5 TB/node and stuffed
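One common restructuring along these lines (a guess at what that could mean here, with illustrative numbers rather than ones from the thread) is raising the partition count so each shuffle task handles a smaller slice:

    // Assumes a SparkContext sc; the count 2048 is purely illustrative.
    val pairs = sc.textFile("hdfs:///input")
      .map { line => val f = line.split("\t"); (f(0), 1L) }

    // Pass an explicit partition count to the shuffle itself:
    val counts = pairs.reduceByKey(_ + _, 2048)

    // Or rebalance an existing RDD before an expensive stage:
    val rebalanced = pairs.repartition(2048)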

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
I think we're missing the point a bit. Everything was actually flowing through smoothly and in a reasonable time. Until it reached the last two tasks (out of over a thousand in the final stage alone), at which point it just fell into a coma. Not so much as a cranky message in the logs. I don't

Re: Comparative study

2014-07-08 Thread Aaron Davidson
Not sure exactly what is happening but perhaps there are ways to restructure your program for it to work better. Spark is definitely able to handle much, much larger workloads. +1 @Reynold Spark can handle big big data. There are known issues with informing the user about what went wrong

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
Aaron, I don't think anyone was saying Spark can't handle this data size, given testimony from the Spark team, Bizo, etc., on large datasets. This has kept us trying different things to get our flow to work over the course of several weeks. Agreed that the first instinct should be what did I do

Re: Comparative study

2014-07-08 Thread Robert James
As a new user, I can definitely say that my experience with Spark has been rather raw. The appeal of interactive, batch, and in between all using more or less straight Scala is unarguable. But the experience of deploying Spark has been quite painful, mainly about gaps between compile time and

Re: Comparative study

2014-07-08 Thread Keith Simmons
Santosh, To add a bit more to what Nabeel said, Spark and Impala are very different tools. Impala is *not* built on map/reduce, though it was built to replace Hive, which is map/reduce based. It has its own distributed query engine, though it does load data from HDFS, and is part of the Hadoop

Re: Comparative study

2014-07-07 Thread Daniel Siegmann
From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained; Spark's API feels much more natural to me. Testing and local development are also very easy: creating a local Spark context is trivial, and it reads local files. For your unit tests you can
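A sketch of what that looks like (the names and test file path are made up for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    // A local-mode context for unit tests; no cluster required.
    // "local[2]" runs the driver plus two worker threads in one JVM.
    val conf = new SparkConf().setMaster("local[2]").setAppName("unit-test")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.textFile("src/test/resources/sample.txt")  // a plain local file
      assert(lines.count() > 0)
    } finally {
      sc.stop()  // stop the context so later tests can create their own
    }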

RE: Comparative study

2014-07-07 Thread santosh.viswanathan
Thanks Daniel for sharing this info. Regards, Santosh Karthikeyan From: Daniel Siegmann [mailto:daniel.siegm...@velos.io] Sent: Tuesday, July 08, 2014 1:10 AM To: user@spark.apache.org Subject: Re: Comparative study From a development perspective, I vastly prefer Spark to MapReduce

Re: Comparative study

2014-07-07 Thread Nabeel Memon
Siegmann [mailto:daniel.siegm...@velos.io] *Sent:* Tuesday, July 08, 2014 1:10 AM *To:* user@spark.apache.org *Subject:* Re: Comparative study From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained; Spark's API feels much more natural to me

Re: Comparative study

2014-07-07 Thread Sean Owen
On Tue, Jul 8, 2014 at 1:05 AM, Nabeel Memon nm3...@gmail.com wrote: For a Scala API on map/reduce (the Hadoop engine) there's a library called Scalding. It's built on top of Cascading. If you have a huge dataset or if you consider using the map/reduce engine for your job, for any reason, you can try
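For context, the canonical Scalding word count, circa 2014 (fields-based API; input and output paths come from job arguments):

    import com.twitter.scalding._

    // Runs on Cascading, which compiles the flow into Map/Reduce jobs.
    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.split("\\s+") }
        .groupBy('word) { _.size }   // yields a 'size field with the count
        .write(Tsv(args("output")))
    }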

Re: Comparative study

2014-07-07 Thread Soumya Simanta
Daniel, Do you mind sharing the size of your cluster and the production data volumes? Thanks Soumya On Jul 7, 2014, at 3:39 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained;