On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons ke...@pulse.io wrote:
Impala is *not* built on map/reduce, though it was built to replace Hive,
which is map/reduce based. It has its own distributed query engine, though
it does load data from HDFS, and is part of the Hadoop ecosystem.
Good point. Shows how personal use cases color how we interpret products.
On Wed, Jul 9, 2014 at 1:08 AM, Sean Owen so...@cloudera.com wrote:
In addition to Scalding and Scrunch, there is Scoobi. Unlike the others, it
is pure Scala (it doesn't wrap a Java framework). All three have fairly
similar APIs and aren't too different from Spark. For example, instead of
an RDD you have a DList (distributed list) or a PCollection (parallel
collection).
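To make the similarity concrete, here is a minimal word count written as a
plain-Scala function; the Spark RDD and Scoobi DList spellings in the
comments are sketched from memory, so method names may differ by version:

```scala
// Plain-Scala word count illustrating the shared shape of these APIs.
// Spark RDD (approximate):
//   sc.textFile(path).flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
// Scoobi DList (approximate):
//   fromTextFile(path).flatMap(_.split("\\s+")).map((_, 1)).groupByKey.combine(_ + _)
object WordCount {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))          // tokenize on whitespace
      .filter(_.nonEmpty)                // drop empty tokens
      .groupBy(identity)                 // analogous to groupByKey
      .map { case (w, ws) => (w, ws.size) }
}
```

For example, `WordCount.wordCount(Seq("a b", "a"))` yields `Map("a" -> 2, "b" -> 1)`.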
I don't have those numbers off-hand, though the shuffle spill to disk was
reaching several gigabytes per node, if I recall correctly.
The MapReduce pipeline takes about 2-3 hours, I think, for the full 60-day
data set. Spark chugs along fine for a while and then hangs. We restructured
the flow a
When you say "large data sets", how large?
Thanks
On 07/07/2014 01:39 PM, Daniel Siegmann wrote:
From a development perspective, I vastly prefer Spark to
MapReduce. The MapReduce API is very constrained; Spark's
I'll respond for Dan.
Our test dataset was a total of 10 GB of input data (full production
dataset for this particular dataflow would be 60 GB roughly).
I'm not sure what the size of the final output data was, but I think it was
on the order of 20 GB for the given 10 GB of input data. Also, I
I believe our full 60 days of data contains over ten million unique
entities. Across 10 days I'm not sure, but it should be in the millions. I
haven't verified that myself though. So that's the scale of the RDD we're
writing to disk (each entry is entityId -> profile).
I think it's hard to know
It seems to me that you're not taking full advantage of the lazy
evaluation, especially persisting to disk only. While it might be
true that the cumulative size of the RDDs looks like it's 300GB,
only a small portion of that should be resident at any one time.
We've
To clarify, we are not persisting to disk. That was just one of the
experiments we did because of some issues we had along the way.
At this time, we are NOT using persist but cannot get the flow to complete
in Standalone Cluster mode. We do not have a YARN-capable cluster at this
time.
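As an aside, the laziness being discussed here can be illustrated without
Spark at all: Scala collection views behave much like unpersisted RDDs
(nothing runs until something forces it), and materializing a view is
roughly analogous to persisting. A small sketch, with a counter standing in
for an expensive computation:

```scala
object LazyDemo {
  var evaluations = 0  // counts how many times the "expensive" step runs

  def run(): (Int, Int) = {
    evaluations = 0
    // Lazy pipeline, like an unpersisted RDD: map is not evaluated yet.
    val data = (1 to 1000).view.map { x => evaluations += 1; x * 2 }
    data.take(3).sum               // forces only 3 elements
    val afterLazy = evaluations    // 3, not 1000
    val cached = data.toVector     // materializing ~ persist(): forces all 1000 once
    cached.take(3).sum             // re-reads the cache; no recomputation
    (afterLazy, evaluations)
  }
}
```

Running `LazyDemo.run()` shows that only three elements were computed before
materialization, and the cached vector is never recomputed afterwards.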
We agree
Nothing particularly custom. We've tested with small (4 node)
development clusters, single-node pseudoclusters, and bigger, using
plain-vanilla Hadoop 2.2 or 2.3 or CDH5 (beta and beyond), in Spark
master, Spark local, Spark Yarn (client and cluster) modes, with
total
How wide are the rows of data, either the raw input data or any generated
intermediate data?
We are at a loss as to why our flow doesn't complete. We banged our heads
against it for a few weeks.
-Suren
On Tue, Jul 8, 2014 at 2:12 PM, Kevin Markey kevin.mar...@oracle.com
wrote:
Nothing
Also, our exact same flow but with 1 GB of input data completed fine.
-Suren
On Tue, Jul 8, 2014 at 2:16 PM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
We kind of hijacked Santosh's original thread, so apologies for that; let me
try to get back to Santosh's original question on Map/Reduce versus Spark.
I would say it's worth migrating from M/R, with the following thoughts.
Just my opinion but I would summarize the latest emails in this thread as
On Tue, Jul 8, 2014 at 8:32 PM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
Libraries like Scoobi, Scrunch, and Scalding (and their associated Java
versions) provide a Spark-like wrapper around Map/Reduce, but my guess is
that, since they are limited to Map/Reduce under the covers, they
Not sure exactly what is happening but perhaps there are ways to
restructure your program for it to work better. Spark is definitely able to
handle much, much larger workloads.
I've personally run a workload that shuffled 300 TB of data. I've also run
something that shuffled 5 TB/node and stuffed
I think we're missing the point a bit. Everything was actually flowing
through smoothly and in a reasonable time. Until it reached the last two
tasks (out of over a thousand in the final stage alone), at which point it
just fell into a coma. Not so much as a cranky message in the logs.
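Purely a guess from the symptom, and not something confirmed in this thread:
two stragglers out of a thousand tasks is the classic signature of key skew,
where a handful of keys own most of the records. A quick check on a sample,
sketched in plain Scala with hypothetical names (on an RDD the rough
equivalent would be counting occurrences per key):

```scala
// Hypothetical skew check: count records per key on a sample and inspect
// the heaviest keys. If a couple of keys dominate, the tasks that receive
// them will run far longer than the rest of the stage.
object SkewCheck {
  def topKeys(sample: Seq[(String, String)], n: Int): Seq[(String, Int)] =
    sample
      .groupBy(_._1)                         // bucket records by key
      .map { case (k, vs) => (k, vs.size) }  // count per key
      .toSeq
      .sortBy(-_._2)                         // heaviest keys first
      .take(n)
}
```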
I don't
+1 @Reynold
Spark can handle big big data. There are known issues with informing the
user about what went wrong
Aaron,
I don't think anyone was saying Spark can't handle this data size, given
testimony from the Spark team, Bizo, etc., on large datasets. This has kept
us trying different things to get our flow to work over the course of
several weeks.
Agreed that the first instinct should be what did I do
As a new user, I can definitely say that my experience with Spark has
been rather raw. The appeal of interactive, batch, and in between all
using more or less straight Scala is unarguable. But the experience
of deploying Spark has been quite painful, mainly about gaps between
compile time and
Santosh,
To add a bit more to what Nabeel said, Spark and Impala are very different
tools. Impala is *not* built on map/reduce, though it was built to replace
Hive, which is map/reduce based. It has its own distributed query engine,
though it does load data from HDFS, and is part of the Hadoop ecosystem.
From a development perspective, I vastly prefer Spark to MapReduce. The
MapReduce API is very constrained; Spark's API feels much more natural to
me. Testing and local development are also very easy: creating a local
Spark context is trivial, and it reads local files. For your unit tests you
can
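One common pattern for this, sketched with hypothetical names (the commented
Spark wiring follows the Spark 1.x API): keep the transformation as a pure
function over ordinary collections, so unit tests need no cluster at all.

```scala
// Sketch of the testing pattern described above: the core logic is a pure
// function, so tests exercise it on a plain Seq with no Spark dependency.
object SessionStats {
  // Pure logic: average value per key.
  def avgByKey(pairs: Seq[(String, Double)]): Map[String, Double] =
    pairs.groupBy(_._1).map { case (k, vs) =>
      (k, vs.map(_._2).sum / vs.size)
    }

  // The same shape runs on an RDD in production (Spark 1.x wiring, shown
  // as a comment since it needs a Spark dependency on the classpath):
  //   val sc  = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
  //   sc.parallelize(pairs)
  //     .mapValues((_, 1))
  //     .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  //     .mapValues { case (s, c) => s / c }
}
```

A unit test then just asserts on `avgByKey` directly, with no cluster setup.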
Thanks Daniel for sharing this info.
Regards,
Santosh Karthikeyan
From: Daniel Siegmann [mailto:daniel.siegm...@velos.io]
Sent: Tuesday, July 08, 2014 1:10 AM
To: user@spark.apache.org
Subject: Re: Comparative study
From a development perspective, I vastly prefer Spark to MapReduce. The
MapReduce API is very constrained; Spark's API feels much more natural to
me
On Tue, Jul 8, 2014 at 1:05 AM, Nabeel Memon nm3...@gmail.com wrote:
For a Scala API on map/reduce (the Hadoop engine) there's a library called
Scalding, built on top of Cascading. If you have a huge dataset, or if you're
considering the map/reduce engine for your job for any reason, you can try
Daniel,
Do you mind sharing the size of your cluster and the production data volumes?
Thanks
Soumya
On Jul 7, 2014, at 3:39 PM, Daniel Siegmann daniel.siegm...@velos.io wrote:
From a development perspective, I vastly prefer Spark to MapReduce. The
MapReduce API is very constrained;