Here's the 1.0.0rc9 version of the docs:
https://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/running-on-mesos.html
I refreshed them with the goal of steering users more towards prebuilt
packages than relying on compiling from source plus improving overall
formatting and clarity, but not
Hi Mohit,
The log line about the ExternalAppendOnlyMap is more of a symptom of
slowness than causing slowness itself. The ExternalAppendOnlyMap is used
when a shuffle is causing too much data to be held in memory. Rather than
OOM'ing, Spark writes the data out to disk in a sorted order and
One thing you can try is to pull each file out of S3 and decompress with
gzip -d to see if it works. I'm guessing there's a corrupted .gz file
somewhere in your path glob.
Andrew
On Wed, May 21, 2014 at 12:40 PM, Michael Cutler mich...@tumra.com wrote:
Hi Nick,
Which version of Hadoop are
Hi Puneet,
If you're not going to read/write data in HDFS from your Spark cluster,
then it doesn't matter which one you download. Just go with Hadoop 2 as
that's more likely to connect to an HDFS cluster in the future if you ever
do decide to use HDFS because it's the newer APIs.
Cheers,
Andrew
If the distribution of the keys in your groupByKey is skewed (some keys
appear way more often than others) you should consider modifying your job
to use reduceByKey instead wherever possible.
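A minimal plain-Scala model of why the swap helps (hypothetical word-count-style pairs; in Spark you would call these methods on a pair RDD):

```scala
// Hypothetical (key, value) pairs; the skewed key "a" appears far more often.
val pairs = Seq(("a", 1), ("a", 1), ("a", 1), ("b", 1))

// groupByKey-style: materialize every value for a key before reducing,
// so the skewed key's entire value list crosses the shuffle.
val grouped = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

// reduceByKey-style: combine values pairwise as you go; Spark does this
// map-side, shipping one partial sum per partition instead of every record.
val reduced = pairs.foldLeft(Map.empty[String, Int]) { case (m, (k, v)) =>
  m.updated(k, m.getOrElse(k, 0) + v)
}
```

Both produce the same result; the difference is shuffle volume, not output.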
On May 20, 2014 12:53 PM, Jon Keebler jkeeble...@gmail.com wrote:
So we upped the spark.akka.frameSize
I know many people might not like it, but maybe the list messages
should have a footer with this administrative info (even if it's just
a link to the archive page)?
On Sun, May 18, 2014 at 1:49 PM, Andrew Ash and...@andrewash.com wrote:
If you'd like to get off this mailing list, please send
Hi yxzhao,
Those are branches in the source code git repository. You can get to them
with git checkout branch-1.0 once you've cloned the git repository.
Cheers,
Andrew
On Mon, May 19, 2014 at 8:30 PM, yxzhao yxz...@ualr.edu wrote:
Thanks Xiangrui,
Sorry I am new for Spark, could
with multiple cores.
2) BZip2 files are big enough or minPartitions is large enough when
you load the file via sc.textFile(), so that one worker has more than
one tasks.
Best,
Xiangrui
On Fri, May 16, 2014 at 4:06 PM, Andrew Ash and...@andrewash.com wrote:
Hi Xiangrui,
// FYI I'm
Is your RDD of Strings? If so, you should make sure to use the Kryo
serializer instead of the default Java one. It stores strings as UTF8
rather than Java's default UTF16 representation, which can save you half
the memory usage in the right situation.
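The size difference is easy to check on the JVM (illustrative only; Kryo's exact on-disk layout differs, but the UTF-8-vs-UTF-16 ratio is the point):

```scala
// JVM strings are UTF-16 in memory: 2 bytes per ASCII character.
// Kryo writes strings as UTF-8: 1 byte per ASCII character.
val s = "spark" * 1000
val utf8  = s.getBytes("UTF-8").length
val utf16 = s.getBytes("UTF-16LE").length // LE avoids the 2-byte BOM of "UTF-16"
```

For ASCII-heavy data the UTF-16 form is exactly twice the size, which is where the "half the memory" figure comes from.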
Try setting the persistence level on the
Spark's sc.textFile() method (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456) delegates to sc.hadoopFile(), which uses Hadoop's
Hi Shangyu (and everyone else looking to unsubscribe!),
If you'd like to get off this mailing list, please send an email to user-unsubscribe@spark.apache.org, not the regular user@spark.apache.org list.
How to use the Apache mailing list infrastructure is documented here:
, 2014 at 9:19 PM, Andrew Ash and...@andrewash.com wrote:
Hi all,
Is anyone reading and writing to .bz2 files stored in HDFS from Spark
with
success?
I'm finding the following results on a recent commit (756c96 from 24hr
ago)
and CDH 4.4.0:
Works: val r = sc.textFile(/user/aa
They are different terminology for the same thing and should be
interchangeable.
On Fri, May 16, 2014 at 2:02 PM, Robert James srobertja...@gmail.com wrote:
What is the difference between a Spark Worker and a Spark Slave?
There's an undocumented mode that looks like it simulates a cluster:
SparkContext.scala:
// Regular expression for simulating a Spark cluster of [N, cores,
memory] locally
val LOCAL_CLUSTER_REGEX = """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r
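That pattern accepts master URLs like the one below. A quick sketch of the extraction (the triple-quoted form mirrors how such patterns are written in the Spark source):

```scala
// Mirror of the pattern above. The unescaped ']' outside a character class
// is treated as a literal, so no escape is needed there.
val LocalCluster =
  """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r

// Pattern matching a Regex against a String requires a full match,
// then binds each capture group: workers, cores per worker, memory (MB).
val parsed = "local-cluster[2, 1, 512]" match {
  case LocalCluster(workers, cores, mem) =>
    Some((workers.toInt, cores.toInt, mem.toInt))
  case _ => None
}
```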
Can you try running your tests
Are you setting a core limit with spark.cores.max? If you don't, in coarse-grained
mode each Spark job uses all available cores on Mesos and doesn't let them
go until the job is terminated, at which point the other job can access
the cores.
https://spark.apache.org/docs/latest/running-on-mesos.html --
Hi Weide,
The answer to your first question about local[2] can be found in the
Running the Examples and Shell section of
https://spark.apache.org/docs/latest/
Note that all of the sample programs take a master parameter specifying
the cluster URL to connect to. This can be a URL for a
Hi Eduardo,
Yep those machines look pretty well synchronized at this point. Just
wanted to throw that out there and eliminate it as a possible source of
confusion.
Good luck on continuing the debugging!
Andrew
On Sat, May 3, 2014 at 11:59 AM, Eduardo Costa Alfaia
e.costaalf...@unibs.it
Deenar,
I haven't heard of any activity to do partitioning in that way, but it does
seem more broadly valuable.
On Fri, May 2, 2014 at 10:15 AM, deenar.toraskar deenar.toras...@db.com wrote:
I have equal sized partitions now, but I want the RDD to be partitioned
such
that the partitions are
= {}) // Force evaluation
oldRdd.unpersist(true)
Following my usage pattern, I tried not unpersisting the intermediate
RDDs (i.e. oldRdd), but nothing changed.
Any hints? How could I debug this?
2014-04-14 12:55 GMT+02:00 Andrew Ash and...@andrewash.com:
A lot of your time is being
The problem is that equally-sized partitions take variable time to complete
based on their contents?
Sent from my mobile phone
On May 1, 2014 8:31 AM, deenar.toraskar deenar.toras...@db.com wrote:
Hi
I am using Spark to distribute computationally intensive tasks across the
cluster. Currently
For the second question, you can submit multiple jobs through the same
SparkContext via different threads and this is a supported way of
interacting with Spark.
From the documentation:
Second, *within* each Spark application, multiple “jobs” (Spark actions)
may be running concurrently if they
That thread was mostly about benchmarking YARN vs standalone, and the
results are what I'd expect -- spinning up a Spark cluster on demand
through YARN has higher startup latency than using a standalone cluster,
where the JVMs are already initialized and ready.
Given that there's a lot more
The homepage for Ooyala's job server is here:
https://github.com/ooyala/spark-jobserver
They decided (I think with input from the Spark team) that it made more
sense to keep the jobserver in a separate repository for now.
Andrew
On Fri, Apr 18, 2014 at 5:42 AM, Azuryy Yu azury...@gmail.com
a
mapPartitions() operation to let you do whatever you want with a partition.
What I need is a way to get my hands on two partitions at once, each from
different RDDs.
Any ideas?
Thanks,
Roger
On Mon, Apr 14, 2014 at 5:45 PM, Andrew Ash and...@andrewash.com wrote:
Are your IPRanges all
A lot of your time is being spent in garbage collection (second image).
Maybe your dataset doesn't easily fit into memory? Can you reduce the
number of new objects created in myFun?
How big are your heap sizes?
Another observation is that in the 4th image some of your RDDs are massive
and some
Are your IPRanges all on nice, even CIDR-format ranges? E.g. 192.168.0.0/16 or
10.0.0.0/8?
If the range is always an even subnet mask and not split across subnets,
I'd recommend flatMapping the ipToUrl RDD to (IPRange, String) and then
joining the two RDDs. The expansion would be at most 32x if
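A rough sketch of the flatMap step (hypothetical IPRange type and /24 keying, chosen here for illustration; the expansion factor is 2^(24 - maskBits) for this choice of granularity):

```scala
// Hypothetical range type: network address as a Long plus a mask length.
case class IPRange(network: Long, maskBits: Int)

// Expand a range into the /24 blocks it covers, so both RDDs can be keyed
// at the same granularity and joined. A /16 expands into 256 /24 blocks.
def to24Blocks(r: IPRange): Seq[Long] = {
  require(r.maskBits <= 24, "finer-than-/24 ranges need different handling")
  val count = 1L << (24 - r.maskBits)
  (0L until count).map(i => r.network + (i << 8))
}
```

In Spark this helper would be the body of the flatMap over the ipToUrl RDD, producing (block, String) pairs ready to join.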
Hi Spark users,
I've always done all my Spark work in Scala, but occasionally people ask
about Python and its performance impact vs the same algorithm
implementation in Scala.
Has anyone done tests to measure the difference?
Anecdotally I've heard Python is a 40% slowdown but that's entirely
once, so as long as
I get it done (even if it's a more manual process than it should be) is ok.
Hope that helps!
Andrew
On Sun, Apr 13, 2014 at 4:33 PM, Jim Blomo jim.bl...@gmail.com wrote:
On Thu, Apr 10, 2014 at 12:24 PM, Andrew Ash and...@andrewash.com wrote:
The biggest issue I've come
The naive way would be to put all the users and their attributes into an
RDD, then cartesian product that with itself. Run the similarity score on
every pair (1M * 1M = 1T scores), map to (user, (score, otherUser)) and
take the .top(k) for each user.
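A tiny plain-Scala model of that pipeline (hypothetical users and a stand-in dot-product similarity; in Spark the same shape is cartesian, then map, then top-k per key, and at 1M users the 1T pairs is exactly why it's naive):

```scala
case class User(id: String, attrs: Vector[Double])

// Stand-in similarity: a dot product over attribute vectors.
def score(a: User, b: User): Double =
  a.attrs.zip(b.attrs).map { case (x, y) => x * y }.sum

val users = Seq(
  User("u1", Vector(1.0, 0.0)),
  User("u2", Vector(0.5, 0.5)),
  User("u3", Vector(0.0, 1.0))
)

// Cartesian product, score every pair, keep each user's single best neighbor.
val top1: Map[String, String] = (for {
  a <- users; b <- users if a.id != b.id
} yield (a.id, (score(a, b), b.id)))
  .groupBy(_._1)
  .map { case (u, xs) => (u, xs.map(_._2).maxBy(_._1)._2) }
```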
I doubt that you'll be able to take this
The biggest issue I've come across is that the cluster is somewhat unstable
when under memory pressure. Meaning that if you attempt to persist an RDD
that's too big for memory, even with MEMORY_AND_DISK, you'll often still
get OOMs. I had to carefully modify some of the space tuning parameters
A JVM can easily be limited in how much memory it uses with the -Xmx
parameter, but Python doesn't have memory limits built in in such a
first-class way. Maybe the memory limits aren't making it to the python
executors.
What was your SPARK_MEM setting? The JVM below seems to be using 603201
For 1, persist can be used to save an RDD to disk using the various
persistence levels. When a persistence level is set on an RDD, when that
RDD is evaluated it's saved to memory/disk/elsewhere so that it can be
re-used. It's applied to that RDD, so that subsequent uses of the RDD can
use the
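The effect is essentially memoization: evaluate once, then serve every later use from the stored copy. A plain-Scala analogue (lazy val as a stand-in for a persisted RDD):

```scala
var evaluations = 0

// Stand-in for an expensive RDD lineage; persisting means "realize this once,
// then serve subsequent actions from the stored copy".
lazy val cached: Seq[Int] = {
  evaluations += 1            // counts how many times the "lineage" runs
  (1 to 5).map(_ * 2)
}

val total = cached.sum        // first use: materializes the value
val top   = cached.max        // second use: served from the stored value
```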
the entire RDD has been realized in memory. Is that
correct?
-Suren
On Wed, Apr 9, 2014 at 9:25 AM, Andrew Ash and...@andrewash.com wrote:
For 1, persist can be used to save an RDD to disk using the various
persistence levels. When a persistence level is set on an RDD, when that
RDD
The typical way to handle that use case would be to join the 3 files
together into one RDD and then do the factorization on that. There will
definitely be network traffic during the initial join to get everything
into one table, and after that there will likely be more network traffic
for various
and then call persist on the resulting RDD. So I'm
wondering if groupByKey is aware of the subsequent persist setting to use
disk or just creates the Seq[V] in memory and only uses disk after that
data structure is fully realized in memory.
-Suren
On Wed, Apr 9, 2014 at 9:46 AM, Andrew Ash
Not that I know of, but it would be great if that was supported. The way I
typically handle security now is to put the Spark servers in their own
subnet with strict inbound/outbound firewalls.
On Tue, Apr 8, 2014 at 1:14 PM, kamatsuoka ken...@gmail.com wrote:
Can Spark be configured to use
If you set up Spark's metrics reporting to write to the Ganglia backend
that will give you a good idea of how much network/disk/CPU is being used
and on what machines.
https://spark.apache.org/docs/0.9.0/monitoring.html
On Tue, Apr 8, 2014 at 12:57 PM, yxzhao yxz...@ualr.edu wrote:
Hi All,
One thing you could do is create an RDD of [1,2,3] and set a partitioner
that puts all three values on their own nodes. Then .foreach() over the
RDD and call your function that will run on each node.
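A plain-Scala sketch of the partitioner's routing (hypothetical class; in Spark you would extend org.apache.spark.Partitioner and apply it with partitionBy before the foreach):

```scala
// Route integer key i to partition i - 1, so keys 1, 2, 3 each land alone
// on their own partition (and hence, one partition per node, their own node).
class OnePerNode(val numPartitions: Int) {
  def getPartition(key: Int): Int = (key - 1) % numPartitions
}

val p = new OnePerNode(3)
val placement = Seq(1, 2, 3).map(k => k -> p.getPartition(k))
```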
Why do you need to run the function on every node? Is it some sort of
setup code that needs to
I occasionally see links to pages in the spark.incubator.apache.org domain.
Can we HTTP 301 redirect that whole domain to spark.apache.org now that
the project has graduated? The content seems identical.
That would also make the eventual decommission of the incubator domain much
easier as usage
Is this Spark 0.9.0? Try setting spark.shuffle.spill=false. There was a hash
collision bug that's fixed in 0.9.1 that might cause you to have too few
results in that join.
Sent from my mobile phone
On Mar 28, 2014 8:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Weird, how exactly are you
My thought would be to key by the first item in each array, then take just
one array for each key. Something like the below:
val v = sc.parallelize(Seq(Seq(1,2,3,4), Seq(1,5,2,3), Seq(2,3,4,5)))
val col = 0
val output = v.keyBy(_(col)).reduceByKey((a, b) => a).values
On Tue, Mar 25, 2014 at 1:21 AM, Chengi Liu
Possibly one of your executors is in the middle of a large stop-the-world
GC and doesn't respond to network traffic during that period? If you
shared some information about how each node in your cluster is set up (heap
size, memory, CPU, etc) that might help with debugging.
Andrew
On Mon, Mar
FWIW I've seen correctness errors with spark.shuffle.spill on 0.9.0 and
have it disabled now. The specific error behavior was that a join would
consistently return one count of rows with spill enabled and another count
with it disabled.
Sent from my mobile phone
On Mar 22, 2014 1:52 PM, Kane
Jim, I'm starting to document the heap size settings all in one place,
which has been a confusion for a lot of my peers. Maybe you can take a
look at this ticket?
https://spark-project.atlassian.net/browse/SPARK-1264
On Wed, Mar 19, 2014 at 12:53 AM, Jim Blomo jim.bl...@gmail.com wrote:
To
Hi Punya,
This seems like a problem that the recently-announced job-server would
likely have run into at one point. I haven't tested it yet, but I'd be
interested to see what happens when two jobs in the job server have
conflicting classes. Does the server correctly segregate each job's
classes
polkosity, have you seen the job server that Ooyala open sourced? I think
it's very similar to what you're proposing with a REST API and re-using a
SparkContext.
https://github.com/apache/incubator-spark/pull/222
http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server
On Mon, Mar