polkosity, have you seen the job server that Ooyala open sourced? I think
it's very similar to what you're proposing with a REST API and re-using a
SparkContext.
https://github.com/apache/incubator-spark/pull/222
http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server
On Mon, Mar
Hi Punya,
This seems like a problem that the recently-announced job-server would
likely have run into at one point. I haven't tested it yet, but I'd be
interested to see what happens when two jobs in the job server have
conflicting classes. Does the server correctly segregate each job's
classes
Jim, I'm starting to document the heap size settings all in one place,
which has been a confusion for a lot of my peers. Maybe you can take a
look at this ticket?
https://spark-project.atlassian.net/browse/SPARK-1264
On Wed, Mar 19, 2014 at 12:53 AM, Jim Blomo jim.bl...@gmail.com wrote:
To
FWIW I've seen correctness errors with spark.shuffle.spill on 0.9.0 and
have it disabled now. The specific error behavior was that a join would
consistently return one count of rows with spill enabled and another count
with it disabled.
Sent from my mobile phone
On Mar 22, 2014 1:52 PM, Kane
My thought would be to key by the first item in each array, then take just
one array for each key. Something like the below:
val v = sc.parallelize(Seq(Seq(1,2,3,4), Seq(1,5,2,3), Seq(2,3,4,5)))
val col = 0
val output = v.keyBy(_(col)).reduceByKey((a, b) => a).values
On Tue, Mar 25, 2014 at 1:21 AM, Chengi Liu
Possibly one of your executors is in the middle of a large stop-the-world
GC and doesn't respond to network traffic during that period? If you
shared some information about how each node in your cluster is set up (heap
size, memory, CPU, etc) that might help with debugging.
Andrew
On Mon, Mar
Is this Spark 0.9.0? Try setting spark.shuffle.spill=false. There was a hash
collision bug that's fixed in 0.9.1 that might cause you to have too few
results in that join.
Sent from my mobile phone
On Mar 28, 2014 8:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Weird, how exactly are you
I occasionally see links to pages in the spark.incubator.apache.org domain.
Can we HTTP 301 redirect that whole domain to spark.apache.org now that
the project has graduated? The content seems identical.
That would also make the eventual decommission of the incubator domain much
easier as usage
Not that I know of, but it would be great if that was supported. The way I
typically handle security now is to put the Spark servers in their own
subnet with strict inbound/outbound firewalls.
On Tue, Apr 8, 2014 at 1:14 PM, kamatsuoka ken...@gmail.com wrote:
Can Spark be configured to use
If you set up Spark's metrics reporting to write to the Ganglia backend
that will give you a good idea of how much network/disk/CPU is being used
and on what machines.
https://spark.apache.org/docs/0.9.0/monitoring.html
On Tue, Apr 8, 2014 at 12:57 PM, yxzhao yxz...@ualr.edu wrote:
Hi All,
One thing you could do is create an RDD of [1,2,3] and set a partitioner
that puts all three values on their own nodes. Then .foreach() over the
RDD and call your function that will run on each node.
Why do you need to run the function on every node? Is it some sort of
setup code that needs to
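Something along these lines is what I have in mind (untested sketch; runSetup and the cluster size of 3 are made up, and a partitioner only guarantees separate partitions, not strictly one per node):
import org.apache.spark.Partitioner
import org.apache.spark.SparkContext._   // pair-RDD operations (pre-1.3 style)
// One key per partition: key i goes to partition i - 1.
class OneKeyPerPartition(n: Int) extends Partitioner {
  def numPartitions: Int = n
  def getPartition(key: Any): Int = key.asInstanceOf[Int] - 1
}
// sc is the SparkContext from spark-shell; runSetup(i) is the hypothetical per-node setup.
val setupRdd = sc.parallelize(Seq(1, 2, 3).map(i => (i, i)), 3)
  .partitionBy(new OneKeyPerPartition(3))
setupRdd.foreach { case (i, _) => runSetup(i) }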
A JVM can easily be limited in how much memory it uses with the -Xmx
parameter, but Python doesn't have memory limits built in in such a
first-class way. Maybe the memory limits aren't making it to the python
executors.
What was your SPARK_MEM setting? The JVM below seems to be using 603201
For 1, persist can be used to save an RDD to disk using the various
persistence levels. When a persistence level is set on an RDD, when that
RDD is evaluated it's saved to memory/disk/elsewhere so that it can be
re-used. It's applied to that RDD, so that subsequent uses of the RDD can
use the
the entire RDD has been realized in memory. Is that
correct?
-Suren
On Wed, Apr 9, 2014 at 9:25 AM, Andrew Ash and...@andrewash.com wrote:
For 1, persist can be used to save an RDD to disk using the various
persistence levels. When a persistence level is set on an RDD, when that
RDD
The typical way to handle that use case would be to join the 3 files
together into one RDD and then do the factorization on that. There will
definitely be network traffic during the initial join to get everything
into one table, and after that there will likely be more network traffic
for various
and then call persist on the resulting RDD. So I'm
wondering if groupByKey is aware of the subsequent persist setting to use
disk or just creates the Seq[V] in memory and only uses disk after that
data structure is fully realized in memory.
-Suren
On Wed, Apr 9, 2014 at 9:46 AM, Andrew Ash
The biggest issue I've come across is that the cluster is somewhat unstable
when under memory pressure. Meaning that if you attempt to persist an RDD
that's too big for memory, even with MEMORY_AND_DISK, you'll often still
get OOMs. I had to carefully modify some of the space tuning parameters
The naive way would be to put all the users and their attributes into an
RDD, then cartesian product that with itself. Run the similarity score on
every pair (1M * 1M = 1T scores), map to (user, (score, otherUser)) and
take the .top(k) for each user.
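Roughly like this (untested sketch on toy data, with a stand-in similarity function):
import org.apache.spark.SparkContext._   // pair-RDD operations (pre-1.3 style)
case class User(id: Long, features: Array[Double])
// Stand-in for the real similarity score.
def similarity(a: User, b: User): Double =
  a.features.zip(b.features).map { case (x, y) => x * y }.sum
// sc is the SparkContext from spark-shell; the users here are illustrative.
val users = sc.parallelize(Seq(
  User(1, Array(1.0, 0.0)), User(2, Array(0.5, 0.5)), User(3, Array(0.0, 1.0))))
val k = 10
val topK = users.cartesian(users)
  .filter { case (a, b) => a.id != b.id }
  .map { case (a, b) => (a.id, (similarity(a, b), b.id)) }
  .groupByKey()
  .mapValues(_.toSeq.sortBy(-_._1).take(k))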
I doubt that you'll be able to take this
once, so as long as
I get it done (even if it's a more manual process than it should be) is ok.
Hope that helps!
Andrew
On Sun, Apr 13, 2014 at 4:33 PM, Jim Blomo jim.bl...@gmail.com wrote:
On Thu, Apr 10, 2014 at 12:24 PM, Andrew Ash and...@andrewash.com wrote:
The biggest issue I've come
A lot of your time is being spent in garbage collection (second image).
Maybe your dataset doesn't easily fit into memory? Can you reduce the
number of new objects created in myFun?
How big are your heap sizes?
Another observation is that in the 4th image some of your RDDs are massive
and some
Are your IPRanges all on nice, even CIDR-format ranges? E.g. 192.168.0.0/16 or
10.0.0.0/8?
If the range is always an even subnet mask and not split across subnets,
I'd recommend flatMapping the ipToUrl RDD to (IPRange, String) and then
joining the two RDDs. The expansion would be at most 32x if
Hi Spark users,
I've always done all my Spark work in Scala, but occasionally people ask
about Python and its performance impact vs the same algorithm
implementation in Scala.
Has anyone done tests to measure the difference?
Anecdotally I've heard Python is a 40% slowdown but that's entirely
a
mapPartitions() operation to let you do whatever you want with a partition.
What I need is a way to get my hands on two partitions at once, each from
different RDDs.
Any ideas?
Thanks,
Roger
On Mon, Apr 14, 2014 at 5:45 PM, Andrew Ash and...@andrewash.com wrote:
Are your IPRanges all
The homepage for Ooyala's job server is here:
https://github.com/ooyala/spark-jobserver
They decided (I think with input from the Spark team) that it made more
sense to keep the jobserver in a separate repository for now.
Andrew
On Fri, Apr 18, 2014 at 5:42 AM, Azuryy Yu azury...@gmail.com
That thread was mostly about benchmarking YARN vs standalone, and the
results are what I'd expect -- spinning up a Spark cluster on demand
through YARN has higher startup latency than using a standalone cluster,
where the JVMs are already initialized and ready.
Given that there's a lot more
, at 12:08 PM, Andrew Ash and...@andrewash.com wrote:
That thread was mostly about benchmarking YARN vs standalone, and the
results are what I'd expect -- spinning up a Spark cluster on demand
through YARN has higher startup latency than using a standalone cluster,
where the JVMs are already
For the second question, you can submit multiple jobs through the same
SparkContext via different threads and this is a supported way of
interacting with Spark.
From the documentation:
Second, *within* each Spark application, multiple “jobs” (Spark actions)
may be running concurrently if they
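For example, something along these lines (minimal untested sketch; the master and app name are illustrative):
import org.apache.spark.{SparkConf, SparkContext}
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("concurrent-actions"))
val data = sc.parallelize(1 to 1000000)
// Each thread submits its own action; the scheduler runs the resulting jobs concurrently.
val t1 = new Thread(new Runnable { def run(): Unit = println("total = " + data.reduce(_ + _)) })
val t2 = new Thread(new Runnable { def run(): Unit = println("count = " + data.count()) })
t1.start(); t2.start()
t1.join(); t2.join()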
The problem is that equally-sized partitions take variable time to complete
based on their contents?
Sent from my mobile phone
On May 1, 2014 8:31 AM, deenar.toraskar deenar.toras...@db.com wrote:
Hi
I am using Spark to distribute computationally intensive tasks across the
cluster. Currently
Deenar,
I haven't heard of any activity to do partitioning in that way, but it does
seem more broadly valuable.
On Fri, May 2, 2014 at 10:15 AM, deenar.toraskar deenar.toras...@db.com wrote:
I have equal sized partitions now, but I want the RDD to be partitioned
such
that the partitions are
= {}) // Force evaluation
oldRdd.unpersist(true)
Following my usage pattern I tried not unpersisting the intermediate
RDDs (i.e. oldRdd), but nothing changed.
Any hints? How could I debug this?
2014-04-14 12:55 GMT+02:00 Andrew Ash and...@andrewash.com:
A lot of your time is being
Hi Weide,
The answer to your first question about local[2] can be found in the
Running the Examples and Shell section of
https://spark.apache.org/docs/latest/
Note that all of the sample programs take a master parameter specifying
the cluster URL to connect to. This can be a URL for a
Hi Eduardo,
Yep those machines look pretty well synchronized at this point. Just
wanted to throw that out there and eliminate it as a possible source of
confusion.
Good luck on continuing the debugging!
Andrew
On Sat, May 3, 2014 at 11:59 AM, Eduardo Costa Alfaia
e.costaalf...@unibs.it
Are you setting a core limit with spark.cores.max? If you don't, in coarse
mode each Spark job uses all available cores on Mesos and doesn't let them
go until the job is terminated. At which point the other job can access
the cores.
https://spark.apache.org/docs/latest/running-on-mesos.html --
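For example (minimal sketch; the master URL and the value 8 are illustrative):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setMaster("mesos://master:5050")   // hypothetical Mesos master
  .setAppName("capped-cores-job")
  .set("spark.cores.max", "8")        // leave the remaining cores for other jobs
val sc = new SparkContext(conf)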
There's an undocumented mode that looks like it simulates a cluster:
SparkContext.scala:
// Regular expression for simulating a Spark cluster of [N, cores,
memory] locally
val LOCAL_CLUSTER_REGEX =
"""local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r
Can you try running your tests
, 2014 at 9:19 PM, Andrew Ash and...@andrewash.com wrote:
Hi all,
Is anyone reading and writing to .bz2 files stored in HDFS from Spark
with
success?
I'm finding the following results on a recent commit (756c96 from 24hr
ago)
and CDH 4.4.0:
Works: val r = sc.textFile(/user/aa
They are different terminology for the same thing and should be
interchangeable.
On Fri, May 16, 2014 at 2:02 PM, Robert James srobertja...@gmail.com wrote:
What is the difference between a Spark Worker and a Spark Slave?
Spark's sc.textFile() method
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456)
delegates to sc.hadoopFile(), which uses Hadoop's
Hi Shangyu (and everyone else looking to unsubscribe!),
If you'd like to get off this mailing list, please send an email to
user-unsubscribe@spark.apache.org, not the regular user@spark.apache.org list.
How to use the Apache mailing list infrastructure is documented here:
?
I know many people might not like it, but maybe the list messages
should have a footer with this administrative info (even if it's just
a link to the archive page)?
On Sun, May 18, 2014 at 1:49 PM, Andrew Ash and...@andrewash.com wrote:
If you'd like to get off this mailing list, please send
Hi yxzhao,
Those are branches in the source code git repository. You can get to them
with git checkout branch-1.0 once you've cloned the git repository.
Cheers,
Andrew
On Mon, May 19, 2014 at 8:30 PM, yxzhao yxz...@ualr.edu wrote:
Thanks Xiangrui,
Sorry I am new for Spark, could
with multiple cores.
2) BZip2 files are big enough or minPartitions is large enough when
you load the file via sc.textFile(), so that one worker has more than
one tasks.
Best,
Xiangrui
On Fri, May 16, 2014 at 4:06 PM, Andrew Ash and...@andrewash.com
wrote:
Hi Xiangrui,
// FYI I'm
Is your RDD of Strings? If so, you should make sure to use the Kryo
serializer instead of the default Java one. It stores strings as UTF8
rather than Java's default UTF16 representation, which can save you half
the memory usage in the right situation.
Try setting the persistence level on the
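Concretely, something like this (untested sketch; the master and path are made up):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
val conf = new SparkConf()
  .setMaster("local[2]")              // illustrative master
  .setAppName("kryo-strings")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
val lines = sc.textFile("hdfs:///data/strings")   // hypothetical path
lines.persist(StorageLevel.MEMORY_ONLY_SER)       // cache in serialized (Kryo) form
lines.count()                                     // first action populates the cache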
Hi Puneet,
If you're not going to read/write data in HDFS from your Spark cluster,
then it doesn't matter which one you download. Just go with Hadoop 2 as
that's more likely to connect to an HDFS cluster in the future if you ever
do decide to use HDFS, because it uses the newer APIs.
Cheers,
Andrew
If the distribution of the keys in your groupByKey is skewed (some keys
appear way more often than others) you should consider modifying your job
to use reduceByKey instead wherever possible.
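For example (toy data; sc is the SparkContext from spark-shell):
import org.apache.spark.SparkContext._   // pair-RDD operations (pre-1.3 style)
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val grouped = pairs.groupByKey().mapValues(_.sum)   // ships every value for a key across the network
val reduced = pairs.reduceByKey(_ + _)              // combines map-side first, so skewed keys shuffle far less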
On May 20, 2014 12:53 PM, Jon Keebler jkeeble...@gmail.com wrote:
So we upped the spark.akka.frameSize
Here's the 1.0.0rc9 version of the docs:
https://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/running-on-mesos.html
I refreshed them with the goal of steering users more towards prebuilt
packages than relying on compiling from source plus improving overall
formatting and clarity, but not
Hi Mohit,
The log line about the ExternalAppendOnlyMap is more of a symptom of
slowness than causing slowness itself. The ExternalAppendOnlyMap is used
when a shuffle is causing too much data to be held in memory. Rather than
OOM'ing, Spark writes the data out to disk in a sorted order and
One thing you can try is to pull each file out of S3 and decompress with
gzip -d to see if it works. I'm guessing there's a corrupted .gz file
somewhere in your path glob.
Andrew
On Wed, May 21, 2014 at 12:40 PM, Michael Cutler mich...@tumra.com wrote:
Hi Nick,
Which version of Hadoop are
at 8:36 PM, Andrew Ash and...@andrewash.com wrote:
Hi Mohit,
The log line about the ExternalAppendOnlyMap is more of a symptom of
slowness than causing slowness itself. The ExternalAppendOnlyMap is used
when a shuffle is causing too much data to be held in memory. Rather than
OOM'ing, Spark
Hi everyone,
I've also been interested in better understanding what ports are used where
and the direction the network connections go. I've observed a running
cluster and read through code, and came up with the below documentation
addition.
https://github.com/apache/spark/pull/856
Scott and
Hi Jamal,
I don't believe there are pre-written algorithms for Cosine similarity or
Pearson correlation in PySpark that you can re-use. If you end up writing
your own implementation of the algorithm though, the project would
definitely appreciate if you shared that code back with the project for
.
Martin
On 13.05.2014 at 08:48, Andrew Ash wrote:
Are you setting a core limit with spark.cores.max? If you don't, in
coarse mode each Spark job uses all available cores on Mesos and doesn't
let them go until the job is terminated. At which point the other job can
access the cores.
https
Hi Randy,
In Spark 1.0 there was a lot of work done to allow unpersisting data that's
no longer needed. See the below pull request.
Try running kvGlobal.unpersist() on line 11 before the re-broadcast of the
next variable to see if you can cut the dependency there.
Hi Andrea,
What version of Spark are you using? There were some improvements in how
Spark uses Kryo in 0.9.1 and to-be 1.0 that I would expect to improve this.
Also, can you share your registrator's code?
Another possibility is that Kryo can have some difficulty serializing very
large objects.
it aligns!
Jacob
Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075
Also see this context from February. We started working with Chill to get
Avro records automatically registered with Kryo. I'm not sure the final
status, but from the Chill PR #172 it looks like this might be much less
friction than before.
Issue we filed:
Hi Carter,
In Spark 1.0 there will be an implementation of k-means available as part
of MLLib. You can see the documentation for that below (until 1.0 is fully
released).
https://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/mllib-clustering.html
Maybe diving into the source here will help
Your applications are probably not connecting to your existing cluster and
instead running in local mode. Are you passing the master URL to the
SparkPi application?
Andrew
On Tue, Jun 3, 2014 at 12:30 AM, MrAsanjar . afsan...@gmail.com wrote:
- HI all,
- Application running and
current conclusion is that the best option would be to roll an own
saveHdfsFile(...)
Would you agree?
-greetz, Gerard.
[1]
http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job
On Mon, Jun 2, 2014 at 11:44 PM, Andrew Ash and...@andrewash.com wrote
Hi Mayur, is that closure cleaning a JVM issue or a Spark issue? I'm used
to thinking of closure cleaner as something Spark built. Do you have
somewhere I can read more about this?
On Tue, Jun 3, 2014 at 12:47 PM, Mayur Rustagi mayur.rust...@gmail.com
wrote:
So are you using Java 7 or 8.
7
Just curious, what do you want your custom RDD to do that the normal ones
don't?
On Wed, Jun 4, 2014 at 6:30 AM, bluejoe2008 bluejoe2...@gmail.com wrote:
hi, folks,
is there any easier way to define a custom RDD in Java?
I am wondering if I have to define a new java class which
nilmish,
To confirm your code is using kryo, go to the web ui of your application
(defaults to :4040) and look at the environment tab. If your serializer
settings are there then things should be working properly.
I'm not sure how to confirm that it works against typos in the setting, but
you
You can change storage level on an individual RDD with
.persist(StorageLevel.MEMORY_AND_DISK), but I don't think you can change
what the default persistence level is for RDDs.
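For example (the path is made up; sc is the SparkContext from spark-shell):
import org.apache.spark.storage.StorageLevel
val rdd = sc.textFile("hdfs:///some/input")
rdd.persist(StorageLevel.MEMORY_AND_DISK)   // partitions that don't fit in memory spill to disk
rdd.count()                                 // first action populates the cache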
Andrew
On Wed, Jun 4, 2014 at 1:52 AM, Salih Kardan karda...@gmail.com wrote:
Hi
I'm using Spark 0.9.1 and Shark
When you group by IP address in step 1 to this:
(ip1,(lat1,lon1),(lat2,lon2))
(ip2,(lat3,lon3),(lat4,lat5))
How many lat/lon locations do you expect for each IP address? avg and max
are interesting.
Andrew
On Wed, Jun 4, 2014 at 5:29 AM, Oleg Proudnikov
Hi Oleg,
I set the size of my executors on a standalone cluster when using the shell
like this:
./bin/spark-shell --master $MASTER --total-executor-cores
$CORES_ACROSS_CLUSTER --driver-java-options
-Dspark.executor.memory=$MEMORY_PER_EXECUTOR
It doesn't seem particularly clean, but it works.
-Dspark.executor.memory=$MEMORY_PER_EXECUTOR
I get
bad option: '--driver-java-options'
There must be something different in my setup. Any ideas?
Thank you again,
Oleg
On 5 June 2014 22:28, Andrew Ash and...@andrewash.com wrote:
Hi Oleg,
I set the size of my executors on a standalone cluster when
Hi Ajay,
Can you please try running the same code with spark.shuffle.spill=false and
see if the numbers turn out correctly? That parameter controls whether or
not the buggy code that Matei fixed in ExternalAppendOnlyMap is used.
FWIW I saw similar issues in 0.9.0 but no longer in 0.9.1 after I
Hi Roger,
You should be able to sort within partitions using the rdd.mapPartitions()
method, and that shouldn't require holding all data in memory at once. It
does require holding the entire partition in memory though. Do you need
the partition to never be held in memory all at once?
As far as
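A minimal sketch of the within-partition sort I mean (toy data; sc is the SparkContext from spark-shell):
val rdd = sc.parallelize(Seq(5, 3, 1, 4, 2), 2)
val sortedWithinPartitions = rdd.mapPartitions(iter => iter.toArray.sorted.iterator)
// each of the two partitions is sorted independently, without a full shuffle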
Andrew,
This is a standalone cluster. And, yes, if my understanding of Spark
terminology is correct, you are correct about the port ownerships.
Jacob
Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075
Gerard,
Strings in particular are very inefficient because they're stored in a
two-byte format by the JVM. If you use the Kryo serializer and use
StorageLevel.MEMORY_ONLY_SER then Kryo stores Strings in UTF8, which for
ASCII-like strings will take half the space.
Andrew
On Tue, Jun 17,
In Spark you can use the normal globs supported by Hadoop's FileSystem,
which are documented here:
http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)
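For example (paths made up; sc is the SparkContext from spark-shell):
val daily = sc.textFile("hdfs:///logs/2014-06-*/part-*")
val twoYears = sc.textFile("hdfs:///data/{2013,2014}/events/*.gz")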
On Wed, Jun 18, 2014 at 12:09 AM, MEETHU MATHEW meethu2...@yahoo.co.in
wrote:
What's the advantage of Apache maintaining the brew installer vs users?
Apache handling it means more work on this dev team, but probably a better
experience for brew users. Just wanted to weigh pros/cons before
committing to support this installation method.
Andrew
On Wed, Jun 18, 2014 at
Wait, so the file only has four lines and the job is running out of heap
space? Can you share the code you're running that does the processing?
I'd guess that you're doing some intense processing on every line but just
writing parsed case classes back to disk sounds very lightweight.
I
On Wed,
Sounds good. Mingyu and I are waiting on 1.0.1 to get the fix for the
below issues without running a patched version of Spark:
https://issues.apache.org/jira/browse/SPARK-1935 -- commons-codec version
conflicts for client applications
https://issues.apache.org/jira/browse/SPARK-2043 --
Hi Sameer,
If you set those two IDs to be a Tuple2 in the key of the RDD, then you can
join on that tuple.
Example:
val rdd1: RDD[Tuple3[Int, Int, String]] = ...
val rdd2: RDD[Tuple3[Int, Int, String]] = ...
val resultRDD = rdd1.map(k => ((k._1, k._2), k._3)).join(
rdd2.map(k =>
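A self-contained version of that pattern on toy data (untested sketch):
import org.apache.spark.SparkContext._   // pair-RDD operations (pre-1.3 style)
// sc is the SparkContext from spark-shell; the tuples are (id1, id2, payload).
val rdd1 = sc.parallelize(Seq((1, 10, "a"), (2, 20, "b")))
val rdd2 = sc.parallelize(Seq((1, 10, "x"), (2, 20, "y")))
val resultRDD = rdd1.map(k => ((k._1, k._2), k._3))
  .join(rdd2.map(k => ((k._1, k._2), k._3)))
// resultRDD: RDD[((Int, Int), (String, String))]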
Hi Nick,
The cluster I was working on in those linked messages was a private data
center cluster, not on EC2. I'd imagine that the setup would be pretty
similar, but I'm not familiar with the EC2 init scripts that Spark uses.
Also I upgraded that cluster to 1.0 recently and am continuing to use
In general it would be nice to be able to configure replication on a
per-job basis. Is there a way to do that without changing the config
values in the Hadoop conf/ directory between jobs? Maybe by modifying
OutputFormats or the JobConf ?
On Mon, Jul 14, 2014 at 11:12 PM, Matei Zaharia
Hi Nan,
Great digging in -- that makes sense to me for when a job is producing some
output handled by Spark like a .count or .distinct or similar.
For the other part of the question, I'm also interested in side effects
like an HDFS disk write. If one task is writing to an HDFS path and
another
I'm not sure if you guys ever picked a preferred method for doing this, but
I just encountered it and came up with this method that's working
reasonably well on a small dataset. It should be quite easily
generalizable to non-String RDDs.
def addRowNumber(r: RDD[String]): RDD[Tuple2[Long,String]]
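The body of the method is cut off here, but the usual shape of such a function is something like this (untested sketch; not necessarily the exact code from the original message):
import org.apache.spark.rdd.RDD
def addRowNumber(r: RDD[String]): RDD[(Long, String)] = {
  // element count of each partition, ordered by partition id
  val counts: Array[Long] = r.mapPartitionsWithIndex {
    (idx, iter) => Iterator((idx, iter.size.toLong))
  }.collect().sortBy(_._1).map(_._2)
  // starting row number for each partition
  val offsets: Array[Long] = counts.scanLeft(0L)(_ + _)
  r.mapPartitionsWithIndex { (idx, iter) =>
    iter.zipWithIndex.map { case (line, i) => (offsets(idx) + i, line) }
  }
}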
Hi Martin,
In standalone mode, each SparkContext you initialize gets its own set of
executors across the cluster. So for example if you have two shells open,
they'll each get two JVMs on each worker machine in the cluster.
As far as the other docs, you can configure the total number of cores
Hi Oleg,
Did you ever figure this out? I'm observing the same exception also in
0.9.1 and think it might be related to setting spark.speculation=true. My
theory is that multiple attempts at the same task start, the first finishes
and cleans up the _temporary directory, and then the second fails
Hi Jikai,
It looks like you're trying to run a Spark job on data that's stored in
HDFS in .lzo format. Spark can handle this (I do it all the time), but you
need to configure your Spark installation to know about the .lzo format.
There are two parts to the hadoop lzo library -- the first is the
Yes, I've done it before.
On Thu, Aug 7, 2014 at 10:18 PM, Gary Zhao garyz...@gmail.com wrote:
Hello
Is it possible to use spark-cassandra-connector in spark-shell?
Thanks
Gary
7, 2014 at 10:20 PM, Andrew Ash and...@andrewash.com wrote:
Yes, I've done it before.
On Thu, Aug 7, 2014 at 10:18 PM, Gary Zhao garyz...@gmail.com wrote:
Hello
Is it possible to use spark-cassandra-connector in spark-shell?
Thanks
Gary
I've also been seeing similar stacktraces on Spark core (not streaming) and
have a theory it's related to spark.speculation being turned on. Do you
have that enabled by chance?
On Mon, Aug 11, 2014 at 8:10 AM, Chen Song chen.song...@gmail.com wrote:
Bill
Did you get this resolved somehow?
:13 AM, Andrew Ash and...@andrewash.com wrote:
I've also been seeing similar stacktraces on Spark core (not streaming)
and have a theory it's related to spark.speculation being turned on. Do
you have that enabled by chance?
On Mon, Aug 11, 2014 at 8:10 AM, Chen Song chen.song...@gmail.com
Hi Chen,
Please see the bug I filed at
https://issues.apache.org/jira/browse/SPARK-2984 with the
FileNotFoundException on _temporary directory issue.
Andrew
On Mon, Aug 11, 2014 at 10:50 PM, Andrew Ash and...@andrewash.com wrote:
Not sure which stalled HDFS client issue you're referring
// assuming Spark 1.0
Hi Baoqiang,
In my experience for the standalone cluster you need to set
SPARK_WORKER_DIR not SPARK_LOCAL_DIRS to control where shuffle files are
written. I think this is a documentation issue that could be improved, as
Hi Deb,
If you don't have long-running Spark applications (those taking more than
spark.worker.cleanup.appDataTtl) then the TTL-based cleaner is a good
solution. If however you have a mix of long-running and short-running
applications, then the TTL-based solution will fail. It will clean up
What happens when a run of numbers is spread across a partition boundary?
I think you might end up with two adjacent groups of the same value in
that situation.
On Mon, Aug 18, 2014 at 2:05 AM, Davies Liu dav...@databricks.com wrote:
import itertools
l = [1,1,1,2,2,3,4,4,5,1]
gs =
I'm actually not sure the Spark+Mesos integration supports dynamically
allocating memory (it does support dynamically allocating cores though).
Has anyone here actually used Spark+Mesos on heterogenous hardware and
done dynamic memory allocation?
My understanding is that each Spark executor
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L114
- where Spark accepts sc.executorMemory of a resource offer, regardless of
how much more memory was available
On Thu, Aug 21, 2014 at 2:12 PM, Andrew Ash and...@andrewash.com
Hi Patrick,
For the spilling within on key work you mention might land in Spark 1.2, is
that being tracked in https://issues.apache.org/jira/browse/SPARK-1823 or
is there another ticket I should be following?
Thanks!
Andrew
On Tue, Aug 5, 2014 at 3:39 PM, Patrick Wendell pwend...@gmail.com
Hi Grega,
Did you ever get this figured out? I'm observing the same issue in Spark
1.0.2.
For me it was after 1.5hr of a large .distinct call, followed by a
.saveAsTextFile()
14/08/26 20:57:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned
task 18500
14/08/26 20:57:43 INFO
Hi Dhimant,
We also cleaned up these needless warnings on port failover in Spark 1.1 --
see https://issues.apache.org/jira/browse/SPARK-1902
Andrew
On Thu, Sep 4, 2014 at 7:38 AM, Dhimant dhimant84.jays...@gmail.com wrote:
Thanks Yana,
I am able to execute application and command via
nicholas.cham...@gmail.com wrote:
Andrew,
This email was pretty helpful. I feel like this stuff should be
summarized
in the docs somewhere, or perhaps in a blog post.
Do you know if it is?
Nick
On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash and...@andrewash.com wrote:
The locality
Hi Nicolas,
I've had suspicions about speculation causing problems on my cluster but
don't have any hard evidence of it yet.
I'm also interested in why it's turned off by default.
On Tue, Sep 16, 2014 at 3:01 PM, Nicolas Mai nicolas@gmail.com wrote:
Hi, guys
My current project is using
Hi Harsha,
You could look through the GraphX source to see the approach taken there
for ideas in your own. I'd recommend starting at
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/Graph.scala#L385
to see the storage technique.
Why do you want to avoid
Hi Burak,
Most discussions of checkpointing in the docs is related to Spark
streaming. Are you talking about the sparkContext.setCheckpointDir()?
What effect does that have?
https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
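For reference, the core (non-streaming) API I'm asking about looks like this (directory is hypothetical; sc is the SparkContext from spark-shell):
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
val data = sc.parallelize(1 to 1000).map(_ * 2)
data.checkpoint()   // marks the RDD; it's written out the next time it's computed
data.count()        // triggers the computation and the checkpoint write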
On Wed, Sep 17, 2014 at 7:44 AM,
Thanks for the info!
Are there performance impacts with writing to HDFS instead of local disk?
I'm assuming that's why ALS checkpoints every third iteration instead of
every iteration.
Also I can imagine that checkpointing should be done every N shuffles
instead of every N operations (counting
in Spark Streaming, and some MLlib
algorithms. If you can help with the guide, I think it would be a nice
feature to have!
Burak
- Original Message -
From: Andrew Ash and...@andrewash.com
To: Burak Yavuz bya...@stanford.edu
Cc: Макар Красноперов connector@gmail.com, user
user