Re: ExternalAppendOnlyMap: Spilling in-memory map

2014-05-22 Thread Andrew Ash
at 8:36 PM, Andrew Ash and...@andrewash.com wrote: Hi Mohit, The log line about the ExternalAppendOnlyMap is more of a symptom of slowness than causing slowness itself. The ExternalAppendOnlyMap is used when a shuffle is causing too much data to be held in memory. Rather than OOM'ing, Spark

Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Andrew Ash
Here's the 1.0.0rc9 version of the docs: https://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/running-on-mesos.html I refreshed them with the goal of steering users more towards prebuilt packages than relying on compiling from source plus improving overall formatting and clarity, but not

Re: ExternalAppendOnlyMap: Spilling in-memory map

2014-05-21 Thread Andrew Ash
Hi Mohit, The log line about the ExternalAppendOnlyMap is more of a symptom of slowness than causing slowness itself. The ExternalAppendOnlyMap is used when a shuffle is causing too much data to be held in memory. Rather than OOM'ing, Spark writes the data out to disk in a sorted order and
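
For reference, a minimal sketch of the shuffle-spilling settings involved (property names as documented for Spark 0.9/1.0; the values are illustrative, not recommendations):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-spill-example")
      .set("spark.shuffle.spill", "true")            // let the ExternalAppendOnlyMap spill to disk instead of OOM'ing
      .set("spark.shuffle.memoryFraction", "0.3")    // fraction of the heap shuffles may use before spilling
    val sc = new SparkContext(conf)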

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Andrew Ash
One thing you can try is to pull each file out of S3 and decompress with gzip -d to see if it works. I'm guessing there's a corrupted .gz file somewhere in your path glob. Andrew On Wed, May 21, 2014 at 12:40 PM, Michael Cutler mich...@tumra.com wrote: Hi Nick, Which version of Hadoop are
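
A quick way to isolate the bad file from the spark-shell, sketched below (the S3 paths are made up; the exception usually surfaces on the count over the corrupt file):

    val candidates = Seq(
      "s3n://my-bucket/logs/part-0000.gz",   // hypothetical paths
      "s3n://my-bucket/logs/part-0001.gz")

    candidates.foreach { path =>
      try {
        println(path + " -> " + sc.textFile(path).count() + " lines")
      } catch {
        case e: Exception => println(path + " looks corrupted: " + e.getMessage)
      }
    }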

Re: Spark and Hadoop

2014-05-20 Thread Andrew Ash
Hi Puneet, If you're not going to read/write data in HDFS from your Spark cluster, then it doesn't matter which one you download. Just go with Hadoop 2: it uses the newer APIs, so it's more likely to connect cleanly to an HDFS cluster if you ever do decide to use HDFS in the future. Cheers, Andrew

Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-20 Thread Andrew Ash
If the distribution of the keys in your groupByKey is skewed (some keys appear way more often than others) you should consider modifying your job to use reduceByKey instead wherever possible. On May 20, 2014 12:53 PM, Jon Keebler jkeeble...@gmail.com wrote: So we upped the spark.akka.frameSize
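
To illustrate the difference (a toy sketch; `pairs` is a hypothetical key/value RDD):

    import org.apache.spark.SparkContext._   // pair-RDD operations

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 1)))

    // groupByKey ships every value for a key to a single task -- painful for skewed keys
    val viaGroup  = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines values map-side first, so hot keys move far less data in the shuffle
    val viaReduce = pairs.reduceByKey(_ + _)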

Re: unsubscribe

2014-05-19 Thread Andrew Ash
? I know many people might not like it, but maybe the list messages should have a footer with this administrative info (even if it's just a link to the archive page)? On Sun, May 18, 2014 at 1:49 PM, Andrew Ash and...@andrewash.com wrote: If you'd like to get off this mailing list, please send

Re: How to run the SVM and LogisticRegression

2014-05-19 Thread Andrew Ash
Hi yxzhao, Those are branches in the source code git repository. You can get to them with git checkout branch-1.0 once you've cloned the git repository. Cheers, Andrew On Mon, May 19, 2014 at 8:30 PM, yxzhao yxz...@ualr.edu wrote: Thanks Xiangrui, Sorry I am new for Spark, could

Re: Reading from .bz2 files with Spark

2014-05-19 Thread Andrew Ash
with multiple cores. 2) BZip2 files are big enough or minPartitions is large enough when you load the file via sc.textFile(), so that one worker has more than one task. Best, Xiangrui On Fri, May 16, 2014 at 4:06 PM, Andrew Ash and...@andrewash.com wrote: Hi Xiangrui, // FYI I'm
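
For example (path and partition count illustrative), a larger minimum number of partitions spreads the file across more tasks:

    val r = sc.textFile("hdfs:///data/big-file.bz2", 16)   // 16 = minimum number of partitions
    println(r.partitions.size)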

Re: Problem when sorting big file

2014-05-19 Thread Andrew Ash
Is your RDD of Strings? If so, you should make sure to use the Kryo serializer instead of the default Java one. It stores strings as UTF8 rather than Java's default UTF16 representation, which can save you half the memory usage in the right situation. Try setting the persistence level on the
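
A small sketch of the two knobs mentioned above (property name and storage level as in the Spark docs; the input path is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("kryo-strings")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    val strings = sc.textFile("hdfs:///data/strings")
    strings.persist(StorageLevel.MEMORY_ONLY_SER)   // store serialized (Kryo) rather than as Java objects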

Re: File list read into single RDD

2014-05-18 Thread Andrew Ash
Spark's sc.textFile() (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456) method delegates to sc.hadoopFile(), which uses Hadoop's
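
Because the call bottoms out in Hadoop's FileInputFormat, the usual ways to read a list of files into one RDD are a comma-separated path list or a union, e.g. (paths are illustrative):

    val files = Seq("/data/2014-05-01.log", "/data/2014-05-02.log", "/data/2014-05-03.log")

    // Option 1: FileInputFormat accepts a comma-separated list of input paths
    val combined = sc.textFile(files.mkString(","))

    // Option 2: read each file separately and union the results
    val unioned = sc.union(files.map(sc.textFile(_)))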

Re: unsubscribe

2014-05-18 Thread Andrew Ash
Hi Shangyu (and everyone else looking to unsubscribe!), If you'd like to get off this mailing list, please send an email to user *-unsubscribe*@spark.apache.org, not the regular user@spark.apache.org list. How to use the Apache mailing list infrastructure is documented here:

Re: Reading from .bz2 files with Spark

2014-05-16 Thread Andrew Ash
, 2014 at 9:19 PM, Andrew Ash and...@andrewash.com wrote: Hi all, Is anyone reading and writing to .bz2 files stored in HDFS from Spark with success? I'm finding the following results on a recent commit (756c96 from 24hr ago) and CDH 4.4.0: Works: val r = sc.textFile(/user/aa

Re: What is the difference between a Spark Worker and a Spark Slave?

2014-05-16 Thread Andrew Ash
They are different terminology for the same thing and should be interchangeable. On Fri, May 16, 2014 at 2:02 PM, Robert James srobertja...@gmail.com wrote: What is the difference between a Spark Worker and a Spark Slave?

Re: Spark unit testing best practices

2014-05-14 Thread Andrew Ash
There's an undocumented mode that looks like it simulates a cluster: SparkContext.scala: // Regular expression for simulating a Spark cluster of [N, cores, memory] locally val LOCAL_CLUSTER_REGEX = """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r Can you try running your tests
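
The corresponding master URL looks like this (undocumented, so subject to change; the numbers are [workers, cores per worker, memory per worker in MB]):

    import org.apache.spark.{SparkConf, SparkContext}

    // 2 simulated workers, 1 core each, 512 MB each -- runs multiple executor JVMs locally
    val conf = new SparkConf().setMaster("local-cluster[2,1,512]").setAppName("cluster-sim-test")
    val sc = new SparkContext(conf)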

Re: Dead lock running multiple Spark jobs on Mesos

2014-05-13 Thread Andrew Ash
Are you setting a core limit with spark.cores.max? If you don't, in coarse mode each Spark job uses all available cores on Mesos and doesn't let them go until the job is terminated. At which point the other job can access the cores. https://spark.apache.org/docs/latest/running-on-mesos.html --
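
A sketch of capping a coarse-mode Mesos job so other jobs can get cores (values illustrative; property names from the linked docs):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("mesos://zk://host1:2181/mesos")   // illustrative Mesos master URL
      .setAppName("capped-job")
      .set("spark.mesos.coarse", "true")
      .set("spark.cores.max", "8")                  // don't hold every core in the cluster until the job ends
    val sc = new SparkContext(conf)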

Re: what's local[n]

2014-05-03 Thread Andrew Ash
Hi Weide, The answer to your first question about local[2] can be found in the Running the Examples and Shell section of https://spark.apache.org/docs/latest/ Note that all of the sample programs take a master parameter specifying the cluster URL to connect to. This can be a URL for a
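
In short, local[N] runs Spark in-process with N worker threads, e.g. (a minimal sketch):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("two-local-threads")
    val sc = new SparkContext(conf)   // driver and executors share one JVM, using 2 threads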

Re: Spark's behavior

2014-05-03 Thread Andrew Ash
Hi Eduardo, Yep those machines look pretty well synchronized at this point. Just wanted to throw that out there and eliminate it as a possible source of confusion. Good luck on continuing the debugging! Andrew On Sat, May 3, 2014 at 11:59 AM, Eduardo Costa Alfaia e.costaalf...@unibs.it

Re: Equally weighted partitions in Spark

2014-05-02 Thread Andrew Ash
Deenar, I haven't heard of any activity to do partitioning in that way, but it does seem more broadly valuable. On Fri, May 2, 2014 at 10:15 AM, deenar.toraskar deenar.toras...@db.com wrote: I have equal sized partitions now, but I want the RDD to be partitioned such that the partitions are

Re: Incredible slow iterative computation

2014-05-02 Thread Andrew Ash
= {}) // Force evaluation oldRdd.unpersist(true) Following my usage pattern I also tried not unpersisting the intermediate RDDs (i.e. oldRdd), but nothing changed. Any hints? How could I debug this? 2014-04-14 12:55 GMT+02:00 Andrew Ash and...@andrewash.com: A lot of your time is being

Re: Equally weighted partitions in Spark

2014-05-01 Thread Andrew Ash
The problem is that equally-sized partitions take variable time to complete based on their contents? Sent from my mobile phone On May 1, 2014 8:31 AM, deenar.toraskar deenar.toras...@db.com wrote: Hi I am using Spark to distribute computationally intensive tasks across the cluster. Currently

Re: launching concurrent jobs programmatically

2014-04-28 Thread Andrew Ash
For the second question, you can submit multiple jobs through the same SparkContext via different threads and this is a supported way of interacting with Spark. From the documentation: Second, *within* each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they
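
A minimal sketch of the threaded pattern (the RDD and actions are illustrative; each thread submits its own action and the scheduler runs the resulting jobs concurrently):

    val data = sc.parallelize(1 to 10000)

    val t1 = new Thread(new Runnable { def run() { println("sum = "   + data.reduce(_ + _)) } })
    val t2 = new Thread(new Runnable { def run() { println("evens = " + data.filter(_ % 2 == 0).count()) } })
    t1.start(); t2.start()
    t1.join(); t2.join()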

Re: Spark on Yarn or Mesos?

2014-04-27 Thread Andrew Ash
That thread was mostly about benchmarking YARN vs standalone, and the results are what I'd expect -- spinning up a Spark cluster on demand through YARN has higher startup latency than using a standalone cluster, where the JVMs are already initialized and ready. Given that there's a lot more

Re: Spark on Yarn or Mesos?

2014-04-27 Thread Andrew Ash
, at 12:08 PM, Andrew Ash and...@andrewash.com wrote: That thread was mostly about benchmarking YARN vs standalone, and the results are what I'd expect -- spinning up a Spark cluster on demand through YARN has higher startup latency than using a standalone cluster, where the JVMs are already

Re: Ooyala Server - plans to merge it into Apache ?

2014-04-20 Thread Andrew Ash
The homepage for Ooyala's job server is here: https://github.com/ooyala/spark-jobserver They decided (I think with input from the Spark team) that it made more sense to keep the jobserver in a separate repository for now. Andrew On Fri, Apr 18, 2014 at 5:42 AM, Azuryy Yu azury...@gmail.com

Re: How to cogroup/join pair RDDs with different key types?

2014-04-16 Thread Andrew Ash
a mapPartitions() operation to let you do whatever you want with a partition. What I need is a way to get my hands on two partitions at once, each from different RDDs. Any ideas? Thanks, Roger On Mon, Apr 14, 2014 at 5:45 PM, Andrew Ash and...@andrewash.com wrote: Are your IPRanges all

Re: Incredible slow iterative computation

2014-04-14 Thread Andrew Ash
A lot of your time is being spent in garbage collection (second image). Maybe your dataset doesn't easily fit into memory? Can you reduce the number of new objects created in myFun? How big are your heap sizes? Another observation is that in the 4th image some of your RDDs are massive and some

Re: How to cogroup/join pair RDDs with different key types?

2014-04-14 Thread Andrew Ash
Are your IPRanges all on nice, even CIDR-format ranges? E.g. 192.168.0.0/16 or 10.0.0.0/8? If the range is always an even subnet mask and not split across subnets, I'd recommend flatMapping the ipToUrl RDD to (IPRange, String) and then joining the two RDDs. The expansion would be at most 32x if
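
A rough sketch of that expansion (types and data are illustrative: IPs as 32-bit Ints, ranges as (network, prefixLength) pairs):

    import org.apache.spark.SparkContext._   // join on pair RDDs

    val ipToUrl      = sc.parallelize(Seq((0xC0A80101, "http://example.com/a")))   // 192.168.1.1
    val rangeToOwner = sc.parallelize(Seq(((0xC0A80000, 16), "corp-network")))     // 192.168.0.0/16

    // Expand each IP into its 32 possible enclosing CIDR blocks (the "at most 32x" blowup)
    val expanded = ipToUrl.flatMap { case (ip, url) =>
      (1 to 32).map { prefixLen =>
        val network = ip & (-1 << (32 - prefixLen))
        ((network, prefixLen), url)
      }
    }

    // Both sides now share a (network, prefixLength) key, so a plain join finds the matches
    val matched = expanded.join(rangeToOwner)   // ((network, prefixLen), (url, owner))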

Scala vs Python performance differences

2014-04-14 Thread Andrew Ash
Hi Spark users, I've always done all my Spark work in Scala, but occasionally people ask about Python and its performance impact vs the same algorithm implementation in Scala. Has anyone done tests to measure the difference? Anecdotally I've heard Python is a 40% slowdown but that's entirely

Re: Spark - ready for prime time?

2014-04-13 Thread Andrew Ash
once, so as long as I get it done (even if it's a more manual process than it should be) is ok. Hope that helps! Andrew On Sun, Apr 13, 2014 at 4:33 PM, Jim Blomo jim.bl...@gmail.com wrote: On Thu, Apr 10, 2014 at 12:24 PM, Andrew Ash and...@andrewash.com wrote: The biggest issue I've come

Re: Huge matrix

2014-04-11 Thread Andrew Ash
The naive way would be to put all the users and their attributes into an RDD, then cartesian product that with itself. Run the similarity score on every pair (1M * 1M = 1T scores), map to (user, (score, otherUser)) and take the .top(k) for each user. I doubt that you'll be able to take this
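
Sketched out with toy data and a toy dot-product score (purely illustrative; as noted above, this brute-force approach explodes quadratically):

    import org.apache.spark.SparkContext._   // pair-RDD operations

    val users = sc.parallelize(Seq(
      ("alice", Array(1.0, 0.0)),
      ("bob",   Array(0.5, 0.5)),
      ("carol", Array(0.0, 1.0))))

    val score = (a: Array[Double], b: Array[Double]) =>
      a.zip(b).map { case (x, y) => x * y }.sum   // placeholder similarity function

    val k = 10
    val topK = users.cartesian(users)
      .filter { case ((u1, _), (u2, _)) => u1 != u2 }
      .map    { case ((u1, f1), (u2, f2)) => (u1, (score(f1, f2), u2)) }
      .groupByKey()
      .mapValues(_.toSeq.sortBy(-_._1).take(k))   // top-k most similar users per user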

Re: Spark - ready for prime time?

2014-04-10 Thread Andrew Ash
The biggest issue I've come across is that the cluster is somewhat unstable when under memory pressure. Meaning that if you attempt to persist an RDD that's too big for memory, even with MEMORY_AND_DISK, you'll often still get OOMs. I had to carefully modify some of the space tuning parameters

Re: trouble with join on large RDDs

2014-04-09 Thread Andrew Ash
A JVM can easily be limited in how much memory it uses with the -Xmx parameter, but Python doesn't have memory limits built in in such a first-class way. Maybe the memory limits aren't making it to the python executors. What was your SPARK_MEM setting? The JVM below seems to be using 603201

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
For 1, persist can be used to save an RDD to disk using the various persistence levels. Once a persistence level is set on an RDD, when that RDD is evaluated it's saved to memory/disk/elsewhere so that it can be re-used. It's applied to that RDD, so that subsequent uses of the RDD can use the
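
A minimal sketch of that first point (the input path is illustrative):

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs:///data/events")       // illustrative input
    lines.persist(StorageLevel.MEMORY_AND_DISK)          // keep it around; spill partitions that don't fit in memory
    lines.count()                                        // first action materializes (and caches) the RDD
    lines.filter(_.contains("ERROR")).count()            // subsequent uses read the persisted copy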

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
the entire RDD has been realized in memory. Is that correct? -Suren On Wed, Apr 9, 2014 at 9:25 AM, Andrew Ash and...@andrewash.com wrote: For 1, persist can be used to save an RDD to disk using the various persistence levels. When a persistency level is set on an RDD, when that RDD

Re: How does Spark handle RDD via HDFS ?

2014-04-09 Thread Andrew Ash
The typical way to handle that use case would be to join the 3 files together into one RDD and then do the factorization on that. There will definitely be network traffic during the initial join to get everything into one table, and after that there will likely be more network traffic for various

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
and then call persist on the resulting RDD. So I'm wondering if groupByKey is aware of the subsequent persist setting to use disk or just creates the Seq[V] in memory and only uses disk after that data structure is fully realized in memory. -Suren On Wed, Apr 9, 2014 at 9:46 AM, Andrew Ash

Re: Spark with SSL?

2014-04-08 Thread Andrew Ash
Not that I know of, but it would be great if that was supported. The way I typically handle security now is to put the Spark servers in their own subnet with strict inbound/outbound firewalls. On Tue, Apr 8, 2014 at 1:14 PM, kamatsuoka ken...@gmail.com wrote: Can Spark be configured to use

Re: Measuring Network Traffic for Spark Job

2014-04-08 Thread Andrew Ash
If you set up Spark's metrics reporting to write to the Ganglia backend that will give you a good idea of how much network/disk/CPU is being used and on what machines. https://spark.apache.org/docs/0.9.0/monitoring.html On Tue, Apr 8, 2014 at 12:57 PM, yxzhao yxz...@ualr.edu wrote: Hi All,
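
The sink is configured in conf/metrics.properties, along these lines (host/port values are illustrative; see the linked monitoring docs for the exact options, and note the Ganglia sink ships as a separate module in newer builds for licensing reasons):

    # conf/metrics.properties (illustrative)
    *.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
    *.sink.ganglia.host=ganglia.example.com
    *.sink.ganglia.port=8649
    *.sink.ganglia.period=10
    *.sink.ganglia.unit=seconds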

Re: How to execute a function from class in distributed jar on each worker node?

2014-04-08 Thread Andrew Ash
One thing you could do is create an RDD of [1,2,3] and set a partitioner that puts all three values on their own nodes. Then .foreach() over the RDD and call your function that will run on each node. Why do you need to run the function on every node? Is it some sort of setup code that needs to
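
A hedged sketch of that idea (it guarantees one task per partition, not strictly one partition per physical node, so treat it as best-effort):

    val numNodes = 3   // illustrative cluster size

    // One element per partition; foreachPartition then runs the setup wherever each task lands
    sc.parallelize(1 to numNodes, numNodes).foreachPartition { _ =>
      // node-local setup, e.g. loading a native library or warming a local cache
      println("initialized on " + java.net.InetAddress.getLocalHost.getHostName)
    }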

Redirect Incubator pages

2014-04-05 Thread Andrew Ash
I occasionally see links to pages in the spark.incubator.apache.org domain. Can we HTTP 301 redirect that whole domain to spark.apache.org now that the project has graduated? The content seems identical. That would also make the eventual decommission of the incubator domain much easier as usage

Re: Strange behavior of RDD.cartesian

2014-03-29 Thread Andrew Ash
Is this spark 0.9.0? Try setting spark.shuffle.spill=false There was a hash collision bug that's fixed in 0.9.1 that might cause you to have too few results in that join. Sent from my mobile phone On Mar 28, 2014 8:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Weird, how exactly are you

Re: distinct in data frame in spark

2014-03-25 Thread Andrew Ash
My thought would be to key by the first item in each array, then take just one array for each key. Something like the below: val v = sc.parallelize(Seq(Seq(1,2,3,4), Seq(1,5,2,3), Seq(2,3,4,5))); val col = 0; val output = v.keyBy(_(col)).reduceByKey((a, b) => a).values On Tue, Mar 25, 2014 at 1:21 AM, Chengi Liu

Re: Akka error with largish job (works fine for smaller versions)

2014-03-25 Thread Andrew Ash
Possibly one of your executors is in the middle of a large stop-the-world GC and doesn't respond to network traffic during that period? If you shared some information about how each node in your cluster is set up (heap size, memory, CPU, etc) that might help with debugging. Andrew On Mon, Mar

Re: distinct on huge dataset

2014-03-22 Thread Andrew Ash
FWIW I've seen correctness errors with spark.shuffle.spill on 0.9.0 and have it disabled now. The specific error behavior was that a join would consistently return one count of rows with spill enabled and another count with it disabled. Sent from my mobile phone On Mar 22, 2014 1:52 PM, Kane
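
For reference, the workaround amounts to this sketch (whether you want it depends on the shuffle fitting in memory):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("join-without-spill")
      .set("spark.shuffle.spill", "false")   // sidesteps the 0.9.0 spill hash-collision bug, at the cost of memory
    val sc = new SparkContext(conf)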

Re: Pyspark worker memory

2014-03-20 Thread Andrew Ash
Jim, I'm starting to document the heap size settings all in one place, which has been a confusion for a lot of my peers. Maybe you can take a look at this ticket? https://spark-project.atlassian.net/browse/SPARK-1264 On Wed, Mar 19, 2014 at 12:53 AM, Jim Blomo jim.bl...@gmail.com wrote: To

Re: Separating classloader management from SparkContexts

2014-03-19 Thread Andrew Ash
Hi Punya, This seems like a problem that the recently-announced job-server would likely have run into at one point. I haven't tested it yet, but I'd be interested to see what happens when two jobs in the job server have conflicting classes. Does the server correctly segregate each job's classes

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Andrew Ash
polkosity, have you seen the job server that Ooyala open sourced? I think it's very similar to what you're proposing with a REST API and re-using a SparkContext. https://github.com/apache/incubator-spark/pull/222 http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server On Mon, Mar
