Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Andrew Ash
polkosity, have you seen the job server that Ooyala open sourced? I think it's very similar to what you're proposing with a REST API and re-using a SparkContext. https://github.com/apache/incubator-spark/pull/222 http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server On Mon, Mar

Re: Separating classloader management from SparkContexts

2014-03-19 Thread Andrew Ash
Hi Punya, This seems like a problem that the recently-announced job-server would likely have run into at one point. I haven't tested it yet, but I'd be interested to see what happens when two jobs in the job server have conflicting classes. Does the server correctly segregate each job's classes

Re: Pyspark worker memory

2014-03-20 Thread Andrew Ash
Jim, I'm starting to document the heap size settings all in one place, which has been a confusion for a lot of my peers. Maybe you can take a look at this ticket? https://spark-project.atlassian.net/browse/SPARK-1264 On Wed, Mar 19, 2014 at 12:53 AM, Jim Blomo jim.bl...@gmail.com wrote: To

Re: distinct on huge dataset

2014-03-22 Thread Andrew Ash
FWIW I've seen correctness errors with spark.shuffle.spill on 0.9.0 and have it disabled now. The specific error behavior was that a join would consistently return one count of rows with spill enabled and another count with it disabled. Sent from my mobile phone On Mar 22, 2014 1:52 PM, Kane
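
A minimal sketch of disabling the spill path discussed above, assuming a Spark 0.9.x application where the setting is passed through SparkConf; the app name and master here are placeholders only:

    import org.apache.spark.{SparkConf, SparkContext}

    // Turn off shuffle spilling for the whole application (it defaults to true in 0.9.0).
    val conf = new SparkConf()
      .setMaster("local[2]")              // placeholder; normally supplied by your launcher
      .setAppName("join-without-spill")
      .set("spark.shuffle.spill", "false")
    val sc = new SparkContext(conf)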

Re: distinct in data frame in spark

2014-03-25 Thread Andrew Ash
My thought would be to key by the first item in each array, then take just one array for each key. Something like the below: v = sc.parallelize(Seq(Seq(1,2,3,4),Seq(1,5,2,3),Seq(2,3,4,5))) col = 0 output = v.keyBy(_(col)).reduceByKey((a,b) => a).values On Tue, Mar 25, 2014 at 1:21 AM, Chengi Liu

Re: Akka error with largish job (works fine for smaller versions)

2014-03-25 Thread Andrew Ash
Possibly one of your executors is in the middle of a large stop-the-world GC and doesn't respond to network traffic during that period? If you shared some information about how each node in your cluster is set up (heap size, memory, CPU, etc) that might help with debugging. Andrew On Mon, Mar

Re: Strange behavior of RDD.cartesian

2014-03-29 Thread Andrew Ash
Is this spark 0.9.0? Try setting spark.shuffle.spill=false There was a hash collision bug that's fixed in 0.9.1 that might cause you to have too few results in that join. Sent from my mobile phone On Mar 28, 2014 8:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Weird, how exactly are you

Redirect Incubator pages

2014-04-05 Thread Andrew Ash
I occasionally see links to pages in the spark.incubator.apache.org domain. Can we HTTP 301 redirect that whole domain to spark.apache.org now that the project has graduated? The content seems identical. That would also make the eventual decommission of the incubator domain much easier as usage

Re: Spark with SSL?

2014-04-08 Thread Andrew Ash
Not that I know of, but it would be great if that was supported. The way I typically handle security now is to put the Spark servers in their own subnet with strict inbound/outbound firewalls. On Tue, Apr 8, 2014 at 1:14 PM, kamatsuoka ken...@gmail.com wrote: Can Spark be configured to use

Re: Measuring Network Traffic for Spark Job

2014-04-08 Thread Andrew Ash
If you set up Spark's metrics reporting to write to the Ganglia backend that will give you a good idea of how much network/disk/CPU is being used and on what machines. https://spark.apache.org/docs/0.9.0/monitoring.html On Tue, Apr 8, 2014 at 12:57 PM, yxzhao yxz...@ualr.edu wrote: Hi All,

Re: How to execute a function from class in distributed jar on each worker node?

2014-04-08 Thread Andrew Ash
One thing you could do is create an RDD of [1,2,3] and set a partitioner that puts all three values on their own nodes. Then .foreach() over the RDD and call your function that will run on each node. Why do you need to run the function on every node? Is it some sort of setup code that needs to
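
A rough sketch of that idea, assuming a three-worker cluster and an existing SparkContext sc (as in spark-shell); setupFunction stands in for the caller's own per-node code:

    // Placeholder for whatever needs to run once per node.
    def setupFunction(): Unit =
      println("initializing on " + java.net.InetAddress.getLocalHost)

    // One element per worker, in as many partitions, so each element is likely
    // (though not strictly guaranteed) to be scheduled on a different executor.
    sc.parallelize(1 to 3, numSlices = 3).foreach(_ => setupFunction())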

Re: trouble with join on large RDDs

2014-04-09 Thread Andrew Ash
A JVM can easily be limited in how much memory it uses with the -Xmx parameter, but Python doesn't have memory limits built in in such a first-class way. Maybe the memory limits aren't making it to the python executors. What was your SPARK_MEM setting? The JVM below seems to be using 603201

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
For 1, persist can be used to save an RDD to disk using the various persistence levels. When a persistency level is set on an RDD, when that RDD is evaluated it's saved to memory/disk/elsewhere so that it can be re-used. It's applied to that RDD, so that subsequent uses of the RDD can use the

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
the entire RDD has been realized in memory. Is that correct? -Suren On Wed, Apr 9, 2014 at 9:25 AM, Andrew Ash and...@andrewash.com wrote: For 1, persist can be used to save an RDD to disk using the various persistence levels. When a persistency level is set on an RDD, when that RDD

Re: How does Spark handle RDD via HDFS ?

2014-04-09 Thread Andrew Ash
The typical way to handle that use case would be to join the 3 files together into one RDD and then do the factorization on that. There will definitely be network traffic during the initial join to get everything into one table, and after that there will likely be more network traffic for various

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
and then call persist on the resulting RDD. So I'm wondering if groupByKey is aware of the subsequent persist setting to use disk or just creates the Seq[V] in memory and only uses disk after that data structure is fully realized in memory. -Suren On Wed, Apr 9, 2014 at 9:46 AM, Andrew Ash

Re: Spark - ready for prime time?

2014-04-10 Thread Andrew Ash
The biggest issue I've come across is that the cluster is somewhat unstable when under memory pressure. Meaning that if you attempt to persist an RDD that's too big for memory, even with MEMORY_AND_DISK, you'll often still get OOMs. I had to carefully modify some of the space tuning parameters

Re: Huge matrix

2014-04-11 Thread Andrew Ash
The naive way would be to put all the users and their attributes into an RDD, then cartesian product that with itself. Run the similarity score on every pair (1M * 1M = 1T scores), map to (user, (score, otherUser)) and take the .top(k) for each user. I doubt that you'll be able to take this
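
A very rough sketch of that naive cartesian approach with toy data, assuming an existing SparkContext sc; userAttrs, similarity, and k are illustrative names and the similarity function is only a stand-in:

    import org.apache.spark.SparkContext._

    val k = 2
    val userAttrs = sc.parallelize(Seq(
      ("alice", Array(1.0, 0.0)),
      ("bob",   Array(0.5, 0.5)),
      ("carol", Array(0.0, 1.0))))

    // Stand-in for a real cosine similarity score.
    def similarity(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (x, y) => x * y }.sum

    val topKPerUser = userAttrs.cartesian(userAttrs)
      .filter { case ((u, _), (v, _)) => u != v }                      // drop self-pairs
      .map { case ((u, ua), (v, va)) => (u, (similarity(ua, va), v)) } // (user, (score, otherUser))
      .groupByKey()
      .mapValues(_.toSeq.sortBy(-_._1).take(k))                        // "top(k)" per user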

Re: Spark - ready for prime time?

2014-04-13 Thread Andrew Ash
once, so as long as I get it done (even if it's a more manual process than it should be) is ok. Hope that helps! Andrew On Sun, Apr 13, 2014 at 4:33 PM, Jim Blomo jim.bl...@gmail.com wrote: On Thu, Apr 10, 2014 at 12:24 PM, Andrew Ash and...@andrewash.com wrote: The biggest issue I've come

Re: Incredible slow iterative computation

2014-04-14 Thread Andrew Ash
A lot of your time is being spent in garbage collection (second image). Maybe your dataset doesn't easily fit into memory? Can you reduce the number of new objects created in myFun? How big are your heap sizes? Another observation is that in the 4th image some of your RDDs are massive and some

Re: How to cogroup/join pair RDDs with different key types?

2014-04-14 Thread Andrew Ash
Are your IPRanges all on nice, even CIDR-format ranges? E.g. 192.168.0.0/16 or 10.0.0.0/8? If the range is always an even subnet mask and not split across subnets, I'd recommend flatMapping the ipToUrl RDD to (IPRange, String) and then joining the two RDDs. The expansion would be at most 32x if

Scala vs Python performance differences

2014-04-14 Thread Andrew Ash
Hi Spark users, I've always done all my Spark work in Scala, but occasionally people ask about Python and its performance impact vs the same algorithm implementation in Scala. Has anyone done tests to measure the difference? Anecdotally I've heard Python is a 40% slowdown but that's entirely

Re: How to cogroup/join pair RDDs with different key types?

2014-04-16 Thread Andrew Ash
a mapPartitions() operation to let you do whatever you want with a partition. What I need is a way to get my hands on two partitions at once, each from different RDDs. Any ideas? Thanks, Roger On Mon, Apr 14, 2014 at 5:45 PM, Andrew Ash and...@andrewash.com wrote: Are your IPRanges all

Re: Ooyala Server - plans to merge it into Apache ?

2014-04-20 Thread Andrew Ash
The homepage for Ooyala's job server is here: https://github.com/ooyala/spark-jobserver They decided (I think with input from the Spark team) that it made more sense to keep the jobserver in a separate repository for now. Andrew On Fri, Apr 18, 2014 at 5:42 AM, Azuryy Yu azury...@gmail.com

Re: Spark on Yarn or Mesos?

2014-04-27 Thread Andrew Ash
That thread was mostly about benchmarking YARN vs standalone, and the results are what I'd expect -- spinning up a Spark cluster on demand through YARN has higher startup latency than using a standalone cluster, where the JVMs are already initialized and ready. Given that there's a lot more

Re: Spark on Yarn or Mesos?

2014-04-27 Thread Andrew Ash
, at 12:08 PM, Andrew Ash and...@andrewash.com wrote: That thread was mostly about benchmarking YARN vs standalone, and the results are what I'd expect -- spinning up a Spark cluster on demand through YARN has higher startup latency than using a standalone cluster, where the JVMs are already

Re: launching concurrent jobs programmatically

2014-04-28 Thread Andrew Ash
For the second question, you can submit multiple jobs through the same SparkContext via different threads and this is a supported way of interacting with Spark. From the documentation: Second, *within* each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they
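
A minimal sketch of that pattern, assuming an existing SparkContext sc; the two actions below are arbitrary examples submitted from separate threads:

    val rdd = sc.parallelize(1 to 1000000)

    val t1 = new Thread(new Runnable {
      def run(): Unit = println("sum   = " + rdd.map(_.toLong).reduce(_ + _))
    })
    val t2 = new Thread(new Runnable {
      def run(): Unit = println("evens = " + rdd.filter(_ % 2 == 0).count())
    })
    t1.start(); t2.start()   // both jobs now run concurrently on the same context
    t1.join(); t2.join()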

Re: Equally weighted partitions in Spark

2014-05-01 Thread Andrew Ash
The problem is that equally-sized partitions take variable time to complete based on their contents? Sent from my mobile phone On May 1, 2014 8:31 AM, deenar.toraskar deenar.toras...@db.com wrote: Hi I am using Spark to distribute computationally intensive tasks across the cluster. Currently

Re: Equally weighted partitions in Spark

2014-05-02 Thread Andrew Ash
Deenar, I haven't heard of any activity to do partitioning in that way, but it does seem more broadly valuable. On Fri, May 2, 2014 at 10:15 AM, deenar.toraskar deenar.toras...@db.comwrote: I have equal sized partitions now, but I want the RDD to be partitioned such that the partitions are

Re: Incredible slow iterative computation

2014-05-02 Thread Andrew Ash
= {}) // Force evaluation oldRdd.unpersist(true) According to my usage pattern I tried not unpersisting the intermediate RDDs (i.e. oldRdd) but nothing changed. Any hints? How could I debug this? 2014-04-14 12:55 GMT+02:00 Andrew Ash and...@andrewash.com: A lot of your time is being

Re: what's local[n]

2014-05-03 Thread Andrew Ash
Hi Weide, The answer to your first question about local[2] can be found in the Running the Examples and Shell section of https://spark.apache.org/docs/latest/ Note that all of the sample programs take a master parameter specifying the cluster URL to connect to. This can be a URL for a

Re: Spark's behavior

2014-05-03 Thread Andrew Ash
Hi Eduardo, Yep those machines look pretty well synchronized at this point. Just wanted to throw that out there and eliminate it as a possible source of confusion. Good luck on continuing the debugging! Andrew On Sat, May 3, 2014 at 11:59 AM, Eduardo Costa Alfaia e.costaalf...@unibs.it

Re: Dead lock running multiple Spark jobs on Mesos

2014-05-13 Thread Andrew Ash
Are you setting a core limit with spark.cores.max? If you don't, in coarse mode each Spark job uses all available cores on Mesos and doesn't let them go until the job is terminated. At which point the other job can access the cores. https://spark.apache.org/docs/latest/running-on-mesos.html --
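
A one-line sketch of capping the cores a single job grabs in coarse-grained Mesos mode, so a second job can get resources too; the value 8 is only an example:

    import org.apache.spark.SparkConf

    // Without this cap, a coarse-mode job holds every offered core until it exits.
    val conf = new SparkConf().setAppName("capped-job").set("spark.cores.max", "8")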

Re: Spark unit testing best practices

2014-05-14 Thread Andrew Ash
There's an undocumented mode that looks like it simulates a cluster: SparkContext.scala: // Regular expression for simulating a Spark cluster of [N, cores, memory] locally val LOCAL_CLUSTER_REGEX = """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r Can you try running your tests
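
A sketch of using that master string in a test, reading the regex as local-cluster[numExecutors, coresPerExecutor, memoryPerExecutorMB]; note this mode launches separate executor JVMs, so it generally needs a built Spark distribution available, as in Spark's own suites:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local-cluster[2,1,512]", "my-test")   // 2 executors, 1 core, 512 MB each
    try {
      assert(sc.parallelize(1 to 100, 4).count() == 100)
    } finally {
      sc.stop()
    }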

Re: Reading from .bz2 files with Spark

2014-05-16 Thread Andrew Ash
, 2014 at 9:19 PM, Andrew Ash and...@andrewash.com wrote: Hi all, Is anyone reading and writing to .bz2 files stored in HDFS from Spark with success? I'm finding the following results on a recent commit (756c96 from 24hr ago) and CDH 4.4.0: Works: val r = sc.textFile(/user/aa

Re: What is the difference between a Spark Worker and a Spark Slave?

2014-05-16 Thread Andrew Ash
They are different terminology for the same thing and should be interchangeable. On Fri, May 16, 2014 at 2:02 PM, Robert James srobertja...@gmail.comwrote: What is the difference between a Spark Worker and a Spark Slave?

Re: File list read into single RDD

2014-05-18 Thread Andrew Ash
Spark's sc.textFile() (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456) method delegates to sc.hadoopFile(), which uses Hadoop's

Re: unsubscribe

2014-05-18 Thread Andrew Ash
Hi Shangyu (and everyone else looking to unsubscribe!), If you'd like to get off this mailing list, please send an email to user-unsubscribe@spark.apache.org, not the regular user@spark.apache.org list. How to use the Apache mailing list infrastructure is documented here:

Re: unsubscribe

2014-05-19 Thread Andrew Ash
? I know many people might not like it, but maybe the list messages should have a footer with this administrative info (even if it's just a link to the archive page)? On Sun, May 18, 2014 at 1:49 PM, Andrew Ash and...@andrewash.com wrote: If you'd like to get off this mailing list, please send

Re: How to run the SVM and LogisticRegression

2014-05-19 Thread Andrew Ash
Hi yxzhao, Those are branches in the source code git repository. You can get to them with git checkout branch-1.0 once you've cloned the git repository. Cheers, Andrew On Mon, May 19, 2014 at 8:30 PM, yxzhao yxz...@ualr.edu wrote: Thanks Xiangrui, Sorry I am new for Spark, could

Re: Reading from .bz2 files with Spark

2014-05-19 Thread Andrew Ash
with multiple cores. 2) BZip2 files are big enough or minPartitions is large enough when you load the file via sc.textFile(), so that one worker has more than one task. Best, Xiangrui On Fri, May 16, 2014 at 4:06 PM, Andrew Ash and...@andrewash.com wrote: Hi Xiangrui, // FYI I'm

Re: Problem when sorting big file

2014-05-19 Thread Andrew Ash
Is your RDD of Strings? If so, you should make sure to use the Kryo serializer instead of the default Java one. It stores strings as UTF8 rather than Java's default UTF16 representation, which can save you half the memory usage in the right situation. Try setting the persistence level on the
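
A sketch of the two suggestions together (Kryo plus serialized storage); the class and storage-level names are standard Spark, while the sample data and app setup are only illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setMaster("local[2]").setAppName("kryo-strings")   // placeholders
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    val lines = sc.parallelize(Seq("some", "large", "collection", "of", "strings"))
    lines.persist(StorageLevel.MEMORY_ONLY_SER)   // cache the serialized (Kryo/UTF8) form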

Re: Spark and Hadoop

2014-05-20 Thread Andrew Ash
Hi Puneet, If you're not going to read/write data in HDFS from your Spark cluster, then it doesn't matter which one you download. Just go with Hadoop 2 as that's more likely to connect to an HDFS cluster in the future if you ever do decide to use HDFS because it's the newer APIs. Cheers, Andrew

Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-20 Thread Andrew Ash
If the distribution of the keys in your groupByKey is skewed (some keys appear way more often than others) you should consider modifying your job to use reduceByKey instead wherever possible. On May 20, 2014 12:53 PM, Jon Keebler jkeeble...@gmail.com wrote: So we upped the spark.akka.frameSize
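
An illustrative before/after of that swap, assuming an existing SparkContext sc and a simple per-key count:

    import org.apache.spark.SparkContext._

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 1), ("a", 1)))

    // groupByKey pulls every value for a key into one task before counting:
    // val counts = pairs.groupByKey().mapValues(_.size)

    // reduceByKey combines locally before the shuffle, which helps with skewed keys:
    val counts = pairs.reduceByKey(_ + _)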

Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Andrew Ash
Here's the 1.0.0rc9 version of the docs: https://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/running-on-mesos.html I refreshed them with the goal of steering users more towards prebuilt packages than relying on compiling from source plus improving overall formatting and clarity, but not

Re: ExternalAppendOnlyMap: Spilling in-memory map

2014-05-21 Thread Andrew Ash
Hi Mohit, The log line about the ExternalAppendOnlyMap is more of a symptom of slowness than causing slowness itself. The ExternalAppendOnlyMap is used when a shuffle is causing too much data to be held in memory. Rather than OOM'ing, Spark writes the data out to disk in a sorted order and

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Andrew Ash
One thing you can try is to pull each file out of S3 and decompress with gzip -d to see if it works. I'm guessing there's a corrupted .gz file somewhere in your path glob. Andrew On Wed, May 21, 2014 at 12:40 PM, Michael Cutler mich...@tumra.com wrote: Hi Nick, Which version of Hadoop are

Re: ExternalAppendOnlyMap: Spilling in-memory map

2014-05-22 Thread Andrew Ash
at 8:36 PM, Andrew Ash and...@andrewash.com wrote: Hi Mohit, The log line about the ExternalAppendOnlyMap is more of a symptom of slowness than causing slowness itself. The ExternalAppendOnlyMap is used when a shuffle is causing too much data to be held in memory. Rather than OOM'ing, Spark

Re: Comprehensive Port Configuration reference?

2014-05-23 Thread Andrew Ash
Hi everyone, I've also been interested in better understanding what ports are used where and the direction the network connections go. I've observed a running cluster and read through code, and came up with the below documentation addition. https://github.com/apache/spark/pull/856 Scott and

Re: Computing cosine similiarity using pyspark

2014-05-23 Thread Andrew Ash
Hi Jamal, I don't believe there are pre-written algorithms for Cosine similarity or Pearson correlation in PySpark that you can re-use. If you end up writing your own implementation of the algorithm though, the project would definitely appreciate if you shared that code back with the project for

Re: Dead lock running multiple Spark jobs on Mesos

2014-05-25 Thread Andrew Ash
. Martin On 13.05.2014 08:48, Andrew Ash wrote: Are you setting a core limit with spark.cores.max? If you don't, in coarse mode each Spark job uses all available cores on Mesos and doesn't let them go until the job is terminated. At which point the other job can access the cores. https

Re: problem about broadcast variable in iteration

2014-05-25 Thread Andrew Ash
Hi Randy, In Spark 1.0 there was a lot of work done to allow unpersisting data that's no longer needed. See the below pull request. Try running kvGlobal.unpersist() on line 11 before the re-broadcast of the next variable to see if you can cut the dependency there.
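
A hedged sketch of the unpersist-then-rebroadcast pattern (Broadcast.unpersist() is the Spark 1.0+ API referenced above), assuming an existing SparkContext sc; the map and its per-iteration update are stand-ins for the original poster's data:

    var kv: Map[Int, Double] = Map(1 -> 0.0, 2 -> 0.0)
    var kvGlobal = sc.broadcast(kv)

    for (i <- 1 to 10) {
      // ... use kvGlobal.value inside RDD operations for this iteration ...
      kv = kv.mapValues(_ + 1.0).toMap   // stand-in for the real per-iteration update
      kvGlobal.unpersist()               // release the old broadcast before replacing it
      kvGlobal = sc.broadcast(kv)        // re-broadcast the new state
    }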

Re: KryoSerializer Exception

2014-05-25 Thread Andrew Ash
Hi Andrea, What version of Spark are you using? There were some improvements in how Spark uses Kryo in 0.9.1 and to-be 1.0 that I would expect to improve this. Also, can you share your registrator's code? Another possibility is that Kryo can have some difficulty serializing very large objects.

Re: Comprehensive Port Configuration reference?

2014-05-25 Thread Andrew Ash
it aligns! Jacob Jacob D. Eisinger IBM Emerging Technologies jeis...@us.ibm.com - (512) 286-6075 Andrew Ash ---05/23/2014 10:30:58 AM---Hi everyone, I've also

Re: Working with Avro Generic Records in the interactive scala shell

2014-05-27 Thread Andrew Ash
Also see this context from February. We started working with Chill to get Avro records automatically registered with Kryo. I'm not sure the final status, but from the Chill PR #172 it looks like this might be much less friction than before. Issue we filed:

Re: K-nearest neighbors search in Spark

2014-05-27 Thread Andrew Ash
Hi Carter, In Spark 1.0 there will be an implementation of k-means available as part of MLLib. You can see the documentation for that below (until 1.0 is fully released). https://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/mllib-clustering.html Maybe diving into the source here will help

Re: WebUI's Application count doesn't get updated

2014-06-03 Thread Andrew Ash
Your applications are probably not connecting to your existing cluster and instead running in local mode. Are you passing the master URL to the SparkPi application? Andrew On Tue, Jun 3, 2014 at 12:30 AM, MrAsanjar . afsan...@gmail.com wrote: - HI all, - Application running and

Re: How to create RDDs from another RDD?

2014-06-03 Thread Andrew Ash
current conclusion is that the best option would be to roll an own saveHdfsFile(...) Would you agree? -greetz, Gerard. [1] http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job On Mon, Jun 2, 2014 at 11:44 PM, Andrew Ash and...@andrewash.com wrote

Re: Error related to serialisation in spark streaming

2014-06-03 Thread Andrew Ash
Hi Mayur, is that closure cleaning a JVM issue or a Spark issue? I'm used to thinking of closure cleaner as something Spark built. Do you have somewhere I can read more about this? On Tue, Jun 3, 2014 at 12:47 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: So are you using Java 7 or 8. 7

Re: is there any easier way to define a custom RDD in Java

2014-06-04 Thread Andrew Ash
Just curious, what do you want your custom RDD to do that the normal ones don't? On Wed, Jun 4, 2014 at 6:30 AM, bluejoe2008 bluejoe2...@gmail.com wrote: hi, folks, is there any easier way to define a custom RDD in Java? I am wondering if I have to define a new java class which

Re: Error related to serialisation in spark streaming

2014-06-04 Thread Andrew Ash
nilmish, To confirm your code is using kryo, go to the web ui of your application (defaults to :4040) and look at the environment tab. If your serializer settings are there then things should be working properly. I'm not sure how to confirm that it works against typos in the setting, but you

Re: How to change default storage levels

2014-06-04 Thread Andrew Ash
You can change storage level on an individual RDD with .persist(StorageLevel.MEMORY_AND_DISK), but I don't think you can change what the default persistency level is for RDDs. Andrew On Wed, Jun 4, 2014 at 1:52 AM, Salih Kardan karda...@gmail.com wrote: Hi I'm using Spark 0.9.1 and Shark
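
The per-RDD override looks like this (the path is made up), while the default level itself isn't configurable:

    import org.apache.spark.storage.StorageLevel

    val events = sc.textFile("hdfs:///data/events")   // illustrative path
    events.persist(StorageLevel.MEMORY_AND_DISK)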

Re: Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread Andrew Ash
When you group by IP address in step 1 to this: (ip1,(lat1,lon1),(lat2,lon2)) (ip2,(lat3,lon3),(lat4,lon4)) How many lat/lon locations do you expect for each IP address? avg and max are interesting. Andrew On Wed, Jun 4, 2014 at 5:29 AM, Oleg Proudnikov

Re: Setting executor memory when using spark-shell

2014-06-05 Thread Andrew Ash
Hi Oleg, I set the size of my executors on a standalone cluster when using the shell like this: ./bin/spark-shell --master $MASTER --total-executor-cores $CORES_ACROSS_CLUSTER --driver-java-options -Dspark.executor.memory=$MEMORY_PER_EXECUTOR It doesn't seem particularly clean, but it works.

Re: Setting executor memory when using spark-shell

2014-06-05 Thread Andrew Ash
-Dspark.executor.memory=$MEMORY_PER_EXECUTOR I get bad option: '--driver-java-options' There must be something different in my setup. Any ideas? Thank you again, Oleg On 5 June 2014 22:28, Andrew Ash and...@andrewash.com wrote: Hi Oleg, I set the size of my executors on a standalone cluster when

Re: Join : Giving incorrect result

2014-06-05 Thread Andrew Ash
Hi Ajay, Can you please try running the same code with spark.shuffle.spill=false and see if the numbers turn out correctly? That parameter controls whether or not the buggy code that Matei fixed in ExternalAppendOnlyMap is used. FWIW I saw similar issues in 0.9.0 but no longer in 0.9.1 after I

Re: Using Spark on Data size larger than Memory size

2014-06-05 Thread Andrew Ash
Hi Roger, You should be able to sort within partitions using the rdd.mapPartitions() method, and that shouldn't require holding all data in memory at once. It does require holding the entire partition in memory though. Do you need the partition to never be held in memory all at once? As far as
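
A small sketch of the per-partition sort, assuming an existing SparkContext sc; only one partition at a time has to fit in memory on an executor:

    val rdd = sc.parallelize(Seq(3, 1, 4, 1, 5, 9, 2, 6), numSlices = 2)

    // Materialize each partition, sort it, and hand back an iterator.
    val sortedWithinPartitions = rdd.mapPartitions(it => it.toArray.sorted.iterator)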

Re: Comprehensive Port Configuration reference?

2014-06-09 Thread Andrew Ash
Andrew, This is a standalone cluster. And, yes, if my understanding of Spark terminology is correct, you are correct about the port ownerships. Jacob Jacob D. Eisinger IBM Emerging Technologies jeis...@us.ibm.com - (512) 286-6075

Re: Memory footprint of Calliope: Spark - Cassandra writes

2014-06-17 Thread Andrew Ash
Gerard, Strings in particular are very inefficient because they're stored in a two-byte format by the JVM. If you use the Kryo serializer and have use StorageLevel.MEMORY_ONLY_SER then Kryo stores Strings in UTF8, which for ASCII-like strings will take half the space. Andrew On Tue, Jun 17,

Re: Wildcard support in input path

2014-06-17 Thread Andrew Ash
In Spark you can use the normal globs supported by Hadoop's FileSystem, which are documented here: http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path) On Wed, Jun 18, 2014 at 12:09 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:
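
For example (the path is made up), Spark hands the pattern straight to Hadoop's FileSystem.globStatus, so anything in the linked Javadoc works:

    val logs = sc.textFile("hdfs:///logs/2014-06-1[0-7]/*.log")   // illustrative glob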

Re: Spark is now available via Homebrew

2014-06-18 Thread Andrew Ash
What's the advantage of Apache maintaining the brew installer vs users? Apache handling it means more work on this dev team, but probably a better experience for brew users. Just wanted to weigh pros/cons before committing to support this installation method. Andrew On Wed, Jun 18, 2014 at

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-18 Thread Andrew Ash
Wait, so the file only has four lines and the job is running out of heap space? Can you share the code you're running that does the processing? I'd guess that you're doing some intense processing on every line but just writing parsed case classes back to disk sounds very lightweight. On Wed,

Re: 1.0.1 release plan

2014-06-20 Thread Andrew Ash
Sounds good. Mingyu and I are waiting on 1.0.1 to get the fix for the below issues without running a patched version of Spark: https://issues.apache.org/jira/browse/SPARK-1935 -- commons-codec version conflicts for client applications https://issues.apache.org/jira/browse/SPARK-2043 --

Re: RDD join: composite keys

2014-07-03 Thread Andrew Ash
Hi Sameer, If you set those two IDs to be a Tuple2 in the key of the RDD, then you can join on that tuple. Example: val rdd1: RDD[Tuple3[Int, Int, String]] = ... val rdd2: RDD[Tuple3[Int, Int, String]] = ... val resultRDD = rdd1.map(k => ((k._1, k._2), k._3)).join( rdd2.map(k =
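
A hedged reconstruction of that truncated snippet with toy data so it runs in spark-shell; the second map is assumed to mirror the first:

    import org.apache.spark.SparkContext._

    val rdd1 = sc.parallelize(Seq((1, 10, "a"), (2, 20, "b")))
    val rdd2 = sc.parallelize(Seq((1, 10, "x"), (3, 30, "y")))

    // Key both RDDs by the (Int, Int) composite, then join on that tuple.
    val resultRDD = rdd1.map(k => ((k._1, k._2), k._3))
      .join(rdd2.map(k => ((k._1, k._2), k._3)))
    // resultRDD.collect() => Array(((1,10),(a,x)))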

Re: reading compress lzo files

2014-07-06 Thread Andrew Ash
Hi Nick, The cluster I was working on in those linked messages was a private data center cluster, not on EC2. I'd imagine that the setup would be pretty similar, but I'm not familiar with the EC2 init scripts that Spark uses. Also I upgraded that cluster to 1.0 recently and am continuing to use

Re: hdfs replication on saving RDD

2014-07-15 Thread Andrew Ash
In general it would be nice to be able to configure replication on a per-job basis. Is there a way to do that without changing the config values in the Hadoop conf/ directory between jobs? Maybe by modifying OutputFormats or the JobConf ? On Mon, Jul 14, 2014 at 11:12 PM, Matei Zaharia

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Andrew Ash
Hi Nan, Great digging in -- that makes sense to me for when a job is producing some output handled by Spark like a .count or .distinct or similar. For the other part of the question, I'm also interested in side effects like an HDFS disk write. If one task is writing to an HDFS path and another

Re: How to map each line to (line number, line)?

2014-07-21 Thread Andrew Ash
I'm not sure if you guys ever picked a preferred method for doing this, but I just encountered it and came up with this method that's working reasonably well on a small dataset. It should be quite easily generalizable to non-String RDDs. def addRowNumber(r: RDD[String]): RDD[Tuple2[Long,String]]
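
The body of addRowNumber isn't shown above; a sketch of one common way to do it is to compute per-partition sizes first and then offset each partition (on Spark 1.0+, rdd.zipWithIndex() does the equivalent bookkeeping, yielding (line, index) pairs):

    import org.apache.spark.rdd.RDD

    def addRowNumber(r: RDD[String]): RDD[(Long, String)] = {
      // Count the lines in each partition, then derive each partition's starting index.
      val counts = r.mapPartitionsWithIndex { (i, it) => Iterator((i, it.size.toLong)) }
        .collect().sortBy(_._1).map(_._2)
      val offsets = counts.scanLeft(0L)(_ + _)

      r.mapPartitionsWithIndex { (i, it) =>
        it.zipWithIndex.map { case (line, j) => (offsets(i) + j, line) }
      }
    }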

Re: Configuring Spark Memory

2014-07-23 Thread Andrew Ash
Hi Martin, In standalone mode, each SparkContext you initialize gets its own set of executors across the cluster. So for example if you have two shells open, they'll each get two JVMs on each worker machine in the cluster. As far as the other docs, you can configure the total number of cores

Re: Spark 0.9.1 - saveAsTextFile() exception: _temporary doesn't exist!

2014-07-30 Thread Andrew Ash
Hi Oleg, Did you ever figure this out? I'm observing the same exception also in 0.9.1 and think it might be related to setting spark.speculation=true. My theory is that multiple attempts at the same task start, the first finishes and cleans up the _temporary directory, and then the second fails

Re: Spark: Could not load native gpl library

2014-08-07 Thread Andrew Ash
Hi Jikai, It looks like you're trying to run a Spark job on data that's stored in HDFS in .lzo format. Spark can handle this (I do it all the time), but you need to configure your Spark installation to know about the .lzo format. There are two parts to the hadoop lzo library -- the first is the

Re: How to use spark-cassandra-connector in spark-shell?

2014-08-07 Thread Andrew Ash
Yes, I've done it before. On Thu, Aug 7, 2014 at 10:18 PM, Gary Zhao garyz...@gmail.com wrote: Hello Is it possible to use spark-cassandra-connector in spark-shell? Thanks Gary

Re: How to use spark-cassandra-connector in spark-shell?

2014-08-07 Thread Andrew Ash
7, 2014 at 10:20 PM, Andrew Ash and...@andrewash.com wrote: Yes, I've done it before. On Thu, Aug 7, 2014 at 10:18 PM, Gary Zhao garyz...@gmail.com wrote: Hello Is it possible to use spark-cassandra-connector in spark-shell? Thanks Gary

Re: saveAsTextFiles file not found exception

2014-08-11 Thread Andrew Ash
I've also been seeing similar stacktraces on Spark core (not streaming) and have a theory it's related to spark.speculation being turned on. Do you have that enabled by chance? On Mon, Aug 11, 2014 at 8:10 AM, Chen Song chen.song...@gmail.com wrote: Bill Did you get this resolved somehow?

Re: saveAsTextFiles file not found exception

2014-08-11 Thread Andrew Ash
:13 AM, Andrew Ash and...@andrewash.com wrote: I've also been seeing similar stacktraces on Spark core (not streaming) and have a theory it's related to spark.speculation being turned on. Do you have that enabled by chance? On Mon, Aug 11, 2014 at 8:10 AM, Chen Song chen.song...@gmail.com

Re: saveAsTextFiles file not found exception

2014-08-12 Thread Andrew Ash
Hi Chen, Please see the bug I filed at https://issues.apache.org/jira/browse/SPARK-2984 with the FileNotFoundException on _temporary directory issue. Andrew On Mon, Aug 11, 2014 at 10:50 PM, Andrew Ash and...@andrewash.com wrote: Not sure which stalled HDFS client issue you're referring

Re: set SPARK_LOCAL_DIRS issue

2014-08-12 Thread Andrew Ash
// assuming Spark 1.0 Hi Baoqiang, In my experience for the standalone cluster you need to set SPARK_WORKER_DIR not SPARK_LOCAL_DIRS to control where shuffle files are written. I think this is a documentation issue that could be improved, as

Re: SPARK_LOCAL_DIRS option

2014-08-13 Thread Andrew Ash
Hi Deb, If you don't have long-running Spark applications (those taking more than spark.worker.cleanup.appDataTtl) then the TTL-based cleaner is a good solution. If however you have a mix of long-running and short-running applications, then the TTL-based solution will fail. It will clean up

Re: Segmented fold count

2014-08-18 Thread Andrew Ash
What happens when a run of numbers is spread across a partition boundary? I think you might end up with two adjacent groups of the same value in that situation. On Mon, Aug 18, 2014 at 2:05 AM, Davies Liu dav...@databricks.com wrote: import itertools l = [1,1,1,2,2,3,4,4,5,1] gs =

Re: heterogeneous cluster hardware

2014-08-21 Thread Andrew Ash
I'm actually not sure the Spark+Mesos integration supports dynamically allocating memory (it does support dynamically allocating cores though). Has anyone here actually used Spark+Mesos on heterogeneous hardware and done dynamic memory allocation? My understanding is that each Spark executor

Re: heterogeneous cluster hardware

2014-08-21 Thread Andrew Ash
/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L114 - where Spark accepts sc.executorMemory of a resource offer, regardless of how much more memory was available On Thu, Aug 21, 2014 at 2:12 PM, Andrew Ash and...@andrewash.com

Re: Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-25 Thread Andrew Ash
Hi Patrick, For the spilling within on key work you mention might land in Spark 1.2, is that being tracked in https://issues.apache.org/jira/browse/SPARK-1823 or is there another ticket I should be following? Thanks! Andrew On Tue, Aug 5, 2014 at 3:39 PM, Patrick Wendell pwend...@gmail.com

Re: Out of memory on large RDDs

2014-08-26 Thread Andrew Ash
Hi Grega, Did you ever get this figured out? I'm observing the same issue in Spark 1.0.2. For me it was after 1.5hr of a large .distinct call, followed by a .saveAsTextFile() 14/08/26 20:57:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18500 14/08/26 20:57:43 INFO

Re: Multiple spark shell sessions

2014-09-05 Thread Andrew Ash
Hi Dhimant, We also cleaned up these needless warnings on port failover in Spark 1.1 -- see https://issues.apache.org/jira/browse/SPARK-1902 Andrew On Thu, Sep 4, 2014 at 7:38 AM, Dhimant dhimant84.jays...@gmail.com wrote: Thanks Yana, I am able to execute application and command via

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-15 Thread Andrew Ash
nicholas.cham...@gmail.com wrote: Andrew, This email was pretty helpful. I feel like this stuff should be summarized in the docs somewhere, or perhaps in a blog post. Do you know if it is? Nick On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash and...@andrewash.com wrote: The locality

Re: Questions about Spark speculation

2014-09-17 Thread Andrew Ash
Hi Nicolas, I've had suspicions about speculation causing problems on my cluster but don't have any hard evidence of it yet. I'm also interested in why it's turned off by default. On Tue, Sep 16, 2014 at 3:01 PM, Nicolas Mai nicolas@gmail.com wrote: Hi, guys My current project is using

Re: Adjacency List representation in Spark

2014-09-17 Thread Andrew Ash
Hi Harsha, You could look through the GraphX source to see the approach taken there for ideas in your own. I'd recommend starting at https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/Graph.scala#L385 to see the storage technique. Why do you want to avoid

Re: Spark and disk usage.

2014-09-17 Thread Andrew Ash
Hi Burak, Most discussions of checkpointing in the docs is related to Spark streaming. Are you talking about the sparkContext.setCheckpointDir()? What effect does that have? https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing On Wed, Sep 17, 2014 at 7:44 AM,

Re: Spark and disk usage.

2014-09-17 Thread Andrew Ash
Thanks for the info! Are there performance impacts with writing to HDFS instead of local disk? I'm assuming that's why ALS checkpoints every third iteration instead of every iteration. Also I can imagine that checkpointing should be done every N shuffles instead of every N operations (counting
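
For reference, the core (non-streaming) checkpointing being discussed is minimal to set up; a sketch assuming an existing SparkContext sc, with a made-up HDFS path and a toy iterative update:

    import org.apache.spark.SparkContext._

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // illustrative path

    var ranks = sc.parallelize(1 to 1000).map(i => (i % 10, i.toDouble))
    for (iter <- 1 to 9) {
      ranks = ranks.reduceByKey(_ + _)
        .flatMap { case (k, v) => Seq((k, v / 2), ((k + 1) % 10, v / 2)) }
      if (iter % 3 == 0) {          // every third iteration, like ALS, to truncate the lineage
        ranks.checkpoint()
        ranks.count()               // checkpoint data is written when an action runs this RDD
      }
    }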

Re: Spark and disk usage.

2014-09-21 Thread Andrew Ash
in Spark Streaming, and some MLlib algorithms. If you can help with the guide, I think it would be a nice feature to have! Burak - Original Message - From: Andrew Ash and...@andrewash.com To: Burak Yavuz bya...@stanford.edu Cc: Макар Красноперов connector@gmail.com, user user
