Re: Checkpoint Vs Cache

2014-05-02 Thread Chris Fregly
http://docs.sigmoidanalytics.com/index.php/Checkpoint_and_not_running_out_of_disk_space On Mon, Apr 14, 2014 at 2:43 AM, Cheng Lian wrote: > Checkpointed RDDs are materialized on disk, while cached RDDs are > materialized in memory. When memory is insufficient, cached RDD blocks (1 > block per

Reading and processing binary format files using spark

2014-05-02 Thread Chengi Liu
Hi, let's say I have millions of binary format files... Let's say I have this Java (or Python) library which reads and parses these binary formatted files. Say: import foo; f = foo.open(filename); header = f.get_header(); and some other methods. What I was thinking was to write a Hadoop input format

Re: string to int conversion

2014-05-02 Thread DB Tsai
You can drop the header in a CSV with rddData.mapPartitionsWithIndex((partitionIdx: Int, lines: Iterator[String]) => if (partitionIdx == 0) lines.drop(1) else lines) On May 2, 2014 6:02 PM, "SK" wrote: > 1) I have a csv file where one of the fields has integer data but it appears > as strings
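A standalone sketch of the pattern from this reply (plain Scala, no cluster needed, data made up for illustration): mapPartitionsWithIndex hands each partition its index, so the header line is dropped only from partition 0, where it lives.

```scala
// Simulate two partitions of a CSV read; partition 0 carries the header.
val partitions: Seq[(Int, Iterator[String])] = Seq(
  (0, Iterator("name,age", "alice,30", "bob,25")),
  (1, Iterator("carol,41"))
)

// The same function you would pass to rddData.mapPartitionsWithIndex:
// skip the first line only in partition 0, pass the rest through untouched.
val withoutHeader = partitions.flatMap { case (partitionIdx, lines) =>
  if (partitionIdx == 0) lines.drop(1) else lines
}
// withoutHeader == Seq("alice,30", "bob,25", "carol,41")
```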

string to int conversion

2014-05-02 Thread SK
1) I have a csv file where one of the fields has integer data but it appears as strings: "1", "3" etc. I tried using toInt to implicitly convert the strings to int after reading (field(3).toInt). But I got a NumberFormatException. So I defined my own conversion as follows, but I still get a NumberFo
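A hedged sketch of one likely cause and fix: if the field arrives as "1" with literal quote characters, toInt throws NumberFormatException. Stripping the quotes and wrapping the parse in Try (the helper name parseIntField is mine, not the poster's) survives malformed rows too.

```scala
import scala.util.Try

// Strip surrounding quote characters, then parse; None on any failure.
def parseIntField(raw: String): Option[Int] =
  Try(raw.trim.stripPrefix("\"").stripSuffix("\"").toInt).toOption
```

For example, parseIntField("\"3\"") yields Some(3), while a non-numeric field yields None instead of throwing.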

Re: YARN issues with resourcemanager.scheduler.address

2014-05-02 Thread zsterone
ok, we figured it out. It is a bit weird, but for some reason, the YARN_CONF_DIR and HADOOP_CONF_DIR did not propagate out. We do see it in the build classpath, but the remote machines don't seem to get it. So we added: export SPARK_YARN_USER_ENV="CLASSPATH=/hadoop/var/hadoop/conf/" and it seem

Crazy Kryo Exception

2014-05-02 Thread Soren Macbeth
Hello, I've been getting this rather crazy Kryo exception trying to run my Spark job: Exception in thread "main" org.apache.spark.SparkException: Job aborted: Exception while deserializing and fetching task: com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Can not set final

spark 0.9.1: ClassNotFoundException

2014-05-02 Thread SK
I am using Spark 0.9.1 in standalone mode. In the SPARK_HOME/examples/src/main/scala/org/apache/spark/ folder, I created a directory called "mycode" in which I have placed some standalone Scala code. I was able to compile it. I ran the code using: ./bin/run-example org.apache.spark.mycode.MyClass l

Seattle Spark Meetup Slides

2014-05-02 Thread Denny Lee
We’ve had some pretty awesome presentations at the Seattle Spark Meetup - here are the links to the various slides: Seattle Spark Meetup KickOff with DataBricks | Introduction to Spark with Matei Zaharia and Pat McDonough Learnings from Running Spark at Twitter sessions Ben Hindman’s Mesos for

Re: docker image build issue for spark 0.9.1

2014-05-02 Thread Weide Zhang
yes, the docker script is there inside spark source package. It already specifies the master and worker container to run in different docker containers. Mainly it is used for easy deployment and development in my scenario. On Fri, May 2, 2014 at 2:30 PM, Nicholas Chammas wrote: > Don't have a

Re: docker image build issue for spark 0.9.1

2014-05-02 Thread Nicholas Chammas
Don't have any tips for you, Weide, but I was just learning about Docker and it sounds very cool. Are you trying to build both master and worker containers that you can easily deploy to create a cluster? I'm interested in knowing how Docker is used in this case. Nick On Fri, May 2, 2014 at 5:1

docker image build issue for spark 0.9.1

2014-05-02 Thread Weide Zhang
Hi I tried to build docker image for spark 0.9.1 but get the following error. any one has experience resolving the issue ? The following packages have unmet dependencies: tzdata-java : Depends: tzdata (= 2012b-1) but 2013g-0ubuntu0.12.04 is to be installed E: Unable to correct problems, you have

Re: GraphX vertices and connected edges

2014-05-02 Thread Ankur Dave
Do you mean you want to obtain a list of adjacent edges for every vertex? A mapReduceTriplets followed by a join is the right way to do this. The join will be cheap because the original and derived vertices will share indices. There's a built-in function to do this for neighboring vertex propertie
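A sketch of the suggested approach against the GraphX API of that era (treat the exact calls and types as assumptions, and the graph itself as pre-existing): emit each edge to both of its endpoints with mapReduceTriplets, then leftJoin back onto the vertices so unconnected vertices still appear with an empty edge list.

```scala
import org.apache.spark.graphx._

// Assumes graph: Graph[String, Int] already exists.
// Step 1: send each edge to both endpoints and concatenate per vertex.
val adjacent: VertexRDD[Array[Edge[Int]]] =
  graph.mapReduceTriplets[Array[Edge[Int]]](
    triplet => {
      val e = Edge(triplet.srcId, triplet.dstId, triplet.attr)
      Iterator((triplet.srcId, Array(e)), (triplet.dstId, Array(e)))
    },
    (a, b) => a ++ b)

// Step 2: leftJoin is cheap because both sides share the same vertex index;
// vertices with no edges fall back to an empty array instead of vanishing.
val verticesWithEdges = graph.vertices.leftJoin(adjacent) {
  (id, attr, edgesOpt) => (attr, edgesOpt.getOrElse(Array.empty[Edge[Int]]))
}
```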

RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-02 Thread Jacob Eisinger
Howdy Andrew, I think I am running into the same issue [1] as you. It appears that Spark opens up dynamic / ephemeral [2] ports for each job on the shell and the workers. As you are finding out, this makes securing and managing the network for Spark very difficult. > Any idea how to restrict th

Invoke spark-shell without attempting to start the http server

2014-05-02 Thread Stephen Boesch
We have a Spark server already running. When spark-shell is invoked, it attempts to start a new HTTP server: spark.HttpServer: Starting HTTP Server. But that attempt results in a BindException due to the preexisting server: java.net.BindException: Address already in use What is the spar

Re: Incredible slow iterative computation

2014-05-02 Thread Andrew Ash
If you end up with a really long dependency tree between RDDs (like 100+) people have reported success with using the .checkpoint() method. This computes the RDD and then saves it, flattening the dependency tree. It turns out that having a really long RDD dependency graph causes serialization siz
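A minimal sketch of the checkpoint pattern described above, assuming a running SparkContext `sc` and a reachable checkpoint directory (the path and iteration counts are illustrative, not from the thread):

```scala
// checkpoint() materializes the RDD to stable storage and truncates its
// lineage, keeping long iterative chains from blowing up serialization size.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")

var rdd = sc.parallelize(1 to 1000)
for (i <- 1 to 200) {
  rdd = rdd.map(_ + 1)
  if (i % 50 == 0) {
    rdd.checkpoint() // flatten the dependency tree periodically
    rdd.count()      // force an action so the checkpoint is actually written
  }
}
```

Note that checkpoint() must be called before the action that materializes the RDD; caching the RDD first avoids computing it twice.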

RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-02 Thread Andrew Lee
Hi Yana, I did. I configured the port in spark-env.sh; the problem is not the driver port, which is fixed. It's the workers' ports, which are dynamic every time they are launched in the YARN container. :-( Any idea how to restrict the 'Workers' port range? Date: Fri, 2 May 2014 14:49:23 -040

Re: java.lang.ClassNotFoundException - spark on mesos

2014-05-02 Thread bo...@shopify.com
I have opened a PR for discussion on the apache/spark repository https://github.com/apache/spark/pull/620 There is certainly a classLoader problem in the way Mesos and Spark operate, I'm not sure what caused it to suddenly stop working so I'd like to open the discussion there -- View this messa

RE: another updateStateByKey question

2014-05-02 Thread Adrian Mocanu
Unfortunately, I’ve been able to have this happen only once: the first time I ran my test. Consecutive tests never showed it again. I will test some more and if it happens I will try to get more details. Thanks! -A From: Tathagata Das [mailto:tathagata.das1...@gmail.com] Sent: May-02-14 3:10 P

Re: another updateStateByKey question

2014-05-02 Thread Tathagata Das
Could be a bug. Can you share code with data that I can use to reproduce this? TD On May 2, 2014 9:49 AM, "Adrian Mocanu" wrote: > Has anyone else noticed that *sometimes* the same tuple calls the update > state function twice? > > I have 2 tuples with the same key in 1 RDD part of DStream: RDD[

Re: is it possible to initiate Spark jobs from Oozie?

2014-05-02 Thread Shivani Rao
I have mucked around with this a little bit. The first step to make this happen is to build a fat jar. I wrote a quick blog documenting my learning curve w.r.t. that. The next step is to schedule this as a java action. Since y

Re: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-02 Thread Yana Kadiyska
I think what you want to do is set spark.driver.port to a fixed port. On Fri, May 2, 2014 at 1:52 PM, Andrew Lee wrote: > Hi All, > > I encountered this problem when the firewall is enabled between the > spark-shell and the Workers. > > When I launch spark-shell in yarn-client mode, I notice th
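A sketch of the suggestion, grounded in the property named in this reply (the app name and port number are placeholders): pin the driver to a fixed port so a firewall rule can be written for it.

```scala
import org.apache.spark.SparkConf

// Fix the driver's listening port instead of letting Spark pick an
// ephemeral one; choose any open, firewall-permitted port.
val conf = new SparkConf()
  .setAppName("fixed-driver-port")
  .set("spark.driver.port", "51000")
```

The same property can be set in spark-env.sh or passed on the command line; as the follow-up messages note, this pins only the driver, not the dynamically assigned worker ports.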

Re: Spark: issues with running a sbt fat jar due to akka dependencies

2014-05-02 Thread Shivani Rao
Hello Stephen, My goal was to run spark on a cluster that already had spark and hadoop installed. So the right thing to do was to remove these dependencies in my spark build. I wrote a blog about it so that it might hel

GraphX vertices and connected edges

2014-05-02 Thread Kyle Ellrott
What is the most efficient way to get an RDD of GraphX vertices and their connected edges? Initially I thought I could use mapReduceTriplet, but I realized that would neglect vertices that aren't connected to anything. Would I have to do a mapReduceTriplet and then do a join with all of the vertices to p

Re: Equally weighted partitions in Spark

2014-05-02 Thread Andrew Ash
Deenar, I haven't heard of any activity to do partitioning in that way, but it does seem more broadly valuable. On Fri, May 2, 2014 at 10:15 AM, deenar.toraskar wrote: > I have equal sized partitions now, but I want the RDD to be partitioned > such > that the partitions are equally weighted by

spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-02 Thread Andrew Lee
Hi All, I encountered this problem when the firewall is enabled between the spark-shell and the Workers. When I launch spark-shell in yarn-client mode, I notice that Workers on the YARN containers are trying to talk to the driver (spark-shell), however, the firewall is not opened and caused time

Re: Task not serializable: collect, take

2014-05-02 Thread SK
Thank you very much. Making the trait serializable worked. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Task-not-serializable-collect-take-tp5193p5236.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Opinions stratosphere

2014-05-02 Thread Michael Malak
"looks like Spark outperforms Stratosphere fairly consistently in the experiments" There was one exception the paper noted, which was when memory resources were constrained. In that case, Stratosphere seemed to have degraded more gracefully than Spark, but the author did not explore it deeper.

another updateStateByKey question

2014-05-02 Thread Adrian Mocanu
Has anyone else noticed that sometimes the same tuple calls the update state function twice? I have 2 tuples with the same key in 1 RDD that is part of a DStream: RDD[ (a,1), (a,2) ] When the update function is called the first time, Seq[V] has data: 1, 2, which is correct: StateClass(3,2, ArrayBuffer(1, 2)) The

Re: Opinions stratosphere

2014-05-02 Thread Philip Ogren
Great reference! I just skimmed through the results without reading much of the methodology - but it looks like Spark outperforms Stratosphere fairly consistently in the experiments. It's too bad the data sources only range from 2GB to 8GB. Who knows if the apparent pattern would extend out

RE: Apache Spark is not building in Mac/Java 8

2014-05-02 Thread N . Venkata Naga Ravi
Thanks Prashant. The 1.0 RC version is working fine on my system. Let me explore further and get back to you. Thanks again, Ravi From: scrapco...@gmail.com Date: Fri, 2 May 2014 16:22:40 +0530 Subject: Re: Apache Spark is not building in Mac/Java 8 To: user@spark.apache.org I have pasted the link

Re: when to use broadcast variables

2014-05-02 Thread Prashant Sharma
I'd like to be corrected on this, but I am just trying to say small enough, on the order of a few hundred MBs. Imagine the size gets shipped to all nodes; it can be a GB but not GBs, and then it depends on the network too. Prashant Sharma On Fri, May 2, 2014 at 6:42 PM, Diana Carroll wrote: > Anyone hav

RE: range partitioner with updateStateByKey

2014-05-02 Thread Adrian Mocanu
I’d like to know both ways: arrival order and sort order -A From: Tathagata Das [mailto:tathagata.das1...@gmail.com] Sent: May-02-14 12:04 AM To: user@spark.apache.org Cc: u...@spark.incubator.apache.org Subject: Re: range partitioner with updateStateByKey Ordered by what? arrival order? sort or

Re: getting an error

2014-05-02 Thread Mayur Rustagi
Can you share your Spark web UI on 8080? Most likely your workers are not connected to your master. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Mon, Apr 28, 2014 at 3:13 PM, Joe L wrote: > Hi, while I was testing an e

Re: Scala Spark / Shark: How to access existing Hive tables in Hortonworks?

2014-05-02 Thread Mayur Rustagi
Shark communicates over JDBC with the Hive *metastore* server. There is no such thing as a Hive server; Hive stores all its data in Hadoop HDFS, which is where Shark pulls it from. Shark works on nested select queries. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

when to use broadcast variables

2014-05-02 Thread Diana Carroll
Anyone have any guidance on using a broadcast variable to ship data to workers vs. an RDD? Like, say I'm joining web logs in an RDD with user account data. I could keep the account data in an RDD or if it's "small", a broadcast variable instead. How small is small? Small enough that I know it c

Re: help me

2014-05-02 Thread Mayur Rustagi
Spark would be much faster with process_local instead of node_local. Node_local references data on the local hard disk; process_local references data already in memory within the same JVM process. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Tue,

Re: How to handle this situation: Huge File Shared by All maps and Each Computer Has one copy?

2014-05-02 Thread Mayur Rustagi
Example ...

    val pageNames = sc.textFile("pages.txt").map(...)
    val pageMap = pageNames.collect().toMap
    val bc = sc.broadcast(pageMap)
    val visits = sc.textFile("visits.txt").map(...)
    val joined = visits.map(v => (v._1, (bc.value(v._1), v._2)))

in this you are looking up pagenames in visits & tran

Re: Spark: issues with running a sbt fat jar due to akka dependencies

2014-05-02 Thread Koert Kuipers
Not sure why applying concat to reference.conf didn't work for you. Since it simply concatenates the files, the key akka.version should be preserved. We had the same situation for a while without issues. On May 1, 2014 8:46 PM, "Shivani Rao" wrote: > Hello Koert, > > That did not work. I specifie
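A build.sbt fragment sketching the setting this reply refers to, using the sbt-assembly syntax of that era (treat the exact import paths and operator as assumptions tied to your plugin version): concatenating every reference.conf keeps Akka's akka.version key in the fat jar.

```scala
// build.sbt fragment for the sbt-assembly plugin (old-style settings).
import sbtassembly.Plugin.AssemblyKeys._
import sbtassembly.Plugin.MergeStrategy

mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
  {
    case "reference.conf" => MergeStrategy.concat // preserve akka.version
    case x                => old(x)               // default for everything else
  }
}
```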

Re: Question regarding doing aggregation over custom partitions

2014-05-02 Thread Mayur Rustagi
You need to first partition the data by the key. Use mapPartitions instead of map. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Fri, May 2, 2014 at 5:33 AM, Arun Swami wrote: > Hi, > > I am a newbie to Spark. I looked
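To sketch the advice (the per-key sum is my example metric, not from the thread): the per-partition aggregation is plain Scala over a partition's iterator, and in Spark it would run after partitionBy so each key's values are co-located.

```scala
// One pass over a single partition's iterator, summing values per key.
def aggregatePartition(iter: Iterator[(String, Int)]): Iterator[(String, Int)] = {
  val sums = scala.collection.mutable.Map[String, Int]().withDefaultValue(0)
  iter.foreach { case (k, v) => sums(k) += v }
  sums.iterator
}
// In Spark, roughly:
//   rdd.partitionBy(new org.apache.spark.HashPartitioner(4))
//      .mapPartitions(aggregatePartition)
// so all values for a key land in one partition before aggregation.
```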

Re: Apache Spark is not building in Mac/Java 8

2014-05-02 Thread Prashant Sharma
I have pasted the link in my previous post. Prashant Sharma On Fri, May 2, 2014 at 4:15 PM, N.Venkata Naga Ravi wrote: > Thanks for your quick replay. > > I tried with fresh installation, it downloads sbt 0.12.4 only (please > check below logs). So it is not working. Can you tell where this 1.0

RE: Apache Spark is not building in Mac/Java 8

2014-05-02 Thread N . Venkata Naga Ravi
Thanks for your quick reply. I tried with a fresh installation; it downloads sbt 0.12.4 only (please check the logs below). So it is not working. Can you tell me where this 1.0 release candidate is located so I can try it? dhcp-173-39-68-28:spark-0.9.1 neravi$ ./sbt/sbt assembly Attempting to fetch sbt ##

Re: Apache Spark is not building in Mac/Java 8

2014-05-02 Thread Prashant Sharma
You will need to change the sbt version to 0.13.2. I think Spark 0.9.1 was released with sbt 0.13? In case not, it may not work with Java 8. Just wait for the 1.0 release or give a 1.0 release candidate a try! http://mail-archives.apache.org/mod_mbox/spark-dev/201404.mbox/%3CCABPQxstL6nwTO2H9p8%3DGJh1g2zxO

Re: Incredible slow iterative computation

2014-05-02 Thread Andrea Esposito
Sorry for the very late answer. I carefully followed what you pointed out and figured out that the structure used for each record was too big, with many small objects. After changing it, the memory usage decreased drastically. Despite that, I'm still struggling with the behaviour of decreasing performa

Apache Spark is not building in Mac/Java 8

2014-05-02 Thread N . Venkata Naga Ravi
Hi, I am trying to build Apache Spark with Java 8 on my Mac system (OS X 10.8.5), but am getting the following exception. Please help with resolving it. dhcp-173-39-68-28:spark-0.9.1 neravi$ java -version java version "1.8.0" Java(TM) SE Runtime Environment (build 1.8.0-b132) Java HotSpot(TM) 64-

Fwd: New Spark Meetup Group in London, UK. First meeting 28th May

2014-05-02 Thread Martin Goodson
This is to invite all London-based Spark users to the London Spark Meetup group (http://www.meetup.com/Spark-London/). Our first meeting is on the 28th May: http://www.meetup.com/Spark-London/events/176572432/ Our Kick-Off meeting will feature Sean Owen (Director of Data Science, Cloudera

Re: Equally weighted partitions in Spark

2014-05-02 Thread deenar.toraskar
I have equal-sized partitions now, but I want the RDD to be partitioned such that the partitions are equally weighted by some attribute of each RDD element (e.g. size or complexity). I have been looking at the RangePartitioner code and I have come up with something like an EquallyWeightedPartitioner
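One way to sketch the idea (the greedy bin-packing below is my own illustration, not the poster's EquallyWeightedPartitioner code): given a weight per key, assign the heaviest keys first to whichever partition is currently lightest, so the partitions end up roughly equally weighted.

```scala
// Greedy weighted assignment of keys to partitions.
def assignPartitions(weights: Map[String, Double], numParts: Int): Map[String, Int] = {
  val load = Array.fill(numParts)(0.0)
  weights.toSeq.sortBy(-_._2).map { case (key, w) =>
    val p = load.indices.minBy(i => load(i)) // lightest partition so far
    load(p) += w
    key -> p
  }.toMap
}
// To use with Spark, wrap the resulting map in a class extending
// org.apache.spark.Partitioner, returning the assigned index from
// getPartition, and pass it to partitionBy.
```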

RE: java.lang.ClassNotFoundException

2014-05-02 Thread İbrahim Rıza HALLAÇ
Things I tried and the errors are : String path = "/home/ubuntu/spark-0.9.1/SimpleApp/target/simple-project-1.0-allinone.jar";.. .set(path) $mvn package[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.0.2:compile (default-compile) on project simple-project: Compi
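The bare .set(path) in the message looks like the likely problem: the jar path needs to go through setJars (or the SparkContext constructor), not a plain property setter. A sketch in Scala for brevity (the poster is using Java; the path is theirs, the app name is a placeholder):

```scala
import org.apache.spark.SparkConf

// Ship the application jar to the executors via setJars.
val conf = new SparkConf()
  .setAppName("SimpleApp")
  .setJars(Seq("/home/ubuntu/spark-0.9.1/SimpleApp/target/simple-project-1.0-allinone.jar"))
```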

Re: Equally weighted partitions in Spark

2014-05-02 Thread Syed A. Hashmi
You can override the default partitioner with a range partitioner, which distributes data in roughly equal-sized partitions. On Thu, May 1, 2014 at 11:14 PM, deenar.toraskar wrote: > Yes > > On a