Re: CheckpointRDD has different number of partitions than original RDD

2014-04-08 Thread Tathagata Das
Yes, that is correct. If you are executing a Spark program across multiple machines, then you need to use a distributed file system (HDFS API compatible) for reading and writing data. In your case, your setup is across multiple machines. So what is probably happening is that the RDD data is
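
For example, a minimal sketch of checkpointing against HDFS (the master URL, namenode address, and path below are illustrative, not taken from the thread):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    val sc = new SparkContext("spark://master:7077", "CheckpointExample")
    // the checkpoint directory must be on a filesystem every node can reach
    sc.setCheckpointDir("hdfs://namenode:8020/tmp/spark-checkpoints")

    val rdd = sc.parallelize(1 to 1000).map(_ * 2)
    rdd.checkpoint()   // mark for checkpointing
    rdd.count()        // an action forces the checkpoint files to be written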

How to execute a function from class in distributed jar on each worker node?

2014-04-08 Thread Adnan
Hello, I am running a Cloudera 4-node cluster with 1 Master and 3 Slaves. I am connecting to the Spark Master from Scala using SparkContext. I am trying to execute a simple Java function from the distributed jar on every Spark Worker but haven't found a way to communicate with each worker or a Spark

Only TraversableOnce?

2014-04-08 Thread wxhsdp
In my application, data parts inside an RDD partition have relations, so I need to do some operations between them. For example, RDD T1 has several partitions; each partition has three parts A, B and C. Then I transform T1 to T2. After the transform, T2 also has three parts D, E and F, D = A+B, E =

Re: Only TraversableOnce?

2014-04-08 Thread Nan Zhu
so, the data structure looks like: D consists of D1, D2, D3 (DX is a partition) and DX consists of d1, d2, d3 (dX is a part in your context)? what you want to do is to transform DX to (d1 + d2, d1 + d3, d2 + d3)? Best, -- Nan Zhu On Tuesday, April 8, 2014 at 8:09 AM, wxhsdp wrote:

Re: Only TraversableOnce?

2014-04-08 Thread wxhsdp
Yes, how can I do this conveniently? I can use filter, but there will be so many RDDs and it's not concise -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Only-TraversableOnce-tp3873p3875.html Sent from the Apache Spark User List mailing list archive at

Re: Only TraversableOnce?

2014-04-08 Thread Nan Zhu
If that's the case, I think mapPartitions is what you need, but it seems that you have to load the whole partition into memory via toArray: rdd.mapPartitions{D => {val p = D.toArray; ...}} -- Nan Zhu On Tuesday, April 8, 2014 at 8:40 AM, wxhsdp wrote: Yes, how can I do this
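
A minimal sketch of what that could look like, assuming each partition holds exactly three numeric parts A, B and C in order (the layout is an assumption based on the question):

    val t2 = t1.mapPartitions { iter =>
      val parts = iter.toArray            // load the whole partition into memory
      val Array(a, b, c) = parts          // the three parts A, B, C
      Iterator(a + b, a + c, b + c)       // e.g. D = A+B, E = A+C, F = B+C
    }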

Re: spark-shell on standalone cluster gives error no mesos in java.library.path

2014-04-08 Thread Christoph Böhm
Forgot to post the solution. I messed up the master URL. In particular, I gave the host (master), not a URL. My bad. The error message is weird, though. Seems like the URL regex matches a bare 'master' as mesos://... No idea about the Java Runtime Environment error. On Mar 26, 2014, at 3:52 PM,

Re: Only TraversableOnce?

2014-04-08 Thread wxhsdp
Thank you for your help! Let me have a try. Nan Zhu wrote: If that's the case, I think mapPartitions is what you need, but it seems that you have to load the whole partition into memory via toArray: rdd.mapPartitions{D => {val p = D.toArray; ...}} -- Nan Zhu On Tuesday, April

Re: AWS Spark-ec2 script with different user

2014-04-08 Thread Marco Costantini
Another thing I didn't mention: the AMI and user used. Naturally, I've created several of my own AMIs with the following characteristics, none of which worked. 1) Enabling ssh as root as per this guide (http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/). When doing this, I

NPE using saveAsTextFile

2014-04-08 Thread Nick Pentreath
Hi, I'm using Spark 0.9.0. When calling saveAsTextFile on an RDD loaded from a custom Hadoop InputFormat (via newAPIHadoopRDD), I get the error below. If I call count, I get the correct number of records, so the InputFormat is being read correctly... the issue only appears when trying to
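
For context, a hedged sketch of the pattern being described; TextInputFormat stands in for the custom InputFormat, and the paths are hypothetical. Mapping the Writable values to Strings before saving is a common workaround when Writables cause serialization trouble:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

    val job = new Job()
    FileInputFormat.addInputPath(job, new Path("hdfs://namenode:8020/data/input"))
    val rdd = sc.newAPIHadoopRDD(job.getConfiguration, classOf[TextInputFormat],
                                 classOf[LongWritable], classOf[Text])
    println(rdd.count())                              // count reads the data fine
    rdd.map { case (_, value) => value.toString }     // copy the content out of the reused Writable
       .saveAsTextFile("hdfs://namenode:8020/data/output")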

Re: AWS Spark-ec2 script with different user

2014-04-08 Thread Marco Costantini
I was able to keep the workaround ...around... by overwriting the generated '/root/.ssh/authorized_keys' file with a known good one, in the '/etc/rc.local' file On Tue, Apr 8, 2014 at 10:12 AM, Marco Costantini silvio.costant...@granatads.com wrote: Another thing I didn't mention. The AMI and

Re: Spark and HBase

2014-04-08 Thread Bin Wang
Hi Flavio, I am actually attending the 2014 Apache Conf, where I heard about a project called Apache Phoenix, which fully leverages HBase and is supposed to be 1000x faster than Hive. And it is not memory bound, whereas memory sets up a limit for Spark. It is still in the incubating group and

Re: Spark and HBase

2014-04-08 Thread Christopher Nguyen
Flavio, the two are best at two orthogonal use cases: HBase on the transactional side, and Spark on the analytic side. Spark is not intended for row-based random-access updates, while being far more flexible and efficient at dataset-scale aggregations and general computations. So yes, you can easily

Re: Spark and HBase

2014-04-08 Thread Flavio Pompermaier
Thanks for the quick reply, Bin. Phoenix is something I'm going to try for sure, but it seems somehow useless if I can use Spark. Probably, as you said, since Phoenix uses a dedicated data structure within each HBase table it has more effective memory usage, but if I need to deserialize data stored in a

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
i tried again with latest master, which includes commit below, but ui page still shows nothing on storage tab. koert commit ada310a9d3d5419e101b24d9b41398f609da1ad3 Author: Andrew Or andrewo...@gmail.com Date: Mon Mar 31 23:01:14 2014 -0700 [Hot Fix #42] Persisted RDD disappears on

Re: ui broken in latest 1.0.0

2014-04-08 Thread Xiangrui Meng
That commit did work for me. Could you confirm the following: 1) After you called cache(), did you make any actions like count() or reduce()? If you don't materialize the RDD, it won't show up in the storage tab. 2) Did you run ./make-distribution.sh after you switched to the current master?
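
In other words, something like this (path illustrative) is needed before the RDD shows up under Storage:

    val data = sc.textFile("hdfs://namenode:8020/data/input.txt")
    data.cache()    // lazy: nothing is stored yet
    data.count()    // the action computes the partitions and caches them
    // only after the action should the RDD appear in the storage tab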

RDD creation on HDFS

2014-04-08 Thread gtanguy
I read in the RDD paper (http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf): "For example, an RDD representing an HDFS file has a partition for each block of the file and knows which machines each block is on." And on http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html: "To minimize
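
A quick way to observe this from the shell (path illustrative): the RDD reports roughly one partition per HDFS block, and the scheduler uses the block locations for locality.

    val logs = sc.textFile("hdfs://namenode:8020/data/big.log")
    println(logs.partitions.length)   // roughly fileSize / blockSize partitions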

assumption that lib_managed is present

2014-04-08 Thread Koert Kuipers
when i start spark-shell i now see: "ls: cannot access /usr/local/lib/spark/lib_managed/jars/: No such file or directory". we do not package a lib_managed with our spark build (never did). maybe the logic in compute-classpath.sh that searches for datanucleus should check for the existence of

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
note that for a cached rdd in the spark shell it all works fine. but something is going wrong with the spark-shell in our applications that extensively cache and re-use RDDs On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers ko...@tresata.com wrote: i tried again with latest master, which includes

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
sorry, i meant to say: note that for a cached rdd in the spark shell it all works fine. but something is going wrong with the SPARK-APPLICATION-UI in our applications that extensively cache and re-use RDDs On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers ko...@tresata.com wrote: note that for a

Re: ui broken in latest 1.0.0

2014-04-08 Thread Xiangrui Meng
That commit fixed the exact problem you described. That is why I want to confirm that you switched to the master branch. bin/spark-shell doesn't detect code changes, so you need to run ./make-distribution.sh to re-compile Spark first. -Xiangrui On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
yes i call an action after cache, and i can see that the RDDs are fully cached using context.getRDDStorageInfo which we expose via our own api. i did not run make-distribution.sh, we have our own scripts to build a distribution. however if your question is if i correctly deployed the latest
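
For reference, a small sketch of the kind of check described here, using the public sc.getRDDStorageInfo API (the output format is illustrative):

    sc.getRDDStorageInfo.foreach { info =>
      println(s"RDD ${info.id} (${info.name}): " +
        s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
        s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
    }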

Re: Spark and HBase

2014-04-08 Thread Nicholas Chammas
Just took a quick look at the overview here (http://phoenix.incubator.apache.org/) and the quick start guide here (http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html). It looks like Apache Phoenix aims to provide flexible SQL access to data, both for transactional and analytic

Re: Pig on Spark

2014-04-08 Thread Mayur Rustagi
Hi Ankit, Thanx for all the work on Pig. Finally got it working. A couple of high-level bugs right now: getting it working on Spark 0.9.0, getting UDFs working, getting generate functionality working, and an exhaustive test suite for Pig on Spark. Are you maintaining a Jira somewhere? I am

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
yes i am definitely using latest On Tue, Apr 8, 2014 at 1:07 PM, Xiangrui Meng men...@gmail.com wrote: That commit fixed the exact problem you described. That is why I want to confirm that you switched to the master branch. bin/spark-shell doesn't detect code changes, so you need to run

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
i put some println statements in BlockManagerUI. i have RDDs that are cached in memory. I see this: *** onStageSubmitted ** rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
yet at same time i can see via our own api: storageInfo: { diskSize: 0, memSize: 19944, numCachedPartitions: 1, numPartitions: 1 } On Tue, Apr 8, 2014 at 2:25 PM, Koert Kuipers ko...@tresata.com wrote: i put some println statements in BlockManagerUI

Re: assumption that lib_managed is present

2014-04-08 Thread Aaron Davidson
Yup, sorry about that. This error message should not produce incorrect behavior, but it is annoying. Posted a patch to fix it: https://github.com/apache/spark/pull/361 Thanks for reporting it! On Tue, Apr 8, 2014 at 9:54 AM, Koert Kuipers ko...@tresata.com wrote: when i start spark-shell i

java.io.IOException: Call to dev/17.29.25.4:50070 failed on local exception: java.io.EOFException

2014-04-08 Thread reegs
I am trying to read a file from HDFS in the Spark shell and getting the error below. When I create the first RDD it works fine, but when I try to do a count on that RDD, it throws a connection error. I have a single-node HDFS setup and, on the same machine, I have Spark running. Please help. When I run jps

Re: Urgently need help interpreting duration

2014-04-08 Thread Yana Kadiyska
Thank you -- this actually helped a lot. Strangely it appears that the task detail view is not accurate in 0.8 -- that view shows 425ms duration for one of the tasks, but in the driver log I do indeed see Finished TID 125 in 10940ms. On that slow worker I see the following: 14/04/08 18:06:24

Re: java.io.IOException: Call to dev/17.29.25.4:50070 failed on local exception: java.io.EOFException

2014-04-08 Thread reegs
There are a couple of issues here which I was able to find out. 1: We should not use the web port, i.e. the one we use to access the web UI. I was using that initially, so it was not working. 2: All requests should go to the NameNode and not anything else. 3: By replacing localhost:9000 in the above request, it
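
In other words (for a default single-node setup where fs.default.name is hdfs://localhost:9000; the path is illustrative):

    // wrong: 50070 is the NameNode web UI port, which speaks HTTP, not the HDFS RPC protocol
    // val rdd = sc.textFile("hdfs://localhost:50070/user/data/file.txt")
    // right: point at the NameNode RPC port from fs.default.name
    val rdd = sc.textFile("hdfs://localhost:9000/user/data/file.txt")
    println(rdd.count())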

Measuring Network Traffic for Spark Job

2014-04-08 Thread yxzhao
Hi All, I want to measure the total network traffic for a Spark job, but I did not see related information in the log. Does anybody know how to measure it? Thanks very much in advance. -- View this message in context:

ETL for postgres to hadoop

2014-04-08 Thread Manas Kar
Hi All, I have some spatial data on a Postgres machine. I want to be able to move that data to Hadoop and do some geo-processing. I tried using Sqoop to move the data to Hadoop, but it complained about the position data (which it says it can't recognize). Does anyone have any idea as to

Re: ETL for postgres to hadoop

2014-04-08 Thread andy petrella
Hello Manas, I don't know Sqoop that much, but my best guess is that you're probably using PostGIS, which has specific structures for Geometry and so on. And if you need some spatial operators, my gut feeling is that things will be harder ^^ (but a raw import won't need that...). So I did a quick

Spark with SSL?

2014-04-08 Thread kamatsuoka
Can Spark be configured to use SSL for all its network communication? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-with-SSL-tp3916.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
1) at the end of the callback 2) yes we simply expose sc.getRDDStorageInfo to the user via REST 3) yes exactly. we define the RDDs at startup, all of them are cached. from that point on we only do calculations on these cached RDDs. i will add some more println statements for storageStatusList

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
our one cached RDD in this run has id 3 *** onStageSubmitted ** rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B _rddInfoMap: Map(2 -> RDD 2

Re: Spark with SSL?

2014-04-08 Thread Andrew Ash
Not that I know of, but it would be great if that was supported. The way I typically handle security now is to put the Spark servers in their own subnet with strict inbound/outbound firewalls. On Tue, Apr 8, 2014 at 1:14 PM, kamatsuoka ken...@gmail.com wrote: Can Spark be configured to use

Re: Measuring Network Traffic for Spark Job

2014-04-08 Thread Andrew Ash
If you set up Spark's metrics reporting to write to the Ganglia backend, that will give you a good idea of how much network/disk/CPU is being used and on which machines. https://spark.apache.org/docs/0.9.0/monitoring.html On Tue, Apr 8, 2014 at 12:57 PM, yxzhao yxz...@ualr.edu wrote: Hi All,

Re: Why doesn't the driver node do any work?

2014-04-08 Thread Sean Owen
If you want the machine that hosts the driver to also do work, you can designate it as a worker too, if I'm not mistaken. I don't think the driver should do work, logically, but that's not to say that the machine it's on shouldn't do work. -- Sean Owen | Director, Data Science | London On Tue,

Re: How to execute a function from class in distributed jar on each worker node?

2014-04-08 Thread Andrew Ash
One thing you could do is create an RDD of [1,2,3] and set a partitioner that puts all three values on their own nodes. Then .foreach() over the RDD and call your function that will run on each node. Why do you need to run the function on every node? Is it some sort of setup code that needs to
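
A rough sketch of that idea, assuming three worker nodes and a hypothetical per-node setup call (note that distinct partitions do not strictly guarantee distinct machines; placement is up to the scheduler):

    import org.apache.spark.HashPartitioner

    val markers = sc.parallelize(Seq(1, 2, 3))
      .map(i => (i, i))
      .partitionBy(new HashPartitioner(3))   // spread the three keys over three partitions
    markers.foreach { case (i, _) =>
      NodeSetup.init()                       // hypothetical setup function to run on each executor
    }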

Re: Why doesn't the driver node do any work?

2014-04-08 Thread Nicholas Chammas
Alright, so I guess I understand now why spark-ec2 allows you to select different instance types for the driver node and worker nodes. If the driver node is just driving and not doing any large collect()s or heavy processing, it can be much smaller than the worker nodes. With regards to data

Re: Why doesn't the driver node do any work?

2014-04-08 Thread Nan Zhu
Maybe unrelated to the question itself, but just FYI: you can run your driver program on a worker node with Spark 0.9: http://spark.apache.org/docs/latest/spark-standalone.html#launching-applications-inside-the-cluster Best, -- Nan Zhu On Tuesday, April 8, 2014 at 5:11 PM, Nicholas Chammas

A series of meetups about machine learning with Spark in San Francisco

2014-04-08 Thread DB Tsai
Hi guys, We're going to hold a series of meetups about machine learning with Spark in San Francisco. The first one will be on April 24. Xiangrui Meng from Databricks will talk about Spark, Spark/Python, features engineering, and MLlib. See

Re: [BLOG] For Beginners

2014-04-08 Thread weida xu
Dears, I'm very interested in this. However, the links mentioned above are not accessible from China. Is there any other way to read the two blog pages? Thanks a lot. 2014-04-08 12:54 GMT+08:00 prabeesh k prabsma...@gmail.com: Hi all, Here I am sharing a blog for beginners, about creating

Re: Measuring Network Traffic for Spark Job

2014-04-08 Thread yxzhao
Thanks Andrew, I will take a look at it. On Tue, Apr 8, 2014 at 3:35 PM, Andrew Ash [via Apache Spark User List] ml-node+s1001560n3920...@n3.nabble.com wrote: If you set up Spark's metrics reporting to write to the Ganglia backend that will give you a good idea of how much network/disk/CPU is

Re: Spark RDD to Shark table IN MEMORY conversion

2014-04-08 Thread abhietc31
Anybody, please help with the above query. It's challenging but will open a new horizon for in-memory analysis. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-RDD-to-Shark-table-IN-MEMORY-conversion-tp3682p3968.html Sent from the Apache Spark User List

java.io.NotSerializableException exception - custom Accumulator

2014-04-08 Thread Dhimant Jayswal
Hi, I am getting a java.io.NotSerializableException while executing the following program: import org.apache.spark.SparkContext._ import org.apache.spark.SparkContext import org.apache.spark.AccumulatorParam object App { class Vector (val data: Array[Double]) {} implicit object VectorAP
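
One common cause of this exception is that the custom value class itself is not Serializable. A sketch of a version that works under that assumption (not necessarily the poster's exact problem):

    import org.apache.spark.AccumulatorParam

    class Vector(val data: Array[Double]) extends Serializable

    implicit object VectorAP extends AccumulatorParam[Vector] {
      def zero(v: Vector): Vector = new Vector(new Array[Double](v.data.length))
      def addInPlace(a: Vector, b: Vector): Vector =
        new Vector((a.data, b.data).zipped.map(_ + _))   // element-wise sum
    }

    // val acc = sc.accumulator(new Vector(Array(0.0, 0.0)))   // usage once a SparkContext exists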

Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread Dong Mo
Dear list, SBT compiles fine, but when I do the following: run sbt/sbt gen-idea, import the project as an SBT project into IDEA 13.1, then Make Project, these errors show up: Error:(28, 8) object FileContext is not a member of package org.apache.hadoop.fs import org.apache.hadoop.fs.{FileContext, FileStatus,

Re: Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread DB Tsai
Hi Dong, This is pretty much what I did. I ran into the same issue you have. Since I'm not developing yarn-related stuff, I just excluded those two yarn-related projects from IntelliJ, and it works. PS, you may need to exclude the java8 project as well now. Sincerely, DB Tsai

Re: Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread Sean Owen
I let IntelliJ read the Maven build directly and that works fine. -- Sean Owen | Director, Data Science | London On Wed, Apr 9, 2014 at 6:14 AM, Dong Mo monted...@gmail.com wrote: Dear list, SBT compiles fine, but when I do the following: sbt/sbt gen-idea import project as SBT project to