Re: CheckpointRDD has different number of partitions than original RDD

2014-04-08 Thread Tathagata Das
Yes, that is correct. If you are executing a Spark program across multiple machines, then you need to use a distributed file system (HDFS API compatible) for reading and writing data. In your case, your setup is across multiple machines. So what is probably happening is that the RDD data is
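
For example, a minimal sketch of checkpointing against HDFS (the master URL, namenode address, and path below are illustrative, not taken from the thread):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    val sc = new SparkContext("spark://master:7077", "CheckpointExample")
    // the checkpoint directory must be on a filesystem every node can reach
    sc.setCheckpointDir("hdfs://namenode:8020/tmp/spark-checkpoints")

    val rdd = sc.parallelize(1 to 1000).map(_ * 2)
    rdd.checkpoint()   // mark for checkpointing
    rdd.count()        // an action forces the checkpoint files to be written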

How to execute a function from class in distributed jar on each worker node?

2014-04-08 Thread Adnan
Hello, I am running a Cloudera 4-node cluster with 1 Master and 3 Slaves. I am connecting to the Spark Master from Scala using SparkContext. I am trying to execute a simple Java function from the distributed jar on every Spark Worker but haven't found a way to communicate with each worker or a Spark

Only TraversableOnce?

2014-04-08 Thread wxhsdp
In my application, data parts inside an RDD partition have relations, so I need to do some operations between them. For example, RDD T1 has several partitions; each partition has three parts A, B and C. Then I transform T1 to T2. After the transform, T2 also has three parts D, E and F, D = A+B, E =

Re: Only TraversableOnce?

2014-04-08 Thread Nan Zhu
so, the data structure looks like: D consists of D1, D2, D3 (DX is a partition) and DX consists of d1, d2, d3 (dX is a part in your context)? what you want to do is to transform DX to (d1 + d2, d1 + d3, d2 + d3)? Best, -- Nan Zhu On Tuesday, April 8, 2014 at 8:09 AM, wxhsdp wrote:

Re: Only TraversableOnce?

2014-04-08 Thread wxhsdp
Yes, how can I do this conveniently? I can use filter, but there will be so many RDDs and it's not concise -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Only-TraversableOnce-tp3873p3875.html Sent from the Apache Spark User List mailing list archive at

Re: Only TraversableOnce?

2014-04-08 Thread Nan Zhu
If that's the case, I think mapPartitions is what you need, but it seems that you have to load the whole partition into memory via toArray: rdd.mapPartitions{D => {val p = D.toArray; ...}} -- Nan Zhu On Tuesday, April 8, 2014 at 8:40 AM, wxhsdp wrote: Yes, how can I do this
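
A minimal sketch of what that could look like, assuming each partition holds exactly three numeric parts A, B and C in order (the layout is an assumption based on the question):

    val t2 = t1.mapPartitions { iter =>
      val parts = iter.toArray            // load the whole partition into memory
      val Array(a, b, c) = parts          // the three parts A, B, C
      Iterator(a + b, a + c, b + c)       // e.g. D = A+B, E = A+C, F = B+C
    }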

Re: spark-shell on standalone cluster gives error no mesos in java.library.path

2014-04-08 Thread Christoph Böhm
Forgot to post the solution. I messed up the master URL. In particular, I gave the host (master), not a URL. My bad. The error message is weird, though. Seems like the URL regex matches a bare 'master' as mesos://... No idea about the Java Runtime Environment error. On Mar 26, 2014, at 3:52 PM,

Re: Only TraversableOnce?

2014-04-08 Thread wxhsdp
Thank you for your help! Let me have a try. Nan Zhu wrote: If that's the case, I think mapPartitions is what you need, but it seems that you have to load the whole partition into memory via toArray: rdd.mapPartitions{D => {val p = D.toArray; ...}} -- Nan Zhu On Tuesday, April

Re: AWS Spark-ec2 script with different user

2014-04-08 Thread Marco Costantini
Another thing I didn't mention: the AMI and user used. Naturally, I've created several of my own AMIs with the following characteristics, none of which worked. 1) Enabling ssh as root as per this guide (http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/). When doing this, I

NPE using saveAsTextFile

2014-04-08 Thread Nick Pentreath
Hi, I'm using Spark 0.9.0. When calling saveAsTextFile on an RDD loaded from a custom Hadoop InputFormat (via newAPIHadoopRDD), I get the error below. If I call count, I get the correct number of records, so the InputFormat is being read correctly... the issue only appears when trying to
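
For context, a hedged sketch of the pattern being described; TextInputFormat stands in for the custom InputFormat, and the paths are hypothetical. Mapping the Writable values to Strings before saving is a common workaround when Writables cause serialization trouble:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

    val job = new Job()
    FileInputFormat.addInputPath(job, new Path("hdfs://namenode:8020/data/input"))
    val rdd = sc.newAPIHadoopRDD(job.getConfiguration, classOf[TextInputFormat],
                                 classOf[LongWritable], classOf[Text])
    println(rdd.count())                              // count reads the data fine
    rdd.map { case (_, value) => value.toString }     // copy the content out of the reused Writable
       .saveAsTextFile("hdfs://namenode:8020/data/output")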

Re: AWS Spark-ec2 script with different user

2014-04-08 Thread Marco Costantini
I was able to keep the workaround ...around... by overwriting the generated '/root/.ssh/authorized_keys' file with a known good one, in the '/etc/rc.local' file On Tue, Apr 8, 2014 at 10:12 AM, Marco Costantini silvio.costant...@granatads.com wrote: Another thing I didn't mention. The AMI and

Re: Spark and HBase

2014-04-08 Thread Bin Wang
Hi Flavio, I am actually attending the 2014 Apache Conf, where I heard about a project called Apache Phoenix, which fully leverages HBase and is supposed to be 1000x faster than Hive. And it is not memory bound, whereas memory sets up a limit for Spark. It is still in the incubating group and

Re: Spark and HBase

2014-04-08 Thread Christopher Nguyen
Flavio, the two are best at two orthogonal use cases: HBase on the transactional side, and Spark on the analytic side. Spark is not intended for row-based random-access updates, while being far more flexible and efficient at dataset-scale aggregations and general computations. So yes, you can easily

Re: Spark and HBase

2014-04-08 Thread Flavio Pompermaier
Thanks for the quick reply, Bin. Phoenix is something I'm going to try for sure, but it seems somehow useless if I can use Spark. Probably, as you said, since Phoenix uses a dedicated data structure within each HBase table it has more effective memory usage, but if I need to deserialize data stored in a

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
i tried again with latest master, which includes commit below, but ui page still shows nothing on storage tab. koert commit ada310a9d3d5419e101b24d9b41398f609da1ad3 Author: Andrew Or andrewo...@gmail.com Date: Mon Mar 31 23:01:14 2014 -0700 [Hot Fix #42] Persisted RDD disappears on

Re: ui broken in latest 1.0.0

2014-04-08 Thread Xiangrui Meng
That commit did work for me. Could you confirm the following: 1) After you called cache(), did you make any actions like count() or reduce()? If you don't materialize the RDD, it won't show up in the storage tab. 2) Did you run ./make-distribution.sh after you switched to the current master?
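
In other words, something like this (path illustrative) is needed before the RDD shows up under Storage:

    val data = sc.textFile("hdfs://namenode:8020/data/input.txt")
    data.cache()    // lazy: nothing is stored yet
    data.count()    // the action computes the partitions and caches them
    // only after the action should the RDD appear in the storage tab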

RDD creation on HDFS

2014-04-08 Thread gtanguy
I read in the RDD paper (http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf): "For example, an RDD representing an HDFS file has a partition for each block of the file and knows which machines each block is on." And on http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html: "To minimize
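
A quick way to observe this from the shell (path illustrative): the RDD reports roughly one partition per HDFS block, and the scheduler uses the block locations for locality.

    val logs = sc.textFile("hdfs://namenode:8020/data/big.log")
    println(logs.partitions.length)   // roughly fileSize / blockSize partitions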

assumption that lib_managed is present

2014-04-08 Thread Koert Kuipers
when i start spark-shell i now see: "ls: cannot access /usr/local/lib/spark/lib_managed/jars/: No such file or directory". we do not package a lib_managed with our spark build (never did). maybe the logic in compute-classpath.sh that searches for datanucleus should check for the existence of

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
note that for a cached rdd in the spark shell it all works fine. but something is going wrong with the spark-shell in our applications that extensively cache and re-use RDDs On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers ko...@tresata.com wrote: i tried again with latest master, which includes

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
sorry, i meant to say: note that for a cached rdd in the spark shell it all works fine. but something is going wrong with the SPARK-APPLICATION-UI in our applications that extensively cache and re-use RDDs On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers ko...@tresata.com wrote: note that for a

Re: ui broken in latest 1.0.0

2014-04-08 Thread Xiangrui Meng
That commit fixed the exact problem you described. That is why I want to confirm that you switched to the master branch. bin/spark-shell doesn't detect code changes, so you need to run ./make-distribution.sh to re-compile Spark first. -Xiangrui On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
yes i call an action after cache, and i can see that the RDDs are fully cached using context.getRDDStorageInfo which we expose via our own api. i did not run make-distribution.sh, we have our own scripts to build a distribution. however if your question is if i correctly deployed the latest
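
For reference, a small sketch of the kind of check described here, using the public sc.getRDDStorageInfo API (the output format is illustrative):

    sc.getRDDStorageInfo.foreach { info =>
      println(s"RDD ${info.id} (${info.name}): " +
        s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
        s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
    }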

Re: Spark and HBase

2014-04-08 Thread Nicholas Chammas
Just took a quick look at the overview here (http://phoenix.incubator.apache.org/) and the quick start guide here (http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html). It looks like Apache Phoenix aims to provide flexible SQL access to data, both for transactional and analytic

Re: Pig on Spark

2014-04-08 Thread Mayur Rustagi
Hi Ankit, Thanx for all the work on Pig. Finally got it working. A couple of high-level bugs right now: getting it working on Spark 0.9.0, getting UDFs working, getting generate functionality working, and an exhaustive test suite for Pig on Spark. Are you maintaining a Jira somewhere? I am

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
yes i am definitely using latest On Tue, Apr 8, 2014 at 1:07 PM, Xiangrui Meng men...@gmail.com wrote: That commit fixed the exact problem you described. That is why I want to confirm that you switched to the master branch. bin/spark-shell doesn't detect code changes, so you need to run

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
i put some println statements in BlockManagerUI. i have RDDs that are cached in memory. I see this: *** onStageSubmitted ** rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
yet at same time i can see via our own api: storageInfo: { diskSize: 0, memSize: 19944, numCachedPartitions: 1, numPartitions: 1 } On Tue, Apr 8, 2014 at 2:25 PM, Koert Kuipers ko...@tresata.com wrote: i put some println statements in BlockManagerUI

Re: assumption that lib_managed is present

2014-04-08 Thread Aaron Davidson
Yup, sorry about that. This error message should not produce incorrect behavior, but it is annoying. Posted a patch to fix it: https://github.com/apache/spark/pull/361 Thanks for reporting it! On Tue, Apr 8, 2014 at 9:54 AM, Koert Kuipers ko...@tresata.com wrote: when i start spark-shell i

java.io.IOException: Call to dev/17.29.25.4:50070 failed on local exception: java.io.EOFException

2014-04-08 Thread reegs
I am trying to read a file from HDFS in the Spark shell and getting the error below. When I create the first RDD it works fine, but when I try to do a count on that RDD, it throws a connection error. I have a single-node HDFS setup and, on the same machine, I have Spark running. Please help. When I run jps

Re: Urgently need help interpreting duration

2014-04-08 Thread Yana Kadiyska
Thank you -- this actually helped a lot. Strangely it appears that the task detail view is not accurate in 0.8 -- that view shows 425ms duration for one of the tasks, but in the driver log I do indeed see Finished TID 125 in 10940ms. On that slow worker I see the following: 14/04/08 18:06:24

Re: java.io.IOException: Call to dev/17.29.25.4:50070 failed on local exception: java.io.EOFException

2014-04-08 Thread reegs
There are a couple of issues here which I was able to find out. 1: We should not use the web port, i.e. the one we use to access the web UI. I was using that initially, so it was not working. 2: All requests should go to the NameNode and not anything else. 3: By replacing localhost:9000 in the above request, it
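
In other words (for a default single-node setup where fs.default.name is hdfs://localhost:9000; the path is illustrative):

    // wrong: 50070 is the NameNode web UI port, which speaks HTTP, not the HDFS RPC protocol
    // val rdd = sc.textFile("hdfs://localhost:50070/user/data/file.txt")
    // right: point at the NameNode RPC port from fs.default.name
    val rdd = sc.textFile("hdfs://localhost:9000/user/data/file.txt")
    println(rdd.count())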

Measuring Network Traffic for Spark Job

2014-04-08 Thread yxzhao
Hi All, I want to measure the total network traffic for a Spark job, but I did not see related information in the log. Does anybody know how to measure it? Thanks very much in advance. -- View this message in context:

ETL for postgres to hadoop

2014-04-08 Thread Manas Kar
Hi All, I have some spatial data on a Postgres machine. I want to be able to move that data to Hadoop and do some geo-processing. I tried using Sqoop to move the data to Hadoop, but it complained about the position data (which it says it can't recognize). Does anyone have any idea as to

Re: ETL for postgres to hadoop

2014-04-08 Thread andy petrella
Hello Manas, I don't know Sqoop that much, but my best guess is that you're probably using PostGIS, which has specific structures for Geometry and so on. And if you need some spatial operators, my gut feeling is that things will be harder ^^ (but a raw import won't need that...). So I did a quick

Spark with SSL?

2014-04-08 Thread kamatsuoka
Can Spark be configured to use SSL for all its network communication? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-with-SSL-tp3916.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
1) at the end of the callback 2) yes we simply expose sc.getRDDStorageInfo to the user via REST 3) yes exactly. we define the RDDs at startup, all of them are cached. from that point on we only do calculations on these cached RDDs. i will add some more println statements for storageStatusList

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
our one cached RDD in this run has id 3 *** onStageSubmitted ** rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B _rddInfoMap: Map(2 -> RDD 2

Re: Spark with SSL?

2014-04-08 Thread Andrew Ash
Not that I know of, but it would be great if that was supported. The way I typically handle security now is to put the Spark servers in their own subnet with strict inbound/outbound firewalls. On Tue, Apr 8, 2014 at 1:14 PM, kamatsuoka ken...@gmail.com wrote: Can Spark be configured to use

Re: Measuring Network Traffic for Spark Job

2014-04-08 Thread Andrew Ash
If you set up Spark's metrics reporting to write to the Ganglia backend, that will give you a good idea of how much network/disk/CPU is being used and on which machines. https://spark.apache.org/docs/0.9.0/monitoring.html On Tue, Apr 8, 2014 at 12:57 PM, yxzhao yxz...@ualr.edu wrote: Hi All,

Re: Why doesn't the driver node do any work?

2014-04-08 Thread Sean Owen
If you want the machine that hosts the driver to also do work, you can designate it as a worker too, if I'm not mistaken. I don't think the driver should do work, logically, but that's not to say that the machine it's on shouldn't do work. -- Sean Owen | Director, Data Science | London On Tue,

Re: How to execute a function from class in distributed jar on each worker node?

2014-04-08 Thread Andrew Ash
One thing you could do is create an RDD of [1,2,3] and set a partitioner that puts all three values on their own nodes. Then .foreach() over the RDD and call your function that will run on each node. Why do you need to run the function on every node? Is it some sort of setup code that needs to
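
A rough sketch of that idea, assuming three worker nodes and a hypothetical per-node setup call (note that distinct partitions do not strictly guarantee distinct machines; placement is up to the scheduler):

    import org.apache.spark.HashPartitioner

    val markers = sc.parallelize(Seq(1, 2, 3))
      .map(i => (i, i))
      .partitionBy(new HashPartitioner(3))   // spread the three keys over three partitions
    markers.foreach { case (i, _) =>
      NodeSetup.init()                       // hypothetical setup function to run on each executor
    }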

Re: Why doesn't the driver node do any work?

2014-04-08 Thread Nicholas Chammas
Alright, so I guess I understand now why spark-ec2 allows you to select different instance types for the driver node and worker nodes. If the driver node is just driving and not doing any large collect()s or heavy processing, it can be much smaller than the worker nodes. With regards to data

Re: Why doesn't the driver node do any work?

2014-04-08 Thread Nan Zhu
Maybe unrelated to the question itself, but just FYI: you can run your driver program on a worker node with Spark 0.9: http://spark.apache.org/docs/latest/spark-standalone.html#launching-applications-inside-the-cluster Best, -- Nan Zhu On Tuesday, April 8, 2014 at 5:11 PM, Nicholas Chammas

A series of meetups about machine learning with Spark in San Francisco

2014-04-08 Thread DB Tsai
Hi guys, We're going to hold a series of meetups about machine learning with Spark in San Francisco. The first one will be on April 24. Xiangrui Meng from Databricks will talk about Spark, Spark/Python, features engineering, and MLlib. See

Re: [BLOG] For Beginners

2014-04-08 Thread weida xu
Dears, I'm very interested in this. However, the links mentioned above are not accessible from China. Is there any other way to read the two blog pages? Thanks a lot. 2014-04-08 12:54 GMT+08:00 prabeesh k prabsma...@gmail.com: Hi all, Here I am sharing a blog for beginners, about creating

Re: Measuring Network Traffic for Spark Job

2014-04-08 Thread yxzhao
Thanks Andrew, I will take a look at it. On Tue, Apr 8, 2014 at 3:35 PM, Andrew Ash [via Apache Spark User List] ml-node+s1001560n3920...@n3.nabble.com wrote: If you set up Spark's metrics reporting to write to the Ganglia backend that will give you a good idea of how much network/disk/CPU is

Re: Spark RDD to Shark table IN MEMORY conversion

2014-04-08 Thread abhietc31
Anybody, please help with the above query. It's challenging but will open a new horizon for in-memory analysis. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-RDD-to-Shark-table-IN-MEMORY-conversion-tp3682p3968.html Sent from the Apache Spark User List

java.io.NotSerializableException exception - custom Accumulator

2014-04-08 Thread Dhimant Jayswal
Hi, I am getting a java.io.NotSerializableException while executing the following program: import org.apache.spark.SparkContext._ import org.apache.spark.SparkContext import org.apache.spark.AccumulatorParam object App { class Vector (val data: Array[Double]) {} implicit object VectorAP
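
One common cause of this exception is that the custom value class itself is not Serializable. A sketch of a version that works under that assumption (not necessarily the poster's exact problem):

    import org.apache.spark.AccumulatorParam

    class Vector(val data: Array[Double]) extends Serializable

    implicit object VectorAP extends AccumulatorParam[Vector] {
      def zero(v: Vector): Vector = new Vector(new Array[Double](v.data.length))
      def addInPlace(a: Vector, b: Vector): Vector =
        new Vector((a.data, b.data).zipped.map(_ + _))   // element-wise sum
    }

    // val acc = sc.accumulator(new Vector(Array(0.0, 0.0)))   // usage once a SparkContext exists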

Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread Dong Mo
Dear list, SBT compiles fine, but when I do the following: run sbt/sbt gen-idea, import the project as an SBT project into IDEA 13.1, then Make Project, these errors show up: Error:(28, 8) object FileContext is not a member of package org.apache.hadoop.fs import org.apache.hadoop.fs.{FileContext, FileStatus,

Re: Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread DB Tsai
Hi Dong, This is pretty much what I did. I ran into the same issue you have. Since I'm not developing yarn-related stuff, I just excluded those two yarn-related projects from IntelliJ, and it works. PS, you may need to exclude the java8 project as well now. Sincerely, DB Tsai

Re: Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread Sean Owen
I let IntelliJ read the Maven build directly and that works fine. -- Sean Owen | Director, Data Science | London On Wed, Apr 9, 2014 at 6:14 AM, Dong Mo monted...@gmail.com wrote: Dear list, SBT compiles fine, but when I do the following: sbt/sbt gen-idea import project as SBT project to