Re: rdd.saveAsTextFile blows up

2014-07-25 Thread Akhil Das
Most likely you are closing the connection to HDFS. Can you paste the piece of code that you are executing? We were having a similar problem when we closed the FileSystem object in our code. Thanks Best Regards On Thu, Jul 24, 2014 at 11:00 PM, Eric Friedman eric.d.fried...@gmail.com wrote:
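A minimal sketch of the kind of code that can cause this (hypothetical, not taken from the thread): Hadoop's FileSystem.get normally returns a shared, cached instance, so closing it in application code can break the connection that a later saveAsTextFile relies on.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // FileSystem.get returns a cached instance shared with Spark's own HDFS access
    val fs = FileSystem.get(new Configuration())
    val inputExists = fs.exists(new Path("/some/input"))
    fs.close()  // closing the shared instance here can make a later
                // rdd.saveAsTextFile(...) fail with "Filesystem closed"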

Re: rdd.saveAsTextFile blows up

2014-07-25 Thread Eric Friedman
I ported the same code to Scala. No problems. But in pyspark, this fails consistently: ctx = SQLContext(sc) pf = ctx.parquetFile(...) rdd = pf.map(lambda x: x) crdd = ctx.inferSchema(rdd) crdd.saveAsParquetFile(...) If I do rdd = sc.parallelize(["hello", "world"]) rdd.saveAsTextFile(...) It works.

Re: cache changes precision

2014-07-25 Thread Ron Gonzalez
Cool I'll take a look and give it a try! Thanks, Ron Sent from my iPad On Jul 24, 2014, at 10:35 PM, Andrew Ash and...@andrewash.com wrote: Hi Ron, I think you're encountering the issue where caching data from Hadoop ends up with many duplicate values instead of what you expect. Try
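For reference, a minimal sketch of the usual workaround for that record-reuse behaviour (the path and types are placeholders): Hadoop input formats reuse the same Writable objects for every record, so caching such an RDD directly can leave it full of references to one mutated object; copying to immutable values before caching avoids it.

    import org.apache.hadoop.io.{LongWritable, Text}

    // Copy the reused Writables out to plain Scala values before caching
    val raw = sc.sequenceFile("hdfs:///some/path", classOf[LongWritable], classOf[Text])
    val cached = raw.map { case (k, v) => (k.get, v.toString) }.cache()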

Re: Down-scaling Spark on EC2 cluster

2014-07-25 Thread Shubhabrata
Any idea about the probable dates for this implementation? I believe it would be a wonderful (and essential) piece of functionality for gaining more acceptance in the community.

Re: EOFException when I list all files in hdfs directory

2014-07-25 Thread Akhil Das
Try without the *: val avroRdd = sc.newAPIHadoopFile("hdfs://url:8020/my dir/", classOf[AvroSequenceFileInputFormat[AvroKey[GenericRecord],NullWritable]], classOf[AvroKey[GenericRecord]], classOf[NullWritable]) avroRdd.collect() Thanks Best Regards On Fri, Jul 25, 2014 at 7:22 PM, Sparky

Re: EOFException when I list all files in hdfs directory

2014-07-25 Thread Sparky
Thanks for the suggestion. I can confirm that my problem is that I have files with zero bytes. It's a known bug and is marked as a high priority: https://issues.apache.org/jira/browse/SPARK-1960

Re: EOFException when I list all files in hdfs directory

2014-07-25 Thread Bertrand Dechoux
Well, anyone can open an account on Apache JIRA and post a new ticket/enhancement/issue/bug... Bertrand Dechoux On Fri, Jul 25, 2014 at 4:07 PM, Sparky gullo_tho...@bah.com wrote: Thanks for the suggestion. I can confirm that my problem is that I have files with zero bytes. It's a known bug and

How to pass additional options to Mesos when submitting job?

2014-07-25 Thread Krisztián Szűcs
Hi, We’re trying to use Docker containerization within Mesos via Deimos. We’re submitting Spark jobs from localhost to our cluster. We’ve managed to get it to work (with a fixed Deimos configuration), but we have issues with passing some options (like a job-dependent container image) in TaskInfo to Mesos

NMF implementation in Spark

2014-07-25 Thread Aureliano Buendia
Hi, Is there an implementation for Nonnegative Matrix Factorization in Spark? I understand that MLlib comes with matrix factorization, but it does not seem to cover the nonnegative case.

Re: Strange exception on coalesce()

2014-07-25 Thread Sean Owen
I'm pretty sure this was already fixed last week in SPARK-2414: https://github.com/apache/spark/commit/7c23c0dc3ed721c95690fc49f435d9de6952523c On Fri, Jul 25, 2014 at 1:34 PM, innowireless TaeYun Kim taeyun@innowireless.co.kr wrote: Hi, I'm using Spark 1.0.0. On filter() - map() -

Re: mapToPair vs flatMapToPair vs flatMap function usage.

2014-07-25 Thread Daniel Siegmann
The map and flatMap methods have a similar purpose, but map is 1 to 1, while flatMap is 1 to 0-N (outputting 0 is similar to a filter, except of course it could be outputting a different type). On Thu, Jul 24, 2014 at 6:41 PM, abhiguruvayya sharath.abhis...@gmail.com wrote: Can any one help me
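A tiny illustration of that difference (assuming an existing SparkContext sc; the data is made up):

    val lines = sc.parallelize(Seq("spark streaming", "rdd"))

    // map: exactly one output element per input element (a word count per line)
    val lengths = lines.map(line => line.split(" ").length)   // 2 elements: 2, 1

    // flatMap: zero or more output elements per input element
    val words = lines.flatMap(line => line.split(" "))        // 3 elements: spark, streaming, rdd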

Support for Percentile and Variance Aggregation functions in Spark with HiveContext

2014-07-25 Thread vinay . kashyap
Hi all, I am using Spark 1.0.0 with CDH 5.1.0. I want to aggregate the data in a raw table using a simple query like the one below: SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4), year, month, day FROM raw_data_table GROUP BY year, month, day. MIN, MAX and AVG functions work fine for

Initial job has not accepted any resources (but workers are in UI)

2014-07-25 Thread Ed Sweeney
Hi all, Amazon Linux, AWS, Spark 1.0.1 reading a file. The UI shows there are workers and shows this app context with the 2 tasks waiting. All the hostnames resolve properly so I am guessing the message is correct and that the workers won't accept the job for mem reasons. What params do I

Re: Down-scaling Spark on EC2 cluster

2014-07-25 Thread Nicholas Chammas
No idea. Right now implementing this is up for grabs by the community. On Fri, Jul 25, 2014 at 5:40 AM, Shubhabrata mail2shu...@gmail.com wrote: Any idea about the probable dates for this implementation? I believe it would be a wonderful (and essential) functionality to gain more acceptance

Re: Hadoop client protocol mismatch with spark 1.0.1, cdh3u5

2014-07-25 Thread Bharath Ravi Kumar
Any suggestions to work around this issue? The pre-built Spark binaries don't appear to work against CDH as documented, unless there's a build issue, which seems unlikely. On 25-Jul-2014 3:42 pm, Bharath Ravi Kumar reachb...@gmail.com wrote: I'm encountering a hadoop client protocol mismatch

Re: NMF implementation in Spark

2014-07-25 Thread Xiangrui Meng
It is ALS with setNonnegative. -Xiangrui On Fri, Jul 25, 2014 at 7:38 AM, Aureliano Buendia buendia...@gmail.com wrote: Hi, Is there an implementation for Nonnegative Matrix Factorization in Spark? I understand that MLlib comes with matrix factorization, but it does not seem to cover the
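A rough sketch of what that looks like, assuming MLlib's ALS builder exposes the setNonnegative setter mentioned above (the ratings RDD and parameter values are placeholders):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // ratings: RDD[Rating] built elsewhere
    val model = new ALS()
      .setRank(10)
      .setIterations(20)
      .setLambda(0.01)
      .setNonnegative(true)   // constrain the learned factors to be nonnegative
      .run(ratings)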

Re: Spark got stuck with a loop

2014-07-25 Thread Denis RP
Can anyone help? I'm using Spark 1.0.1. I'm confused: if the block is found, why are no non-empty blocks retrieved, and why does the process keep going forever? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-got-stuck-with-a-loop-tp10590p10663.html

sharing spark context among machines

2014-07-25 Thread myxjtu
Is it possible now to share a spark context among machines (through serialization or some other way)? I am looking for possible ways to make spark job submission HA (highly available). For example, if a job submitted to machine A fails in the middle (due to machine A crashing), I want this

Re: Hadoop client protocol mismatch with spark 1.0.1, cdh3u5

2014-07-25 Thread Sean Owen
This indicates your app is not actually using the version of the HDFS client you think. You built Spark from source with the right deps it seems, but are you sure you linked to your build in your app? On Fri, Jul 25, 2014 at 4:32 PM, Bharath Ravi Kumar reachb...@gmail.com wrote: Any suggestions

Re: Questions about disk IOs

2014-07-25 Thread Xiangrui Meng
How many partitions did you use and how many CPU cores in total? The former shouldn't be much larger than the latter. Could you also check the shuffle size from the WebUI? -Xiangrui On Fri, Jul 25, 2014 at 4:10 AM, Charles Li littlee1...@gmail.com wrote: Hi Xiangrui, Thanks for your

Re: Questions about disk IOs

2014-07-25 Thread Charles Li
Hi Xiangrui, I have 16 * 40 cpu cores in total. But I am only using 200 partitions on the 200 executors. I use coalesce without shuffle to reduce the default partitioning of the RDD. The shuffle size from the WebUI is nearly 100m. On Jul 25, 2014, at 23:51, Xiangrui Meng men...@gmail.com wrote:
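For context, a minimal sketch of the two coalesce variants being discussed (rdd and the partition counts are placeholders):

    // coalesce without a shuffle merges existing partitions in place
    val narrowed = rdd.coalesce(200)

    // coalesce with shuffle = true (equivalent to repartition) redistributes the data
    // and can also increase the partition count, e.g. closer to the number of cores
    val reshuffled = rdd.coalesce(640, shuffle = true)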

Issue submitting spark job to yarn

2014-07-25 Thread Ron Gonzalez
Folks, I've been able to submit simple jobs to YARN thus far. However, when I do something more complicated that adds 194 dependency jars using --addJars, the job fails in YARN with no logs. What ends up happening is that no container logs get created (app master or executor). If I add just

Re: Initial job has not accepted any resources (but workers are in UI)

2014-07-25 Thread Navicore
thx for the reply, the UI says my application has cores and mem: ID app-20140725164107-0001, Name SectionsAndSeamsPipeline, Cores 6, Memory per Node 512.0 MB, Submitted Time 2014/07/25 16:41:07, User tercel, State RUNNING, Duration 21 s

Re: Hadoop client protocol mismatch with spark 1.0.1, cdh3u5

2014-07-25 Thread Bharath Ravi Kumar
Thanks for responding. I used the pre-built Spark binaries meant for hadoop1, cdh3u5. I do not intend to build Spark against a specific distribution. Irrespective of whether I build my app with the explicit CDH hadoop-client dependency, I get the same error message. I also verified that my app's

Re: Hadoop client protocol mismatch with spark 1.0.1, cdh3u5

2014-07-25 Thread Sean Owen
If you link against the pre-built binary, that's for Hadoop 1.0.4. Can you show your deps to clarify what you are depending on? Building custom Spark and depending on it is a different thing from depending on plain Spark and changing its deps. I think you want the latter. On Fri, Jul 25, 2014 at

Using Spark Streaming with Kafka 0.7.2

2014-07-25 Thread maddenpj
Hi all, Currently we have Kafka 0.7.2 running in production and can't upgrade for external reasons; however, Spark Streaming (1.0.1) was built against Kafka 0.8.0. What is the best way to use Spark Streaming with older versions of Kafka? Currently I'm investigating trying to build spark streaming

sparkcontext stop and then start again

2014-07-25 Thread Mohit Jaggi
Folks, I had some pyspark code which used to hang with no useful debug logs. It got fixed when I changed my code to keep the sparkcontext forever instead of stopping it and then creating another one later. Is this a bug or expected behavior? Mohit.

Re: Are all transformations lazy?

2014-07-25 Thread Rico
It may be confusing at first, but there is also an important difference between the reduce and reduceByKey operations. reduce is an action on an RDD. Hence, it will trigger the evaluation of the transformations that produced the RDD. In contrast, reduceByKey is a transformation on PairRDDs, not an
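A small example of the distinction (assuming an existing SparkContext sc):

    import org.apache.spark.SparkContext._   // brings reduceByKey onto pair RDDs (pre-1.3 API)

    val nums  = sc.parallelize(1 to 4)
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // reduce is an action: it triggers evaluation and returns a plain value
    val total = nums.reduce(_ + _)         // 10

    // reduceByKey is a transformation: it returns another (lazy) RDD
    val sums = pairs.reduceByKey(_ + _)    // ("a", 3), ("b", 3) once evaluated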

Re: Caching issue with msg: RDD block could not be dropped from memory as it does not exist

2014-07-25 Thread Rico
I figured out the issue. In fact, I did not realize before that when loaded into memory, the data is deserialized. As a result, what seems to be a 21 GB dataset occupies 77 GB in memory. Details about this are clearly explained in the guide on serialization and memory tuning
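For reference, a hedged sketch of the serialized storage level that the tuning guide describes, which keeps the cached data as bytes and trades CPU for a smaller footprint (rdd is a placeholder):

    import org.apache.spark.storage.StorageLevel

    // MEMORY_ONLY stores deserialized objects (hence the 21 GB -> 77 GB growth above);
    // MEMORY_ONLY_SER keeps serialized bytes at the cost of extra CPU on access
    val cached = rdd.persist(StorageLevel.MEMORY_ONLY_SER)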

Re: Decision tree classifier in MLlib

2014-07-25 Thread SK
yes, the output is continuous. So I used a threshold to get binary labels. If prediction < threshold, then the class is 0, else 1. I then use this binary label to compute the accuracy. Even with this binary transformation, the accuracy with the decision tree model is low compared to LR or SVM (for the
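A minimal sketch of that thresholding step (the model, the data, and the 0.5 cutoff are placeholders, not values from the thread):

    // data: RDD[LabeledPoint]; model: any model exposing predict(features): Double
    val threshold = 0.5
    val predictionAndLabel = data.map { p =>
      val score = model.predict(p.features)
      (if (score < threshold) 0.0 else 1.0, p.label)
    }
    val accuracy = predictionAndLabel.filter { case (pred, label) => pred == label }
      .count.toDouble / data.count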

Re: memory leak query

2014-07-25 Thread Rico
Hi Michael, I had a similar question (http://apache-spark-user-list.1001560.n3.nabble.com/Caching-issue-with-msg-RDD-block-could-not-be-dropped-from-memory-as-it-does-not-exist-td10248.html#a10677) before. My problem was that my data was too large to be cached in memory because of

Re: Decision tree classifier in MLlib

2014-07-25 Thread Evan R. Sparks
Can you share the dataset via a gist or something and we can take a look at what's going on? On Fri, Jul 25, 2014 at 10:51 AM, SK skrishna...@gmail.com wrote: yes, the output is continuous. So I used a threshold to get binary labels. If prediction < threshold, then class is 0 else 1. I use

Re: Hadoop client protocol mismatch with spark 1.0.1, cdh3u5

2014-07-25 Thread Bharath Ravi Kumar
That's right, I'm looking to depend on spark in general and change only the hadoop client deps. The spark master and slaves use the spark-1.0.1-bin-hadoop1 binaries from the downloads page. The relevant snippet from the app's maven pom is as follows: dependency

Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
Hi All, I am trying to load data from Hive tables using Spark SQL. I am using spark-shell. Here is what I see: val trainingDataTable = sql("SELECT prod.prod_num, demographics.gender, demographics.birth_year, demographics.income_group FROM prod p JOIN demographics d ON d.user_id = p.user_id")

RE: Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
Hi Jerry, Thanks for your reply. I was following the steps in this programming guide; it does not mention anything about creating a HiveContext or using HQL explicitly. http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html Users(userId INT, name String, email

RE: Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
Thanks, Jerry. Date: Fri, 25 Jul 2014 17:48:27 -0400 Subject: Re: Spark SQL and Hive tables From: chiling...@gmail.com To: user@spark.apache.org Hi Sameer, The blog post you referred to is about Spark SQL. I don't think the intent of the article is to guide you on how to read data from Hive
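For what it's worth, reading existing Hive tables from the shell goes through HiveContext rather than the plain SQLContext; a minimal sketch against the 1.0.x API (assumes a Hive-enabled Spark build and a placeholder table name):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // hql(...) runs a HiveQL statement against the Hive metastore (Spark 1.0.x API)
    val rows = hiveContext.hql("SELECT * FROM some_hive_table LIMIT 10")
    rows.collect().foreach(println)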

RE: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-25 Thread Andrew Lee
Hi Jianshi, Could you tell us which HBase version you're using? By the way, a quick sanity check: can the Workers access HBase? Were you able to manually write one record to HBase with the serialize function? Hardcode and test it? From: jianshi.hu...@gmail.com Date: Fri, 25 Jul 2014

Re: Initial job has not accepted any resources (but workers are in UI)

2014-07-25 Thread Navicore
Solution: opened all ports on the EC2 machine that the driver was running on. Need to narrow down what ports Akka wants... but the issue is solved.
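As a possible follow-up for narrowing those ports, a sketch assuming the standard spark.driver.port property (the port number is an arbitrary example; other ports may still be chosen randomly):

    import org.apache.spark.SparkConf

    // Pin the driver's port so a single known port can be opened for it
    val conf = new SparkConf()
      .setAppName("SectionsAndSeamsPipeline")
      .set("spark.driver.port", "51000")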

Re: spark streaming actor receiver doesn't play well with kryoserializer

2014-07-25 Thread Alan Ngai
The stack trace was from running the Actor count sample directly, without a spark cluster, so I guess the logs would be from both? I enabled more logging and got this stack trace: 14/07/25 17:55:26 [INFO] SecurityManager: Changing view acls to: alan 14/07/25 17:55:26 [INFO] SecurityManager:

Re: Kryo Issue on Spark 1.0.1, Mesos 0.18.2

2014-07-25 Thread Gary Malouf
Maybe this is me misunderstanding the Spark system property behavior, but I'm not clear why the class being loaded ends up having '/' rather than '.' in its fully qualified name. When I tested this out locally, the '/' characters were preventing the class from being loaded. On Fri, Jul 25, 2014 at 2:27