Re: Bug in Accumulators...

2014-11-23 Thread Aaron Davidson
As Mohit said, making Main extend Serializable should fix this example. In general, it's not a bad idea to also mark the fields you don't want to serialize (e.g., sc and conf here) as @transient, though that is not the issue in this case. Note that this problem would not have arisen in
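A minimal sketch of the pattern described above (object and app names hypothetical): the enclosing object extends Serializable, while driver-only handles such as the SparkContext are marked @transient so they stay out of serialized task closures:

    import org.apache.spark.{SparkConf, SparkContext}

    object Main extends Serializable {
      // Driver-only handles; @transient keeps them out of serialized closures.
      @transient val conf = new SparkConf().setAppName("accumulator-example")
      @transient val sc = new SparkContext(conf)

      def run(): Unit = {
        val acc = sc.accumulator(0)
        sc.parallelize(1 to 100).foreach(x => acc += x)
        println(acc.value) // 5050
      }
    }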

Re: Bug in Accumulators...

2014-11-23 Thread Sean Owen
Here, the Main object is not meant to be serialized. @transient ought to be for fields within an object that is legitimately supposed to be serialized, but whose value can be recreated on deserialization. I feel like marking objects that aren't logically Serializable as such is a hack,
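A minimal sketch of the legitimate use Sean describes, using MessageDigest as an example of a genuinely non-serializable member: the enclosing class really is serialized, and the @transient lazy val is rebuilt on first use after each deserialization:

    import java.security.MessageDigest

    class Hasher extends Serializable {
      // MessageDigest is not serializable; lazy val recreates it on the
      // deserializing side the first time hash() is called.
      @transient lazy val md = MessageDigest.getInstance("SHA-256")
      def hash(s: String): Array[Byte] = md.digest(s.getBytes("UTF-8"))
    }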

Re: Spark serialization issues with third-party libraries

2014-11-23 Thread jatinpreet
Thanks Sean. I was actually using instances created elsewhere inside my RDD transformations, which, as I understand it, is against the Spark programming model. I was referred to a talk on UIMA and Spark integration from this year's Spark Summit, which had a workaround for this problem. I just had to make
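A common shape for that workaround (AnalysisEngine is a hypothetical stand-in for a non-serializable third-party type, and rdd is assumed to be an RDD[String] of documents): construct the instance inside mapPartitions so it is created on the executor and never serialized:

    // Hypothetical stand-in for a non-serializable third-party class.
    class AnalysisEngine { def process(doc: String): Int = doc.length }

    val results = rdd.mapPartitions { iter =>
      val engine = new AnalysisEngine() // built per partition, on the executor
      iter.map(doc => engine.process(doc))
    }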

Re: Spark or MR, Scala or Java?

2014-11-23 Thread Sanjay Subramanian
I am a newbie to Spark as well, having programmed Hadoop/Hive/Oozie extensively before this. I use Hadoop (Java MR code)/Hive/Impala/Presto on a daily basis. To get myself jumpstarted into Spark I started this GitHub repo with IntelliJ-ready-to-run code (simple examples of join, Spark SQL, etc.) and

Spark SQL Programming Guide - registerTempTable Error

2014-11-23 Thread riginos
Hi guys, I'm trying to follow the Spark SQL Programming Guide, but after the: case class Person(name: String, age: Int) // Create an RDD of Person objects and register it as a table. val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))

Re: Spark or MR, Scala or Java?

2014-11-23 Thread Ashish Rangole
This being a very broad topic, a discussion can quickly get subjective. I'll try not to deviate from my experiences and observations, to keep this thread useful to those looking for answers. I have used Hadoop MR (with Hive, MR Java APIs, Cascading and Scalding) as well as Spark (since v0.6) in

Spark Streaming with Python

2014-11-23 Thread Venkat, Ankam
I am trying to run the network_wordcount.py example at https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py on the CDH5.2 Quickstart VM, and I am getting the error below. Traceback (most recent call last): File

Re: Spark SQL Programming Guide - registerTempTable Error

2014-11-23 Thread Denny Lee
By any chance are you using Spark 1.0.2? registerTempTable was introduced in Spark 1.1+; for Spark 1.0.2 it would be registerAsTable. On Sun Nov 23 2014 at 10:59:48 AM riginos samarasrigi...@gmail.com wrote: Hi guys, I'm trying to follow the Spark SQL Programming Guide but after the:
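For reference, a minimal sketch of the version difference, following the 1.0.x programming guide:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD

    case class Person(name: String, age: Int)
    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))

    people.registerAsTable("people")       // Spark 1.0.x
    // people.registerTempTable("people")  // Spark 1.1+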

Re: Spark SQL Programming Guide - registerTempTable Error

2014-11-23 Thread riginos
That was the problem! Thank you Denny for your fast response! Another quick question: is there any way to upgrade Spark to 1.1.0 quickly? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Programming-Guide-registerTempTable-Error-tp19591p19595.html

Python Logistic Regression error

2014-11-23 Thread Venkat, Ankam
Can you please suggest sample data for running logistic_regression.py? I am trying to use the sample data file at https://github.com/apache/spark/blob/master/data/mllib/sample_linear_regression_data.txt. I am running this on the CDH5.2 Quickstart VM. [cloudera@quickstart mllib]$ spark-submit
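One caveat worth noting: sample_linear_regression_data.txt has continuous labels, while logistic regression expects binary 0/1 labels; the repository's data/mllib/sample_libsvm_data.txt is binary-labeled. A minimal Scala sketch of the same idea (the Python example follows the same pattern):

    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

    // Binary-labeled (0/1) data, suitable for logistic regression.
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    val model = LogisticRegressionWithSGD.train(data, 100) // 100 iterations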

Re: Spark or MR, Scala or Java?

2014-11-23 Thread Ognen Duzlevski
On Sun, Nov 23, 2014 at 1:03 PM, Ashish Rangole arang...@gmail.com wrote: Java or Scala: I already knew Java, yet I learned Scala when I came across Spark. As others have said, you can get started with a little bit of Scala and learn more as you progress. Once you have started using Scala for a

Re: Spark SQL Programming Guide - registerTempTable Error

2014-11-23 Thread Denny Lee
It sort of depends on your environment. If you are running in your local environment, I would just download the latest Spark 1.1 binaries and you'll be good to go. If it's a production environment, it depends on how you are set up (e.g. AWS, Cloudera, etc.). On Sun Nov 23 2014 at 11:27:49

Converting a column to a map

2014-11-23 Thread Daniel Haviv
Hi, I have a column in my SchemaRDD that is a map, but I'm unable to convert it to one. I've tried converting it to a Tuple2[String,String]: val converted = jsonFiles.map(line => { line(10).asInstanceOf[Tuple2[String,String]] }) but I get a ClassCastException: 14/11/23 11:51:30 WARN
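A minimal sketch of the likely fix, assuming column 10 really holds a map<string,string>: a map-typed field comes out of a Row as a Map, not a Tuple2:

    val converted = jsonFiles.map { line =>
      // Map columns are materialized as scala.collection.Map, not Tuple2.
      line(10).asInstanceOf[scala.collection.Map[String, String]]
    }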

Creating a front-end for output from Spark/PySpark

2014-11-23 Thread Alaa Ali
Hello. Okay, so I'm working on a project to run analytic processing using Spark or PySpark. Right now, I connect to the shell and execute my commands. The very first part of my commands is: create an SQL JDBC connection and cursor to pull from Apache Phoenix, do some processing on the returned

Re: Error when Spark streaming consumes from Kafka

2014-11-23 Thread Bill Jay
Hi Dibyendu, thank you for the answer. I will try the Spark-Kafka consumer. Bill On Sat, Nov 22, 2014 at 9:15 PM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: I believe this is something to do with how the Kafka High Level API manages consumers within a consumer group and how it

Re: Creating a front-end for output from Spark/PySpark

2014-11-23 Thread Alex Kamil
Alaa, one option is to use Spark as a cache, importing a subset of data from HBase/Phoenix that fits in memory, and using JdbcRDD to get more data on a cache miss. The front end can be created with PySpark and Flask, either as a REST API translating JSON requests to the SparkSQL dialect, or simply

How to insert complex types like map<string,map<string,int>> in Spark SQL

2014-11-23 Thread critikaled
Hi, I am trying to insert a particular set of data from an RDD into a Hive table. I have a Map[String,Map[String,Int]] in Scala which I want to insert into a table with a map<string,map<string,int>> column. I was able to create the table, but while inserting it says scala.MatchError:

How to keep a local variable in each cluster?

2014-11-23 Thread zh8788
Hi, I am new to Spark; this is the first time I am posting here. Currently I am trying to implement ADMM optimization algorithms for Lasso/SVM, and I have come across a problem: since the training data (label, feature) is large, I created an RDD and cached the training data (label, feature) in memory.
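The usual tool for per-node read-only state like this is a broadcast variable: ship the shared ADMM state (e.g., the global z vector) to every node each iteration, compute local updates against the cached RDD, and aggregate on the driver. A minimal sketch, assuming the cached RDD is cachedTrainingData and with localAdmmStep and averageUpdates as hypothetical helpers:

    var z = Array.fill(numFeatures)(0.0)            // global ADMM variable
    for (iter <- 1 to maxIters) {
      val zBC = sc.broadcast(z)                     // read-only copy per executor
      val updates = cachedTrainingData.mapPartitions { part =>
        Iterator(localAdmmStep(part, zBC.value))    // hypothetical local x/u update
      }.collect()
      z = averageUpdates(updates)                   // hypothetical driver-side z-update
    }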

Re: Execute Spark programs from local machine on Yarn-hadoop cluster

2014-11-23 Thread Matt Narrell
I think this IS possible. You must set the HADOOP_CONF_DIR variable on the machine where you're running the Java process that creates the SparkContext. The Hadoop configuration specifies the YARN ResourceManager IPs, and Spark will use that configuration. mn On Nov 21, 2014, at 8:10 AM, Prannoy
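A minimal sketch, assuming HADOOP_CONF_DIR is exported in the environment of the launching JVM and points at the cluster's configuration (yarn-site.xml etc.):

    import org.apache.spark.{SparkConf, SparkContext}

    // HADOOP_CONF_DIR tells Spark where the YARN ResourceManager lives.
    val conf = new SparkConf()
      .setMaster("yarn-client")   // driver runs locally, executors on YARN
      .setAppName("remote-yarn-app")
    val sc = new SparkContext(conf)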

wholeTextFiles on 20 nodes

2014-11-23 Thread Simon Hafner
I have 20 nodes on EC2 and an application that reads the data via wholeTextFiles. I've tried to copy the data into Hadoop via copyFromLocal, and I get: 14/11/24 02:00:07 INFO hdfs.DFSClient: Exception in createBlockOutputStream 172.31.2.209:50010 java.io.IOException: Bad connect ack with

RE: SparkSQL Timestamp query failure

2014-11-23 Thread Wang, Daoyuan
Hi, I think you can try cast(l.timestamp as string)='2012-10-08 16:10:36.0'. Thanks, Daoyuan -Original Message- From: whitebread [mailto:ale.panebia...@me.com] Sent: Sunday, November 23, 2014 12:11 AM To: u...@spark.incubator.apache.org Subject: Re: SparkSQL Timestamp query failure

Re: Lots of small input files

2014-11-23 Thread Shixiong Zhu
We encountered a similar problem. If all partitions are located on the same node and all of the tasks run in less than 3 seconds (set by spark.locality.wait; the default value is 3000 ms), the tasks will run on that single node. Our solution is to use org.apache.hadoop.mapred.lib.CombineTextInputFormat to
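A minimal sketch of that approach (path hypothetical), using the old-API hadoopFile with CombineTextInputFormat so many small files are packed into fewer splits:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.lib.CombineTextInputFormat

    // One split can now cover many small files instead of one task per file.
    val lines = sc
      .hadoopFile[LongWritable, Text, CombineTextInputFormat]("hdfs:///data/small-files")
      .map(_._2.toString)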

Question about resource sharing in Spark Standalone

2014-11-23 Thread Patrick Liu
Dear all, currently I am running a Spark standalone cluster with ~100 nodes. Multiple users can connect to the cluster via the Spark shell or PySpark shell. However, I can't find an efficient way to control the resources among multiple users. I can set spark.deploy.defaultCores on the server side to
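One partial remedy in standalone mode (values hypothetical) is to cap each application in its own SparkConf, so a single shell cannot hold every core; spark.deploy.defaultCores only sets the default for apps that do not specify a cap:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      .set("spark.cores.max", "16")        // per-application core cap
      .set("spark.executor.memory", "4g")  // per-executor memory
    val sc = new SparkContext(conf)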

Re: Spark or MR, Scala or Java?

2014-11-23 Thread Krishna Sankar
Good point. On the positive side, whether we choose the most efficient mechanism in Scala might not be that important, since the Spark framework mediates the distributed computation. Even with the declarative parts of Spark, we can still choose an inefficient computation path that is not

Re: Spark or MR, Scala or Java?

2014-11-23 Thread Krishna Sankar
A very timely article: http://rahulkavale.github.io/blog/2014/11/16/scrap-your-map-reduce/ Cheers, k/ P.S.: Now replying to ALL. On Sun, Nov 23, 2014 at 7:16 PM, Krishna Sankar ksanka...@gmail.com wrote: Good point. On the positive side, whether we choose the most efficient mechanism in Scala might

Re: SparkSQL Timestamp query failure

2014-11-23 Thread Alessandro Panebianco
Hey Daoyuan, following your suggestion I obtain the same result as when I do: where l.timestamp = '2012-10-08 16:10:36.0'. What happens, using either your suggestion or simply single quotes as I typed in the example above, is that the query does not fail, but it doesn't return

RE: SparkSQL Timestamp query failure

2014-11-23 Thread Cheng, Hao
Can you try a query like “SELECT timestamp, CAST(timestamp as string) FROM logs LIMIT 5”? I guess you probably ran into the timestamp precision or timezone shifting problem. (And it's not mandatory, but you'd better change the field name from “timestamp” to something else, as “timestamp” is

2 spark streaming questions

2014-11-23 Thread tian zhang
Hi, dear Spark Streaming developers and users, we are prototyping with Spark Streaming and hit the following 2 issues on which I would like to seek your expertise. 1) We have a Spark Streaming application in Scala that reads data from Kafka into a DStream, does some processing and outputs a

Re: Spark or MR, Scala or Java?

2014-11-23 Thread Sanjay Subramanian
Thanks a ton Ashish. sanjay From: Ashish Rangole arang...@gmail.com To: Sanjay Subramanian sanjaysubraman...@yahoo.com Cc: Krishna Sankar ksanka...@gmail.com; Sean Owen so...@cloudera.com; Guillermo Ortiz konstt2...@gmail.com; user user@spark.apache.org Sent: Sunday, November 23, 2014

Re: SparkSQL Timestamp query failure

2014-11-23 Thread whitebread
Cheng, thanks. Thanks to you I found out that the problem, as you guessed, was a precision one: 2012-10-08 16:10:36 instead of 2012-10-08 16:10:36.0. Thanks again. Alessandro On Nov 23, 2014, at 11:10 PM, Cheng, Hao [via Apache Spark User List] ml-node+s1001560n19613...@n3.nabble.com wrote:
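For anyone hitting the same thing, a minimal sketch of the resolved query, assuming a registered table logs with a timestamp column ts whose values cast without a fractional part, as Alessandro found:

    // Matching the literal to the cast's actual precision returns rows;
    // the '.0'-suffixed literal did not.
    val hits = sqlContext.sql(
      "SELECT * FROM logs WHERE CAST(ts AS string) = '2012-10-08 16:10:36'")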