As Mohit said, making Main extend Serializable should fix this example. In
general, it's not a bad idea to mark the fields you don't want to serialize
(e.g., sc and conf here) as @transient as well, though that is not the
issue in this case.
Note that this problem would not have arisen in
Here, the Main object is not meant to be serialized. @transient ought
to be for fields within an object that is legitimately supposed to be
serialized, but whose value can be recreated on deserialization. I feel
like marking objects that aren't logically Serializable as such is a hack,
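For reference, a minimal sketch of the pattern under discussion (the object and field names are illustrative, not from the original code):

import org.apache.spark.{SparkConf, SparkContext}

object Main extends Serializable {
  @transient lazy val conf = new SparkConf().setAppName("example")
  @transient lazy val sc = new SparkContext(conf)
  val factor = 2 // serialized along with Main when a closure references it

  def run(): Unit = {
    // Referencing `factor` pulls Main into the closure; extending
    // Serializable makes that legal, and @transient keeps sc and conf
    // out of the serialized bytes.
    val doubled = sc.parallelize(1 to 10).map(_ * factor)
    println(doubled.reduce(_ + _))
  }
}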
Thanks Sean. I was actually using instances created elsewhere inside my RDD
transformations, which, as I understand it, is against the Spark programming
model. I was referred to a talk about UIMA and Spark integration from this
year's Spark Summit, which had a workaround for this problem. I just had to make
I am a newbie to Spark as well, having done extensive Hadoop/Hive/Oozie
programming before this. I use Hadoop (Java MR code)/Hive/Impala/Presto on a
daily basis. To get jumpstarted into Spark I started this GitHub repo where
there is IntelliJ-ready-to-run code (simple examples of join, Spark SQL, etc.) and
Hi guys,
I'm trying to follow the Spark SQL Programming Guide, but after the:
case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
This being a very broad topic, the discussion can quickly get subjective.
I'll try to stick to my own experiences and observations, to keep this
thread useful to those looking for answers.
I have used Hadoop MR (with Hive, the MR Java APIs, Cascading, and Scalding) as
well as Spark (since v0.6) in
I am trying to run the network_wordcount.py example at
https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py
on the CDH5.2 QuickStart VM, and I am getting the error below.
Traceback (most recent call last):
File
By any chance are you using Spark 1.0.2? registerTempTable was introduced
in Spark 1.1; on Spark 1.0.2 it would be registerAsTable.
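For illustration, a hedged sketch following the programming guide (the implicit conversion turns the RDD of Person into a SchemaRDD that can be registered):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

people.registerTempTable("people")   // Spark 1.1+
// people.registerAsTable("people")  // the Spark 1.0.x equivalent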
On Sun Nov 23 2014 at 10:59:48 AM riginos samarasrigi...@gmail.com wrote:
Hi guys,
I'm trying to follow the Spark SQL Programming Guide, but after the:
That was the problem! Thank you, Denny, for your fast response!
Another quick question:
Is there any way to update Spark to 1.1.0 quickly?
Can you please suggest sample data for running logistic_regression.py?
I am trying to use the sample data file at
https://github.com/apache/spark/blob/master/data/mllib/sample_linear_regression_data.txt
I am running this on the CDH5.2 QuickStart VM.
[cloudera@quickstart mllib]$ spark-submit
On Sun, Nov 23, 2014 at 1:03 PM, Ashish Rangole arang...@gmail.com wrote:
Java or Scala: I knew Java already, yet I learnt Scala when I came across
Spark. As others have said, you can get started with a little bit of Scala
and learn more as you progress. Once you have started using Scala for a
It sort of depends on your environment. If you are running in your local
environment, I would just download the latest Spark 1.1 binaries and you'll
be good to go. If it's a production environment, it sort of depends on how
you are set up (e.g., AWS, Cloudera, etc.)
On Sun Nov 23 2014 at 11:27:49
Hi,
I have a column in my schemaRDD that is a map, but I'm unable to convert it
to a map. I've tried converting it to a Tuple2[String,String]:
val converted = jsonFiles.map(line => {
  line(10).asInstanceOf[Tuple2[String,String]]
})
but I get ClassCastException:
14/11/23 11:51:30 WARN
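A hedged sketch of an alternative worth trying: in Spark SQL 1.x, a MapType column typically comes back from a Row as a scala.collection.Map rather than a Tuple2, so the cast would be:

val converted = jsonFiles.map(line =>
  line(10).asInstanceOf[scala.collection.Map[String, String]]
)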
Hello. Okay, so I'm working on a project to run analytic processing using
Spark or PySpark. Right now, I connect to the shell and execute my
commands. The very first part of my commands is: create an SQL JDBC
connection and cursor to pull from Apache Phoenix, do some processing on
the returned
Hi Dibyendu,
Thank you for your answer. I will try the Spark-Kafka consumer.
Bill
On Sat, Nov 22, 2014 at 9:15 PM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
I believe this has something to do with how the Kafka High Level API manages
consumers within a consumer group and how it
Alaa,
one option is to use Spark as a cache, importing the subset of data from
HBase/Phoenix that fits in memory, and using JdbcRDD to get more data on a
cache miss. The front end can be created with PySpark and Flask, either as a
REST API translating JSON requests to the Spark SQL dialect, or simply
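A hedged Scala sketch of the cache-miss path described (the JDBC URL, query, table, and bounds are illustrative assumptions; the Phoenix driver must be on the classpath):

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// JdbcRDD requires a query with two '?' placeholders for the partition bounds.
val onMiss = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:phoenix:localhost"),
  "SELECT id, payload FROM events WHERE id >= ? AND id <= ?",
  1L, 1000000L,  // lower and upper bound of the key range
  10,            // number of partitions
  (rs: ResultSet) => (rs.getLong(1), rs.getString(2))
)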
Hi,
I am trying to insert a particular set of data from an RDD into a Hive table.
I have a Map[String,Map[String,Int]] in Scala which I want to insert into a
table of map<string,map<string,int>>. I was able to create the table, but
while inserting it says scala.MatchError:
Hi,
I am new to Spark; this is the first time I am posting here. Currently, I am
trying to implement ADMM optimization algorithms for Lasso/SVM, and I have
come across a problem:
Since the training data (label, feature) is large, I created an RDD and
cached the training data (label, feature) in memory.
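For context, a hedged sketch of that setup (the path and the space-separated parsing are assumptions for illustration):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Parse (label, feature) pairs and cache them, since ADMM iterates over
// the same training set many times.
val training = sc.textFile("hdfs:///data/train.txt")
  .map { line =>
    val parts = line.split(' ')
    LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
  }
  .cache()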
I think this IS possible?
You must set the HADOOP_CONF_DIR variable on the machine where you're running
the Java process that creates the SparkContext. The Hadoop configuration
specifies the YARN ResourceManager addresses, and Spark will use that configuration.
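For example, a hedged sketch (the conf directory and app name are assumptions):

import org.apache.spark.{SparkConf, SparkContext}

// The environment of the JVM that builds the SparkContext must contain
// HADOOP_CONF_DIR, e.g. exported before launch:
//   export HADOOP_CONF_DIR=/etc/hadoop/conf
// Spark then finds the YARN ResourceManager from that configuration.
val conf = new SparkConf().setMaster("yarn-client").setAppName("example")
val sc = new SparkContext(conf)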
mn
On Nov 21, 2014, at 8:10 AM, Prannoy
I have 20 nodes on EC2 and an application that reads the data via
wholeTextFiles. I've tried to copy the data into Hadoop via
copyFromLocal, and I get
14/11/24 02:00:07 INFO hdfs.DFSClient: Exception in
createBlockOutputStream 172.31.2.209:50010 java.io.IOException: Bad
connect ack with
Hi,
I think you can try
cast(l.timestamp as string)='2012-10-08 16:10:36.0'
Thanks,
Daoyuan
-----Original Message-----
From: whitebread [mailto:ale.panebia...@me.com]
Sent: Sunday, November 23, 2014 12:11 AM
To: u...@spark.incubator.apache.org
Subject: Re: SparkSQL Timestamp query failure
We encountered a similar problem. If all partitions are located on the same
node and all of the tasks run in less than 3 seconds (set by
spark.locality.wait; the default value is 3000 ms), the tasks will all run
on that single node. Our solution is
using org.apache.hadoop.mapred.lib.CombineTextInputFormat to
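Roughly, a hedged sketch of that workaround (the input path is an illustrative assumption, and the combined split size may need tuning via mapred.max.split.size):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.lib.CombineTextInputFormat

// Read with CombineTextInputFormat so input splits combine blocks and the
// resulting tasks are not all pinned to one node's data.
val lines = sc.hadoopFile(
  "hdfs:///data/input",
  classOf[CombineTextInputFormat],
  classOf[LongWritable],
  classOf[Text]
).map(_._2.toString)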
Dear all,
Currently, I am running a Spark standalone cluster with ~100 nodes.
Multiple users can connect to the cluster via spark-shell or the PySpark shell.
However, I can't find an efficient way to control the resources among multiple
users.
I can set spark.deploy.defaultCores on the server side to
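To complement that server-side default, a hedged sketch of the per-application cap each user can set on a standalone cluster (the value is illustrative):

import org.apache.spark.SparkConf

// spark.cores.max bounds how many cores this one application may take,
// regardless of the server-side default.
val conf = new SparkConf()
  .setAppName("shared-cluster-job")
  .set("spark.cores.max", "16")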
Good point.
On the positive side, whether we choose the most efficient mechanism in
Scala might not be as important, as the Spark framework mediates the
distributed computation. Even if there is some declarative part in Spark,
we can still choose an inefficient computation path that is not
A very timely article
http://rahulkavale.github.io/blog/2014/11/16/scrap-your-map-reduce/
Cheers
k/
P.S: Now reply to ALL.
On Sun, Nov 23, 2014 at 7:16 PM, Krishna Sankar ksanka...@gmail.com wrote:
Good point.
On the positive side, whether we choose the most efficient mechanism in
Scala might
Hey Daoyuan,
following your suggestion I obtain the same result as when I do:
where l.timestamp = '2012-10-08 16:10:36.0'
With either your suggestion or simply single quotes, as I just typed in the
example above, the query does not fail but it doesn't
return
Can you try a query like "SELECT timestamp, CAST(timestamp AS string) FROM logs
LIMIT 5"? I guess you probably ran into a timestamp precision or timezone
shifting problem.
(And it's not mandatory, but you'd better change the field name from
"timestamp" to something else, as "timestamp" is
Hi, Dear Spark Streaming Developers and Users,
We are prototyping with Spark Streaming and hit the following 2 issues that I
would like to seek your expertise on.
1) We have a Spark Streaming application in Scala that reads data from Kafka
into a DStream, does some processing, and outputs a
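A hedged sketch of that first stage (the ZooKeeper quorum, group id, topic map, and batch interval are illustrative assumptions):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Read Kafka into a DStream with the high-level consumer, then process it.
val ssc = new StreamingContext(sc, Seconds(10))
val lines = KafkaUtils.createStream(
  ssc, "zk1:2181", "prototype-group", Map("events" -> 1)
).map(_._2) // keep the message value, drop the key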
Thanks a ton Ashish
sanjay
From: Ashish Rangole arang...@gmail.com
To: Sanjay Subramanian sanjaysubraman...@yahoo.com
Cc: Krishna Sankar ksanka...@gmail.com; Sean Owen so...@cloudera.com;
Guillermo Ortiz konstt2...@gmail.com; user user@spark.apache.org
Sent: Sunday, November 23, 2014
Thanks Cheng,
thanks to you I found out that the problem, as you guessed, was a precision one:
2012-10-08 16:10:36 instead of 2012-10-08 16:10:36.0
Thanks again.
Alessandro
On Nov 23, 2014, at 11:10 PM, Cheng, Hao [via Apache Spark User List]
ml-node+s1001560n19613...@n3.nabble.com wrote: