or that you are seeing.
There are several ways you could fix it. One way is to use a map before the
reduce, e.g.
rdd.map(lambda x: x[1]).reduce(lambda x, y: x + y)
Hope that's helpful,
Chris
-Original Message-
From: capitnfrak...@free.fr
Sent: 19 January 2022 02:41
To: us
The problem is that you are reducing a list of tuples, but you are
producing an int. The resulting int can't be combined with other tuples
with your function. reduce() has to produce the same type as its arguments.
rdd.map(lambda x: x[1]).reduce(lambda x,y: x+y)
... would work
On Tue, Jan 18, 2022
Hello
Please help take a look at why this simple reduce doesn't work?
rdd = sc.parallelize([("a",1),("b",2),("c",3)])
rdd.reduce(lambda x,y: x[1]+y[1])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/spark/python/pyspark/rdd.py", line 1001, in reduce
return reduce(f
Thanks Jerry for the clarification.
Ajay.
On Thu, Jul 11, 2019 at 12:48 PM Jerry Vinokurov
wrote:
Hi Ajay,
When a Spark SQL statement references a table, that table has to be
"registered" first. Usually the way this is done is by reading in a
DataFrame, then calling createOrReplaceTempView (or one of a few other
functions) on that data frame, with the argument being the name under which
you want the table to be available in SQL queries.
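As a minimal sketch of the registration step Jerry describes (the file path, view name, and query here are hypothetical):

// read a DataFrame, register it under a name, then reference that name in SQL
val events = spark.read.parquet("/data/events.parquet")   // hypothetical source
events.createOrReplaceTempView("events")

val counts = spark.sql("SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type")
counts.show()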
Sorry, I guess I hit the send button too soon.
This is a stand-alone Spark cluster. My understanding is that Spark is an
execution engine and not a storage layer.
Spark processes data in memory, but when someone refers to a Spark table
created through Spark SQL (DataFrame/RDD), what exactly are they referring to?
Could it be a Hive table? If yes, is it the same
Because of some legacy issues I can't immediately upgrade the Spark version. But I
tried to filter the data before loading it into Spark, based on the suggestion by
val df = sparkSession.read.format("jdbc").option(...).option("dbtable",
"(select .. from ... where url <> '') table_name").load()
df
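For reference, a fuller sketch of that dbtable-subquery read; the connection options are placeholders, not taken from the thread, and the column list simply mirrors the query used later in this thread:

val df = sparkSession.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")   // placeholder connection URL
  .option("user", "spark_user")                          // placeholder credentials
  .option("password", "...")
  .option("dbtable", "(select id, url, col_b from table_a where url <> '') AS filtered")
  .load()
df.show()

Because the filter lives inside the dbtable subquery, the rows with an empty url never leave the database.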
Hi James,
It is always advisable to use the latest Spark version. That said, can you
please give DataFrames and UDFs a try if possible? I think that would
be a much more scalable way to address the issue.
Also, where possible, it is always advisable to use the filter option
before fetching the data.
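A rough sketch of the DataFrame-plus-UDF approach suggested above; the table and column names come from James's query, while the transformation inside the UDF is just a placeholder:

import org.apache.spark.sql.functions.{col, udf}

// placeholder logic -- whatever mapping is actually needed on col_b goes here
val mapColB = udf { (v: String) => v.trim.toLowerCase }

val result = sparkSession
  .sql("SELECT id, url, col_b FROM table_a WHERE col_b <> ''")
  .withColumn("col_b_mapped", mapColB(col("col_b")))
result.show()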
I am very new to Spark. I just successfully set up Spark SQL connecting to a
postgresql database, and am able to display a table with the code
sparkSession.sql("SELECT id, url from table_a where col_b <> '' ").show()
Now I want to perform a filter and a map function on the col_b value. In plain Scala it
would
Hi Raghav,
Please refer to the following code:
SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("PersonApp");
// creating the java spark context
JavaSparkContext sc = new JavaSparkContext(sparkConf);
// reading the file from hdfs into a spark rdd; the name node is localhost
// (the hdfs path below is a placeholder for the real file location)
JavaRDD<String> personLines = sc.textFile("hdfs://localhost:9000/path/to/persons.txt");
Sorry, I forgot to ask how I can use the spark context here? I have the hdfs
directory path of the files, as well as the name node of the hdfs cluster.
Thanks for your help.
On Mon, Nov 21, 2016 at 9:45 PM, Raghav wrote:
> Hi
>
> I am extremely new to Spark. I have to read a file from HDFS, and get it
> in
Hi
I am extremely new to Spark. I have to read a file from HDFS, and get it in
memory in RDD format.
I have a Java class as follows:
class Person {
    private long UUID;
    private String FirstName;
    private String LastName;
    private String zip;
    // public methods
}
The file in HDFS
Integrate Spark with Apache Zeppelin (https://zeppelin.apache.org/); it is again a
very handy way to bootstrap with Spark.
An Elastic MapReduce cluster with Spark pre-installed is another option, but you'll
need to sign up for an AWS account.
From: ayan guha
Date: 2016-11-07 10:08
To: raghav
CC: user
Subject: Re: Newbie question - Best way to bootstrap with Spark
I would start with the Spark documentation, really. Then you would probably move on
to some older videos from YouTube, especially the Spark Summit 2014, 2015 and 2016
videos.
> Map Reduce but have not had a chance to get my
> hands dirty. There are tons of resources for Spark, but I am looking for
> some guidance for starter material, or videos.
>
> Thanks.
>
> Raghav
some guidance for starter material, or videos.
Thanks.
Raghav
I get a run time error as soon as new SparkConf() is called from main (running the
program from IntelliJ). Top few lines of the exception are pasted below.

These are the following versions:

Spark jar: spark-assembly-1.6.0-hadoop2.6.0.jar
pom: spark-core_2.11
1.6.0

I have added a dependency, and I have also added a library dependency in the project
structure.

Thanks for any help!

Vasu

Exception in thread "main" java.lang.NoSuchMethodError:
scala.Predef$.augmentString(Ljava/lang/String;)Ljava/lang/String;
at org.apache.spark.util.Utils$.<init>(Utils.scala:1682)
at org.apache.spark.util.Utils$.<clinit>(Utils.scala)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:59)
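A NoSuchMethodError on scala.Predef like this one usually means mixed Scala binary versions: the default pre-built 1.6.0 assembly is a Scala 2.10 build, while spark-core_2.11 targets Scala 2.11. A minimal sbt sketch with the versions kept consistent (coordinates illustrative only, not Vasu's actual build files):

// keep the Scala version and the Spark artifact suffix (_2.10 vs _2.11) in agreement
scalaVersion := "2.10.6"

// %% appends the Scala binary version, so this resolves to spark-core_2.10,
// which matches the Scala 2.10 assembly jar on the classpath
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"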
If the method is not final or static then you can override it.
On Jan 8, 2016 12:07 PM, yuliya Feldman wrote:
Hello,
I am new to Spark and have a most likely basic question - can I override a
method from SparkContext?
Thanks
Thank you
From: Deepak Sharma
To: yuliya Feldman
Cc: "user@spark.apache.org"
Sent: Thursday, January 7, 2016 10:41 PM
Subject: Re: Newbie question
Yes, you can do it unless the method is marked static/final. Most of the
methods in SparkContext are marked static so
You can try it.
Sent: Thursday, January 7, 2016 10:38 PM
Subject: Re: Newbie question
Why override a method from SparkContext?
On January 8, 2016, at 14:36, yuliya Feldman wrote:
Hello,
I am new to Spark and have a most likely basic question - can I override a
method from SparkContext?
Thanks
Yes, you can do it unless the method is marked static/final.
Most of the methods in SparkContext are marked static, so you definitely can't
override those; otherwise an override would usually work.
Thanks
Deepak
On Fri, Jan 8, 2016 at 12:06 PM, yuliya Feldman wrote:
> Hello,
>
> I am new to Spark and
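A minimal sketch of what such an override looks like, assuming the chosen method is neither final nor private; setLogLevel is used purely as an illustration:

import org.apache.spark.{SparkConf, SparkContext}

class AuditingSparkContext(conf: SparkConf) extends SparkContext(conf) {
  // hypothetical extra behaviour: note every requested log-level change, then delegate
  override def setLogLevel(logLevel: String): Unit = {
    println("setLogLevel called with " + logLevel)
    super.setLogLevel(logLevel)
  }
}

val sc = new AuditingSparkContext(new SparkConf().setAppName("override-demo").setMaster("local[2]"))
sc.setLogLevel("WARN")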
Hello,
I am new to Spark and have a most likely basic question - can I override a
method from SparkContext?
Thanks
1) Spark only needs to shuffle when data needs to be partitioned around the
workers in an all-to-all fashion.
2) Multi-stage jobs that would normally require several MapReduce jobs, and
therefore dump data to disk between the jobs, can instead keep their intermediate
results cached in memory (see the sketch below).
This blog outlines a few things that make Spark faster than MapReduce -
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
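A small sketch of point 2 above (the input path and the two follow-up computations are hypothetical): an intermediate RDD is reused twice; with cache() the second use reads it from memory instead of recomputing it, which is where Spark typically beats a chain of separate MapReduce jobs.

val lines  = sc.textFile("hdfs:///data/input.txt")                        // hypothetical input
val counts = lines.flatMap(_.split("\\s+")).map(w => (w, 1)).reduceByKey(_ + _).cache()

val totalWords    = counts.map(_._2).sum()   // first action materialises and caches counts
val distinctWords = counts.count()           // second action reuses the cached partitions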
On Fri, Aug 7, 2015 at 9:13 AM, Muler wrote:
> Consider the classic word count application over a 4 node cluster with a
> sizable working data. What makes Spark
Consider the classic word count application over a 4-node cluster with a
sizable working data set. What makes Spark run faster than MapReduce,
considering that Spark also has to write to disk during shuffle?
Thanks!
Yes, ultimately the shuffle data will be written to disk for the reduce stage to pull,
no matter how large you set the shuffle memory fraction.
Thanks
Saisai
On Thu, Aug 6, 2015 at 7:50 AM, Muler wrote:
> thanks, so if I have enough large memory (with enough
> spark.shuffle.memory) then shuffle (in-memory
thanks, so if I have a large enough memory (with enough spark.shuffle.memory),
then shuffle spill doesn't happen (per node), but the shuffle data still has to be
ultimately written to disk so that the reduce stage pulls it across the network?
On Wed, Aug 5, 2015 at 4:40 PM, Saisai Shao wrote:
Hi Muler,
Shuffle data will be written to disk no matter how much memory you have; large
memory can alleviate shuffle spill, where temporary files are generated when memory
is not enough.
Yes, each node writes its shuffle data to file, and it is pulled from disk in the
reduce stage through the network framework
Hi,
Consider I'm running WordCount with 100m of data on a 4-node cluster.
Assuming my RAM size on each node is 200g and I'm giving my executors 100g
(just enough memory for the 100m of data):
1. If I have enough memory, can Spark 100% avoid writing to disk?
2. During shuffle, where results have to be
You need to change `== 1` to `== i`. `println(t)` happens on the
workers, which may not be what you want. Try the following:
noSets.filter(t => model.predict(Utils.featurize(t)) ==
i).collect().foreach(println)
-Xiangrui
On Sat, Mar 7, 2015 at 3:20 PM, Pierce Lamb
wrote:
> Hi all,
>
> I'm very
Hi all,
I'm very new to machine learning algorithms and Spark. I'm following the
Twitter Streaming Language Classifier found here:
http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/README.html
Specifically this code:
http://databricks.gitbooks.io/data
Hello Mixtou, if you want to look at partition ID, I believe you want to use
mapPartitionsWithIndex
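A tiny sketch of that suggestion (the data and partition count are made up):

val data = sc.parallelize(1 to 10, 3)
val withPartitionId = data.mapPartitionsWithIndex { (partitionId, iter) =>
  // tag every element with the id of the partition it lives in
  iter.map(x => (partitionId, x))
}
withPartitionId.collect().foreach(println)   // e.g. (0,1), (0,2), (1,4), ...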
;
}

def estimateGuaranteedFrequentWords(): Unit = {
  frequent_words_counters.foreach { tuple =>
    if (tuple._2(0) - tuple._2(1) < words_no * fi) {
      guaranteed_words -= tuple._1;
    }
    else {
      System.out.println("Guaranteed Word : " + tuple._1 + " with co
learning process :-)
Plus IMHO, if you are planning on learning Spark, I would say YES to Scala and NO
to Java. Yes, it's a different paradigm, but having been a Java and Hadoop programmer
for many years, I am excited to learn Scala as the language and use Spark. It's
exciting.
regards
sanjay
From: Aniket Bh
Go through spark API documentation. Basically you have to do group by
(date, message_type) and then do a count.
On Sun, Jan 4, 2015, 9:58 PM Dinesh Vallabhdas
wrote:
A spark cassandra newbie question. Appreciate the help.
I have a cassandra table with 2 columns, message_timestamp (timestamp) and
message_type (text). The data is of the form
2014-06-25 12:01:39 "START"
2014-06-25 12:02:39 "START"
2014-06-25 12:02:39 "PAUSE"
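A rough sketch of the group-by-and-count suggestion above, using the connector's DataFrame source (this assumes Spark 1.5+ and a recent spark-cassandra-connector; the keyspace and table names are made up):

import org.apache.spark.sql.functions.{col, to_date}

val messages = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "messages"))   // hypothetical names
  .load()

// derive the calendar day from the timestamp, then count per (day, message_type)
val countsByDay = messages
  .withColumn("day", to_date(col("message_timestamp")))
  .groupBy("day", "message_type")
  .count()
countsByDay.show()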
Hi guys,
I'm planning to use Spark on a project and I'm facing a problem: I
couldn't find a log that explains what's wrong with what I'm doing.
I have 2 VMs that run a small hadoop (2.6.0) cluster. I added a file that
has 50 lines of json data.
Compiled Spark, all tests passed, I ran some si
[warn] Note: Unresolved dependencies path:
[warn]   org.apache.spark:spark-core_2.10:1.1.0 (/root/simple.sbt#L7-8)
[warn]     +- simple-project:simple-project_2.10:1.0
sbt.ResolveException: unresolved dependency:
org.apache.spark#spark-core_2.10;1.1.0: not found

What am I doing wrong?

Regards Hans-Peter
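For comparison, a minimal simple.sbt that declares this dependency looks roughly like the following; the project name, versions, and Scala version are illustrative, not taken from Hans-Peter's actual file:

name := "simple-project"

version := "1.0"

scalaVersion := "2.10.4"

// spark-core_2.10 1.1.0 is published to Maven Central, which sbt resolves by default
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"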
Hi All,
In a Java-based scenario where we have a large Oracle DB and want to use
Spark to do some distributed analysis on the data -- in such a
case, how exactly do we go about defining a JDBC connection and querying the
data?
thanks,
--
Ahmed Osama Ibrahim
ITSC International Technology
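One common pattern is Spark's JDBC data source. A sketch in Scala (the DataFrameReader calls look essentially the same from Java); the connection URL, credentials, driver, and query below are placeholders, and sparkSession is assumed to be an existing SparkSession:

val oracleDf = sparkSession.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB")   // placeholder Oracle URL
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("user", "scott")                                    // placeholder credentials
  .option("password", "...")
  .option("dbtable", "(SELECT col_a, col_b FROM some_table) t")  // hypothetical query
  .load()

// the query runs in Oracle; Spark then distributes the returned rows for analysis
oracleDf.groupBy("col_a").count().show()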