Re: Error joining dataframes

2016-05-18 Thread Takeshi Yamamuro
Ah, yes. `df_join` has two `id` columns, so you need to select which `id` you use:

scala> :paste
// Entering paste mode (ctrl-D to finish)
val df1 = Seq((1, 0), (2, 0)).toDF("id", "A")
val df2 = Seq((2, 0), (3, 0)).toDF("id", "B")
val df3 = df1.join(df2, df1("id") === df2("id"), "outer")
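A minimal sketch of the disambiguation step described above, picking each id from its originating DataFrame explicitly (the alias names are just illustrative):

    // keep one id per side under unambiguous names
    val resolved = df3.select(df1("id").as("id_left"), df2("id").as("id_right"), df1("A"), df2("B"))
    resolved.show()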

Re: SPARK - DataFrame for BulkLoad

2016-05-18 Thread Takeshi Yamamuro
Hi, Have you checked this? http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3ccacyzca3askwd-tujhqi1805bn7sctguaoruhd5xtxcsul1a...@mail.gmail.com%3E // maropu On Wed, May 18, 2016 at 1:14 PM, Mohanraj Ragupathiraj < mohanaug...@gmail.com> wrote: > I have 100 million records to

Re: Error joining dataframes

2016-05-18 Thread ram kumar
If I run it as val rs = s.join(t, "time_id").join(c, "channel_id"), it is treated as an inner join. On Wed, May 18, 2016 at 2:31 AM, Mich Talebzadeh wrote: > pretty simple, a similar construct to tables projected as DF > > val c =
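For reference, a minimal sketch of forcing an outer join instead, using the join-expression overload available in Spark 1.5 (this assumes both key columns live on s; adjust to wherever they actually are):

    val rs = s.join(t, s("time_id") === t("time_id"), "outer")
              .join(c, s("channel_id") === c("channel_id"), "outer")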

Re: Re: Re: How to change output mode to Update

2016-05-18 Thread Sachin Aggarwal
Sorry, my mistake, I gave the wrong id. Here is the correct one: https://issues.apache.org/jira/browse/SPARK-15183 On Wed, May 18, 2016 at 11:19 AM, Todd wrote: > Hi Sachin, > > Could you please give the URL of JIRA-15146? Thanks! > > At 2016-05-18 13:33:47, "Sachin Aggarwal"

Re: Error joining dataframes

2016-05-18 Thread Takeshi Yamamuro
You can use the API in spark-v1.6+. https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L454 // maropu On Wed, May 18, 2016 at 3:16 PM, ram kumar wrote: > I tried > > scala> var df_join = df1.join(df2, "Id",
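For context, a minimal sketch of that Spark 1.6+ overload (join on a named column plus an explicit join type), using the df1/df2 from earlier in the thread; unlike the expression form, the using-column form should leave only one id column in the output:

    // available from Spark 1.6
    val df_join = df1.join(df2, Seq("id"), "outer")
    df_join.printSchema()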

Re: Error joining dataframes

2016-05-18 Thread ram kumar
I tried df1.join(df2, df1("id") === df2("id"), "outer").show, but there is a duplicate "id", and when I query the "id" I get:

Error: org.apache.spark.sql.AnalysisException: Reference 'Id' is ambiguous, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0)

I am currently using spark 1.5.2. Is

Re: Error joining dataframes

2016-05-18 Thread ram kumar
When you register a temp table from the dataframe, e.g.:

var df_join = df1.join(df2, df1("id") === df2("id"), "outer")
df_join.registerTempTable("test")
sqlContext.sql("select * from test")

+----+----+----+----+
|  id|   A|  id|   B|
+----+----+----+----+
|   1|   0|null|null|
|   2|   0|   2|

Re: Error joining dataframes

2016-05-18 Thread ram kumar
I tried

scala> var df_join = df1.join(df2, "Id", "fullouter")
:27: error: type mismatch;
 found   : String("Id")
 required: org.apache.spark.sql.Column
       var df_join = df1.join(df2, "Id", "fullouter")
                                   ^

scala>

And I can't see the above method in

Re: Can Pyspark access Scala API?

2016-05-18 Thread Ted Yu
Not sure if you have seen this (for 2.0): [SPARK-15087][CORE][SQL] Remove AccumulatorV2.localValue and keep only value. Can you tell us your use case? On Tue, May 17, 2016 at 9:16 PM, Abi wrote: > Can Pyspark access Scala API? The accumulator in PySpark does not

Re: Can Pyspark access Scala API?

2016-05-18 Thread Abi
Thanks for that. But the question is more general: can PySpark access Scala somehow? On May 18, 2016 3:53:50 PM EDT, Ted Yu wrote: >Not sure if you have seen this (for 2.0): > >[SPARK-15087][CORE][SQL] Remove AccumulatorV2.localValue and keep only >value > >Can you tell us

Re: Unit testing framework for Spark Jobs?

2016-05-18 Thread Todd Nist
Perhaps these may be of some use: https://github.com/mkuthan/example-spark http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/ https://github.com/holdenk/spark-testing-base On Wed, May 18, 2016 at 2:14 PM, swetha kasireddy wrote: > Hi Lars, > > Do you have
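As a complement to those links, a minimal sketch of a self-contained ScalaTest suite that runs a job against a local SparkContext (class, app, and test names are made up; spark-testing-base wraps this boilerplate for you):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.{BeforeAndAfterAll, FunSuite}

    class WordCountSpec extends FunSuite with BeforeAndAfterAll {
      private var sc: SparkContext = _

      override def beforeAll(): Unit = {
        // local[2] keeps the test fast while still exercising more than one core
        sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("word-count-test"))
      }

      override def afterAll(): Unit = {
        if (sc != null) sc.stop()
      }

      test("counts words") {
        val counts = sc.parallelize(Seq("a b", "a"))
          .flatMap(_.split(" "))
          .map((_, 1))
          .reduceByKey(_ + _)
          .collectAsMap()
        assert(counts("a") == 2 && counts("b") == 1)
      }
    }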

Re: Is there a way to run a jar built for scala 2.11 on spark 1.6.1 (which is using 2.10?)

2016-05-18 Thread Ted Yu
Depending on the version of hadoop you use, you may find a tar ball prebuilt with Scala 2.11: https://s3.amazonaws.com/spark-related-packages FYI On Wed, May 18, 2016 at 3:35 PM, Koert Kuipers wrote: > no but you can trivially build spark 1.6.1 for scala 2.11 > > On Wed, May

Re: Is there a way to run a jar built for scala 2.11 on spark 1.6.1 (which is using 2.10?)

2016-05-18 Thread Koert Kuipers
No, but you can trivially build Spark 1.6.1 for Scala 2.11. On Wed, May 18, 2016 at 6:11 PM, Sergey Zelvenskiy wrote: > >

Spark UI metrics - Task execution time and number of records processed

2016-05-18 Thread Nirav Patel
Hi, I have been noticing that for shuffled tasks (groupBy, join) the reducer tasks are not evenly loaded. Most of them (90%) finish super fast, but there are some outliers that take much longer, as you can see from the "Max" value in the following metric. The metric is from a join operation done on two RDDs. I

Does Structured Streaming support Kafka as data source?

2016-05-18 Thread Todd
Hi, I had a brief look at the Spark code, and it looks like structured streaming doesn't support Kafka as a data source yet?

How to perform reduce operation in the same order as partition indexes

2016-05-18 Thread Pulasthi Supun Wickramasinghe
Hi Devs/All, I am pretty new to Spark. I have a program which does some map-reduce operations with matrices. Here shortrddFinal is of type RDD[Array[Short]] and consists of several partitions:

var BC = shortrddFinal.mapPartitionsWithIndex(calculateBCInternal).reduce(mergeBC)

The map
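reduce makes no ordering guarantee, so if the merge must happen in ascending partition order, one option is a sketch like the following, where the per-partition computation and the merge function are simple stand-ins for calculateBCInternal/mergeBC (names and logic are illustrative, not the original program):

    import org.apache.spark.{SparkConf, SparkContext}

    object OrderedReduceSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local[3]").setAppName("ordered-reduce"))
        val shortRdd = sc.parallelize(
          Seq(Array[Short](1, 2), Array[Short](3), Array[Short](4, 5)), numSlices = 3)

        // Keep the partition index next to each partial result.
        val partials = shortRdd.mapPartitionsWithIndex { (idx, it) =>
          val partial = it.map(_.length).sum            // stand-in for calculateBCInternal
          Iterator((idx, partial))
        }.collect()                                     // one small element per partition

        // Merge on the driver in ascending partition-index order.
        val bc = partials.sortBy(_._1).map(_._2).reduce(_ + _)  // stand-in for mergeBC
        println(bc)
        sc.stop()
      }
    }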

RE: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-18 Thread Ramaswamy, Muthuraman
I am using Spark 1.6.1 and Kafka 0.9+. It works for both receiver and receiver-less mode. One thing I noticed: when you specify an invalid topic name, KafkaUtils doesn't fetch any messages. So, check that you have specified the topic name correctly. ~Muthu From:
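For reference, a minimal sketch of the receiver-less (direct) stream setup under discussion, with the stock string decoders; the broker address and topic name are placeholders, and a Confluent Avro decoder would be substituted where StringDecoder appears:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object DirectStreamSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("direct-kafka"), Seconds(10))
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
        val topics = Set("my-topic")   // per the note above, a wrong topic name just yields no messages

        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topics)
        stream.map(_._2).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }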

Re: Accessing Cassandra data from Spark Shell

2016-05-18 Thread Cassa L
I tried all combinations of spark-cassandra connector. Didn't work. Finally, I downgraded spark to 1.5.1 and now it works. LCassa On Wed, May 18, 2016 at 11:11 AM, Mohammed Guller wrote: > As Ben mentioned, Spark 1.5.2 does work with C*. Make sure that you are > using

Re: Managed memory leak detected.SPARK-11293 ?

2016-05-18 Thread Ted Yu
Please increase the number of partitions. Cheers On Wed, May 18, 2016 at 4:17 AM, Serega Sheypak wrote: > Hi, please have a look at log snippet: > 16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Doing the fetch; > tracker endpoint = >
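A minimal sketch of two common ways to do that for an RDD job; the figures are placeholders to tune, not recommendations:

    import org.apache.spark.{SparkConf, SparkContext}

    object MorePartitionsSketch {
      def main(args: Array[String]): Unit = {
        // raise the default shuffle parallelism up front (400 is a placeholder)
        val conf = new SparkConf()
          .setAppName("more-partitions")
          .setMaster("local[*]")                       // master set only for a local try-out
          .set("spark.default.parallelism", "400")
        val sc = new SparkContext(conf)

        // ...or repartition the specific RDD right before the wide operation
        val pairs = sc.parallelize(1 to 1000000).map(i => (i % 1000, 1))
        val counts = pairs.repartition(400).reduceByKey(_ + _)
        println(counts.count())
        sc.stop()
      }
    }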

Re: spark udf can not change a json string to a map

2016-05-18 Thread Ted Yu
Please take a look at JavaUtils#mapAsSerializableJavaMap FYI On Mon, May 16, 2016 at 3:24 AM, 喜之郎 <251922...@qq.com> wrote: > > hi, Ted. > I found a built-in function called str_to_map, which can transform a string > to a map. > But it cannot meet my need. > > Because my string may be a map with
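As an illustration of the UDF route under discussion, a minimal sketch of a Scala UDF that turns a flat JSON object string into a Map[String, String]; it leans on scala.util.parsing.json rather than the original poster's code, so treat it as one possible shape, not the thread's solution (assumes a spark-shell session where sqlContext already exists):

    import org.apache.spark.sql.functions.udf

    // parse a JSON object like {"a": "1", "b": "2"} into a Map[String, String];
    // anything unparsable becomes an empty map
    val jsonToMap = udf { (s: String) =>
      scala.util.parsing.json.JSON.parseFull(s) match {
        case Some(m: Map[_, _]) => m.map { case (k, v) => k.toString -> v.toString }
        case _                  => Map.empty[String, String]
      }
    }

    val df = sqlContext.createDataFrame(Seq((1, """{"a": "1", "b": "2"}"""))).toDF("id", "json")
    df.select(df("id"), jsonToMap(df("json")).as("parsed")).show()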

Re: File not found exception while reading from folder using textFileStream

2016-05-18 Thread Ted Yu
The following should handle the situation you encountered:

diff --git a/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala b/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.sca
index ed93058..f79420b 100644
---

Spark Task not serializable with lag Window function

2016-05-18 Thread luca_guerra
I've noticed that after I use a window function over a DataFrame, if I call map() with a function, Spark returns a "Task not serializable" exception. This is my code:

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
import hc.implicits._
import org.apache.spark.sql.expressions.Window
import
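The code above is cut off; for orientation, a minimal sketch of the lag-over-window pattern named in the subject, reusing the hc from the snippet and made-up column names (it does not reproduce the serialization problem, only the window setup):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.lag

    val df = hc.createDataFrame(Seq(("a", 1, 10), ("a", 2, 20), ("b", 1, 30))).toDF("key", "ts", "value")

    // previous "value" within each "key" group, ordered by "ts"
    val w = Window.partitionBy("key").orderBy("ts")
    val withPrev = df.withColumn("prev_value", lag("value", 1).over(w))
    withPrev.show()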

Re: SPARK - DataFrame for BulkLoad

2016-05-18 Thread Ted Yu
Please see HBASE-14150 The hbase-spark module would be available in the upcoming hbase 2.0 release. On Tue, May 17, 2016 at 11:48 PM, Takeshi Yamamuro wrote: > Hi, > > Have you checked this? > >

Re: SPARK - DataFrame for BulkLoad

2016-05-18 Thread Michael Segel
Yes, but he's using Phoenix, which may not work cleanly with your HBase-Spark module. The key issue here may be Phoenix, which is separate from HBase. > On May 18, 2016, at 5:36 AM, Ted Yu wrote: > > Please see HBASE-14150 > > The hbase-spark module would be available

[Spark 2.0 state store] Streaming wordcount using spark state store

2016-05-18 Thread Shekhar Bansal
Hi, What is the right way of using the Spark 2.0 state store feature in Spark Streaming? I referred to the test cases in this pull request (https://github.com/apache/spark/pull/11645/files) and implemented word count using the state store. My source is Kafka (1 topic, 10 partitions). My data pump is pushing

Managed memory leak detected.SPARK-11293 ?

2016-05-18 Thread Serega Sheypak
Hi, please have a look at this log snippet:

16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://mapoutputtrac...@xxx.xxx.xxx.xxx:38128)
16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Got the output locations
16/05/18 03:27:16 INFO

HBase / Spark Kerberos problem

2016-05-18 Thread philipp.meyerhoefer
Hi all, I have been puzzling over a Kerberos problem for a while now and wondered if anyone can help. For spark-submit, I specify --keytab x --principal y, which creates my SparkContext fine. Connections to Zookeeper Quorum to find the HBase master work well too. But when it comes to a

Re: Error joining dataframes

2016-05-18 Thread Divya Gehlot
Can you try

var df_join = df1.join(df2, df1("Id") === df2("Id"), "fullouter").drop(df1("Id"))

On May 18, 2016 2:16 PM, "ram kumar" wrote: I tried scala> var df_join = df1.join(df2, "Id", "fullouter") :27: error: type mismatch; found : String("Id") required:

Re: Error joining dataframes

2016-05-18 Thread ram kumar
I tried it, e.g.:

df_join = df1.join(df2, df1("Id") === df2("Id"), "fullouter")

+----+----+----+----+
|  id|   A|  id|   B|
+----+----+----+----+
|   1|   0|null|null|
|   2|   0|   2|   0|
|null|null|   3|   0|
+----+----+----+----+

if I try, df_join = df1.join(df2, df1("Id")

File not found exception while reading from folder using textFileStream

2016-05-18 Thread Yogesh Vyas
Hi, I am trying to read files in a streaming way using Spark Streaming. For this I am copying files from my local folder to the source folder from which Spark reads the files. After reading and printing some of the files, it gives the following error: Caused by:
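For orientation, a minimal sketch of the textFileStream pattern being described, with a placeholder directory; Spark Streaming only picks up files that appear atomically in the monitored directory, so writing files elsewhere and then moving/renaming them in (rather than copying them in slowly) avoids a common cause of this kind of FileNotFoundException:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object TextFileStreamSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("file-stream"), Seconds(10))
        // placeholder path: the directory Spark Streaming watches for new files
        val lines = ssc.textFileStream("hdfs:///data/incoming")
        lines.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }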

Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-18 Thread Sean Owen
Late to the thread, but why is counting distinct elements over a 24-hour window not possible? You can certainly do it now, and I'd presume it's possible with structured streaming with a window. countByValueAndWindow should do it, right? The keys (with non-zero counts, I suppose) in a window are
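A minimal sketch of the countByValueAndWindow idea mentioned above; the distinct count over the window is then the number of keys with a non-zero count (the source, window length, slide interval, and checkpoint path are all placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    object DistinctOverWindowSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("distinct-window"), Minutes(10))
        ssc.checkpoint("hdfs:///tmp/checkpoints")  // required: the windowed count uses an incremental reduce

        val events = ssc.socketTextStream("localhost", 9999)  // placeholder source
        // count of each distinct value over the last 24 hours, sliding every 10 minutes
        val windowedCounts = events.countByValueAndWindow(Minutes(24 * 60), Minutes(10))
        // number of distinct values with a non-zero count in the current window
        windowedCounts.filter(_._2 > 0).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }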

Re: Error joining dataframes

2016-05-18 Thread Takeshi Yamamuro
Looks weird, it seems spark-v1.5.x can accept the query. What's the difference between the example and your query?

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

scala> :paste //

Re: Managed memory leak detected.SPARK-11293 ?

2016-05-18 Thread Serega Sheypak
Switching from snappy to lzf helped me: spark.io.compression.codec=lzf. Do you know why? :) I can't find an exact explanation... 2016-05-18 15:41 GMT+02:00 Ted Yu : > Please increase the number of partitions. > > Cheers > > On Wed, May 18, 2016 at 4:17 AM, Serega Sheypak

Re: Managed memory leak detected.SPARK-11293 ?

2016-05-18 Thread Ted Yu
According to: http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of the performance of snappy and lzf were on par with each other. Maybe lzf has a lower memory requirement. On Wed, May 18, 2016 at 7:22 AM, Serega Sheypak wrote: > Switching

Re: Managed memory leak detected.SPARK-11293 ?

2016-05-18 Thread Serega Sheypak
Ok, it happens only in YARN+cluster mode. It works with snappy in YARN+client mode. I've started to hit this problem when I switched to cluster mode. 2016-05-18 16:31 GMT+02:00 Ted Yu : > According to: > >

Submit python egg?

2016-05-18 Thread Darren Govoni
Hi, I have a python egg with a __main__.py in it. I am able to execute the egg by itself fine. Is there a way to just submit the egg to Spark and have it run? It seems an external .py script is needed, which would be unfortunate if true. Thanks

Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-18 Thread Mail.com
Adding back users. > On May 18, 2016, at 11:49 AM, Mail.com wrote: > > Hi Uladzimir, > > I run it as below. > > spark-submit --class com.test --num-executors 4 --executor-cores 5 --queue > Dev --master yarn-client --driver-memory 512M --executor-memory 512M test.jar

Re: 2 tables join happens at Hive but not in spark

2016-05-18 Thread Davies Liu
What do the schemas of the two tables look like? Could you also show the explain output of the query? On Sat, Feb 27, 2016 at 2:10 AM, Sandeep Khurana wrote: > Hello > > We have 2 tables (tab1, tab2) exposed using hive. The data is in different > hdfs folders. We are trying to join
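For reference, one way to produce that explain output from the Spark shell, as a one-line sketch (tab1/tab2 come from the quoted message; the join column is a placeholder):

    // extended = true also prints the analyzed and optimized logical plans
    sqlContext.sql("SELECT * FROM tab1 JOIN tab2 ON tab1.id = tab2.id").explain(true)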

Re: [Spark 2.0 state store] Streaming wordcount using spark state store

2016-05-18 Thread Michael Armbrust
The state store for structured streaming is an internal concept, and isn't designed to be consumed by end users. I'm hoping to write some documentation on how to do aggregation, but support for reading from Kafka and other sources will likely come in Spark 2.1+ On Wed, May 18, 2016 at 3:50 AM,

SLF4J binding error while running Spark using YARN as Cluster Manager

2016-05-18 Thread Anubhav Agarwal
Hi, I am having log4j trouble while running Spark using YARN as the cluster manager in CDH 5.3.3. I get the following error:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in

Re: SLF4J binding error while running Spark using YARN as Cluster Manager

2016-05-18 Thread Marcelo Vanzin
Hi Anubhav, This is happening because you're trying to use the configuration generated for CDH with upstream Spark. The CDH configuration will add extra needed jars that we don't include in our build of Spark, so you'll end up getting duplicate classes. You can either try to use a different

RE: Accessing Cassandra data from Spark Shell

2016-05-18 Thread Mohammed Guller
As Ben mentioned, Spark 1.5.2 does work with C*. Make sure that you are using the correct version of the Spark Cassandra Connector. Mohammed Author: Big Data Analytics with Spark From: Ben Slater

Re: Unit testing framework for Spark Jobs?

2016-05-18 Thread swetha kasireddy
Hi Lars, Do you have any examples for the methods that you described for Spark batch and Streaming? Thanks! On Wed, Mar 30, 2016 at 2:41 AM, Lars Albertsson wrote: > Thanks! > > It is on my backlog to write a couple of blog posts on the topic, and > eventually some example