Adding a new column to a temporary table

2016-05-17 Thread Mich Talebzadeh
Hi, Let us create a DF based on an existing table in Hive using spark-shell scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc) HiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@7c666865 // Go to correct database in Hive scala> HiveConte
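
A small, hedged sketch of one way to add a new column to a table exposed as a temporary table, in the same spark-shell style as the snippet above (the "sales" table and "amount" column are hypothetical):

  val hc = new org.apache.spark.sql.hive.HiveContext(sc)
  // load an existing Hive table as a DataFrame
  val df = hc.table("sales")
  // add a derived column and expose the result as a temporary table
  val withNew = df.withColumn("amount_with_tax", df("amount") * 1.2)
  withNew.registerTempTable("sales_tmp")
  hc.sql("SELECT amount, amount_with_tax FROM sales_tmp LIMIT 10").show()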

Re: JDBC SQL Server RDD

2016-05-17 Thread Suresh Thalamati
What is the error you are getting? At least on the main code line I see JDBCRDD is marked as private[sql]. A simple alternative might be to call SQL Server using the data frame API, and get an RDD from the data frame. e.g.: val df = sqlContext.read.jdbc("jdbc:sqlserver://usaecducc1ew1.ccgaco45mak.us-ea
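
A minimal sketch of that alternative, assuming a spark-shell with sqlContext available; the JDBC URL, table name and credentials below are placeholders, and the SQL Server JDBC driver jar is assumed to be on the classpath:

  import java.util.Properties

  val props = new Properties()
  props.setProperty("user", "myuser")            // placeholder credentials
  props.setProperty("password", "mypassword")
  props.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

  // read the SQL Server table through the DataFrame API ...
  val df = sqlContext.read.jdbc("jdbc:sqlserver://myhost:1433;databaseName=mydb", "dbo.mytable", props)
  // ... and drop down to an RDD[Row] only if RDD semantics are really needed
  val rowRdd = df.rdd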

Re: Code Example of Structured Streaming of 2.0

2016-05-17 Thread Ted Yu
Please take a look at: [SPARK-13146][SQL] Management API for continuous queries [SPARK-14555] Second cut of Python API for Structured Streaming On Mon, May 16, 2016 at 11:46 PM, Todd wrote: > Hi, > > Are there code examples about how to use the structured streaming feature? > Thanks. >

Re: Re: Code Example of Structured Streaming of 2.0

2016-05-17 Thread Todd
Thanks Ted! At 2016-05-17 16:16:09, "Ted Yu" wrote: Please take a look at: [SPARK-13146][SQL] Management API for continuous queries [SPARK-14555] Second cut of Python API for Structured Streaming On Mon, May 16, 2016 at 11:46 PM, Todd wrote: Hi, Are there code examples about how to

Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Todd
Hi, We have a requirement to do count(distinct) in a processing batch against all the streaming data (e.g., the last 24 hours' data); that is, when we do count(distinct), we actually want to compute the distinct count against the last 24 hours' data. Does Structured Streaming support this scenario? Thanks!

Re: pandas dataframe broadcasted. giving errors in datanode function called kernel

2016-05-17 Thread Jeff Zhang
The following sample code works for me. Could you share your code ? df = DataFrame([1,2,3]) df_b=sc.broadcast(df) def f(a): print(df_b.value) sc.parallelize(range(1,10)).foreach(f) On Sat, May 14, 2016 at 12:59 AM, abi wrote: > pandas dataframe is broadcasted successfully. giving errors in

Re: Spark crashes with Filesystem recovery

2016-05-17 Thread Jeff Zhang
I don't think this is related to file system recovery. spark.deploy.recoveryDirectory is a standalone configuration which takes effect in standalone mode, but you are in local mode. Can you just start pyspark using "bin/pyspark --master local[4]"? On Wed, May 11, 2016 at 3:52 AM, Imran Akbar wrote:

SparkR query

2016-05-17 Thread Mike Lewis
Hi, I have a SparkR driver process that connects to a master running on Linux, I’ve tried to do a simple test, e.g. sc <- sparkR.init(master="spark://my-linux-host.dev.local:7077", sparkEnvir=list(spark.cores.max="4")) x <- SparkR:::parallelize(sc,1:100,2) y <- count(x) But I can see that the

Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-17 Thread Jan Uyttenhove
I think that if the Confluent deserializer cannot fetch the schema for the avro message (key and/or value), you end up with no data. You should check the logs of the Schemaregistry, it should show the HTTP requests it receives so you can check if the deserializer can connect to it and if so, what t

Re: SparkR query

2016-05-17 Thread Sun Rui
Lewis, 1. Could you check the value of the “SPARK_HOME” environment variable on all of your worker nodes? 2. How did you start your SparkR shell? > On May 17, 2016, at 18:07, Mike Lewis wrote: > > Hi, > > I have a SparkR driver process that connects to a master running on Linux, > I’ve tried to do a simp

RE: SparkR query

2016-05-17 Thread Mike Lewis
Thanks, I’m just using RStudio. Running locally is fine; the only issue is having the cluster on Linux and the workers looking for a Windows path, which must be being passed through by the driver, I guess. I checked the spark-env.sh on each node and the appropriate SPARK_HOME is set correctly…. From: Sun Ru

Why can't Spark 1.6.0 use jar files stored on HDFS?

2016-05-17 Thread Serega Sheypak
hi, I'm trying to: 1. upload my app jar files to HDFS 2. run spark-submit with: 2.1. --master yarn --deploy-mode cluster or 2.2. --master yarn --deploy-mode client specifying --jars hdfs:///my/home/commons.jar,hdfs:///my/home/super.jar When spark job is submitted, SparkSubmit client outputs: Warn

Spark job is not running in yarn-cluster mode

2016-05-17 Thread spark.raj
Hi friends, I am running a Spark Streaming job in yarn-cluster mode but it is failing. It works fine in yarn-client mode, and the spark-examples also run fine in yarn-cluster mode. Below is the log file for the Spark Streaming job in yarn-cluster mode. Can anyone help me with this? SLF4J:

Re: Why can't Spark 1.6.0 use jar files stored on HDFS?

2016-05-17 Thread Jeff Zhang
Did you put your app jar on HDFS? The app jar must be on your local machine. On Tue, May 17, 2016 at 8:33 PM, Serega Sheypak wrote: > hi, I'm trying to: > 1. upload my app jar files to HDFS > 2. run spark-submit with: > 2.1. --master yarn --deploy-mode cluster > or > 2.2. --master yarn --deploy-

Re: Why can't Spark 1.6.0 use jar files stored on HDFS?

2016-05-17 Thread Serega Sheypak
No, and it looks like a problem. 2.2. --master yarn --deploy-mode client means: 1. submit Spark as a YARN app, but the spark-driver is started on the local machine. 2. I upload all dependent jars to HDFS and specify the jar HDFS paths in the --jars arg. 3. The driver runs my Spark application main class named "MySuperS

yarn-cluster mode error

2016-05-17 Thread spark.raj
Hi, I am getting the error below while running an application in yarn-cluster mode. ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM Can anyone suggest why I am getting this error message? Thanks Raj

Re: Why can't Spark 1.6.0 use jar files stored on HDFS?

2016-05-17 Thread Serega Sheypak
https://issues.apache.org/jira/browse/SPARK-10643 Looks like it's the reason... 2016-05-17 15:31 GMT+02:00 Serega Sheypak : > No, and it looks like a problem. > > 2.2. --master yarn --deploy-mode client > means: > 1. submit spark as yarn app, but spark-driver is started on local machine. > 2. A

Re: Why can't Spark 1.6.0 use jar files stored on HDFS?

2016-05-17 Thread spark.raj
Hi Serega, Create a jar including all the dependencies and execute it like below through a shell script /usr/local/spark/bin/spark-submit \  //location of your spark-submit --class classname \  //location of your main classname --master yarn \ --deploy-mode cluster \ /home/hadoop/SparkSamplePro

Re: Why can't Spark 1.6.0 use jar files stored on HDFS?

2016-05-17 Thread Serega Sheypak
spark-submit --conf "spark.driver.userClassPathFirst=true" --class com.MyClass --master yarn --deploy-mode client --jars hdfs:///my-lib.jar,hdfs:///my-second-lib.jar jar-with-com-MyClass.jar job_params 2016-05-17 15:41 GMT+02:00 Serega Sheypak : > https://issues.apache.org/jira/browse/SPARK-1064

Re: Why can't Spark 1.6.0 use jar files stored on HDFS?

2016-05-17 Thread Serega Sheypak
Hi, I know about that approach. I don't want to run a mess of classes from a single jar; I want to utilize the distributed cache functionality and ship the application jar and dependent jars explicitly. --deploy-mode client unfortunately copies and distributes all jars repeatedly for every spark job started fr

Re: Spark job is not running in yarn-cluster mode

2016-05-17 Thread ayan guha
it says: hdfs://namenode:54310/user/hadoop/.sparkStaging/application_1463479181441_0003/SparkTwittterStreamingJob-0.0.1-SNAPSHOT-jar-with-dependencies.jar java.io.FileNotFoundException: File does not exist: hdfs://namenode:54310/user/hadoop/.sparkStaging/application_1463479181441_0003/SparkTwitt

dataframe stat corr for multiple columns

2016-05-17 Thread Ankur Jain
Hello Team, In my current use case I am loading data from CSV using spark-csv and trying to correlate all variables. As of now, if we want to correlate two columns in a dataframe, df.stat.corr works great, but if we want to correlate multiple columns this won't work. In the case of R we can use corrplot a
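
One way to get a full correlation matrix rather than calling df.stat.corr pair by pair is to go through MLlib's Statistics.corr on an RDD[Vector]; a minimal sketch, assuming df is the CSV-backed DataFrame and that the listed columns (placeholder names) are numeric:

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.stat.Statistics

  val cols = Seq("col1", "col2", "col3")                 // hypothetical numeric columns
  val vecRdd = df.select(cols.map(df(_)): _*).rdd
    .map(row => Vectors.dense(cols.indices.map(i => row.getAs[Number](i).doubleValue()).toArray))

  // Pearson correlation matrix across all selected columns
  val corrMatrix = Statistics.corr(vecRdd, "pearson")
  println(corrMatrix)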

Re: Silly Question on my part...

2016-05-17 Thread Michael Segel
Thanks for the response. That’s what I thought, but I didn’t want to assume anything. (You know what happens when you ass u me … :-) Not sure about Tachyon though. Its a thought, but I’m very conservative when it comes to design choices. > On May 16, 2016, at 5:21 PM, John Trengrove >

Re: Silly Question on my part...

2016-05-17 Thread Dood
On 5/16/2016 12:12 PM, Michael Segel wrote: For one use case.. we were considering using the thrift server as a way to allow multiple clients to access shared RDDs. Within the Thrift Context, we create an RDD and expose it as a hive table. The question is… where does the RDD exist. On the Thrift

Re: Executor memory requirement for reduceByKey

2016-05-17 Thread Raghavendra Pandey
Even though it does not sound intuitive, reduceByKey expects all values for a particular key within a partition to be loaded into memory. So once you increase the number of partitions you can run the jobs.
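
A minimal sketch of that workaround, passing a larger numPartitions directly to reduceByKey so each partition holds fewer keys and values (the input path and partition count are illustrative only):

  // hypothetical input: key is the first comma-separated field of each line
  val pairs = sc.textFile("hdfs:///data/events")
    .map(line => (line.split(",")(0), 1L))

  // more partitions means fewer keys, and therefore fewer values, per partition during the shuffle
  val counts = pairs.reduceByKey(_ + _, 400)
  counts.count()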

Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Raghavendra Pandey
Can you please send me as well. Thanks Raghav On 12 May 2016 20:02, "Tom Ellis" wrote: > I would like to also Mich, please send it through, thanks! > > On Thu, 12 May 2016 at 15:14 Alonso Isidoro wrote: > >> Me too, send me the guide. >> >> Enviado desde mi iPhone >> >> El 12 may 2016, a las 12

Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Natu Lauchande
Hi Mich, I am also interested in the write up. Regards, Natu On Thu, May 12, 2016 at 12:08 PM, Mich Talebzadeh wrote: > Hi Al,, > > > Following the threads in spark forum, I decided to write up on > configuration of Spark including allocation of resources and configuration > of driver, executo

What's the best way to find the Nearest Neighbor row of a matrix with 10 billion rows x 300 columns?

2016-05-17 Thread Rex X
Each row of the given matrix is a Vector[Double]. We want to find the nearest neighbor row to each row using cosine similarity. The problem here is the complexity: O(10^20). We need to do blocking, and do the row-wise comparison within each block. Any tips for best practice? In Spark, we have

Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread rakesh sharma
It would be a rare doc. Please share. On Tue, May 17, 2016 at 9:14 AM -0700, "Natu Lauchande" wrote: Hi Mich, I am also interested in the write up. Regards, Natu On Thu, May 12, 2016 at 12:08 PM, Mich Talebzadeh m

Re: yarn-cluster mode error

2016-05-17 Thread Sandeep Nemuri
Can you post the complete stack trace? On Tue, May 17, 2016 at 7:00 PM, wrote: > Hi, > > I am getting the error below while running an application in yarn-cluster mode. > > ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM > > Can anyone suggest why I am getting this error message? > > Tha

Error joining dataframes

2016-05-17 Thread ram kumar
Hi, I tried to join two dataframes df_join = df1.join(df2, df1("Id") === df2("Id"), "fullouter") df_join.registerTempTable("join_test") When querying "Id" from "join_test" 0: jdbc:hive2://> select Id from join_test; Error: org.apache.spark.sql.AnalysisException: Reference 'Id' is amb

Re: Silly Question on my part...

2016-05-17 Thread Gene Pang
Hi Michael, Yes, you can use Alluxio to share Spark RDDs. Here is a blog post about getting started with Spark and Alluxio ( http://www.alluxio.com/2016/04/getting-started-with-alluxio-and-spark/), and some documentation ( http://alluxio.org/documentation/master/en/Running-Spark-on-Alluxio.html).
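
The pattern in those docs boils down to persisting the RDD to Alluxio from one application and reading it back from another; a rough sketch, with the Alluxio master host/port and path as placeholders and the Alluxio client jar assumed to be on the classpath of both applications:

  // application A: persist the RDD to Alluxio-backed storage
  val rdd = sc.parallelize(1 to 1000000)
  rdd.saveAsTextFile("alluxio://alluxio-master:19998/shared/numbers")

  // application B (a separate SparkContext): read the same data back
  val shared = sc.textFile("alluxio://alluxio-master:19998/shared/numbers")
  println(shared.count())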

Pls Assist: error when creating cluster on AWS using spark's ec2 scripts

2016-05-17 Thread Marco Mistroni
Hi, was wondering if anyone can assist here.. I am trying to create a Spark cluster on AWS using the scripts located in the spark-1.6.1/ec2 directory. When the spark_ec2.py script tries to do an rsync to copy directories over to the AWS master node, it fails miserably with this stack trace DEBUG:spark ecd

duplicate jar problem in yarn-cluster mode

2016-05-17 Thread satish saley
Hello, I am executing a simple code with yarn-cluster --master yarn-cluster --name Spark-FileCopy --class my.example.SparkFileCopy --properties-file spark-defaults.conf --queue saleyq --executor-memory 1G --driver-memory 1G --conf spark.john.snow.is.back=true --jars hdfs://myclusternn.com:8020/tmp

Inferring schema from GenericRowWithSchema

2016-05-17 Thread Andy Grove
Hi, I have a requirement to create types dynamically in Spark and then instantiate those types from Spark SQL via a UDF. I tried doing the following: val addressType = StructType(List( new StructField("state", DataTypes.StringType), new StructField("zipcode", DataTypes.IntegerType) )) sqlCo

Re: Inferring schema from GenericRowWithSchema

2016-05-17 Thread Michael Armbrust
I don't think that you will be able to do that. ScalaReflection is based on the TypeTag of the object, and thus the schema of any particular object won't be available to it. Instead I think you want to use the register functions in UDFRegistration that take a schema. Does that make sense? On Tue

Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Michael Armbrust
In 2.0 you won't be able to do this. The long term vision would be to make this possible, but a window will be required (like the 24 hours you suggest). On Tue, May 17, 2016 at 1:36 AM, Todd wrote: > Hi, > We have a requirement to do count(distinct) in a processing batch against > all the strea

Re: Inferring schema from GenericRowWithSchema

2016-05-17 Thread Andy Grove
Hmm. I see. Yes, I guess that won't work then. I don't understand what you are proposing about UDFRegistration. I only see methods that take tuples of various sizes (1 .. 22). On Tue, May 17, 2016 at 1:00 PM, Michael Armbrust wrote: > I don't think that you will be able to do that. ScalaReflec
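
The register functions being referred to appear to be the overloads that take a Java-style UDF (UDF1 .. UDF22) plus an explicit return DataType; a hedged sketch of registering a UDF that returns the dynamically built addressType (the string-parsing logic and function name are made up for illustration):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.api.java.UDF1
  import org.apache.spark.sql.types._

  val addressType = StructType(List(
    StructField("state", StringType),
    StructField("zipcode", IntegerType)))

  sqlContext.udf.register("parse_address", new UDF1[String, Row] {
    override def call(raw: String): Row = {
      val parts = raw.split(",")               // e.g. "CO,80301" -- hypothetical input format
      Row(parts(0), parts(1).trim.toInt)
    }
  }, addressType)

  // the UDF can now be used from SQL, and its result carries the dynamically built schema
  sqlContext.sql("SELECT parse_address('CO,80301') AS addr").printSchema()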

Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Mich Talebzadeh
Ok but how about something similar to val countByValueAndWindow = price.filter(_ > 95.0).countByValueAndWindow(Seconds(windowLength), Seconds(slidingInterval)) Using a new count => countDistinctByValueAndWindow? val countDistinctByValueAndWindow = price.filter(_ > 95.0).countDistinctByValueA

Re: Error joining dataframes

2016-05-17 Thread Bijay Kumar Pathak
Hi, Try this one: df_join = df1.join(df2, 'Id', "fullouter") Thanks, Bijay On Tue, May 17, 2016 at 9:39 AM, ram kumar wrote: > Hi, > > I tried to join two dataframes > > df_join = df1.join(df2, df1("Id") === df2("Id"), "fullouter") > > df_join.registerTempTable("join_test") > > > When

Re: Error joining dataframes

2016-05-17 Thread Mich Talebzadeh
pretty simple, a similar construct to tables projected as DF val c = HiveContext.table("channels").select("CHANNEL_ID","CHANNEL_DESC") val t = HiveContext.table("times").select("TIME_ID","CALENDAR_MONTH_DESC") val rs = s.join(t,"time_id").join(c,"channel_id") HTH Dr Mich Talebzadeh LinkedIn *

Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Mich Talebzadeh
Hi all, Many thanks for your tremendous interest in the forthcoming notes. I have had nearly thirty requests and many supporting kind words from the colleagues in this forum. I will strive to get the first draft ready as soon as possible. Apologies for not being more specific. However, hopefully

Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Femi Anthony
Please send it to me as well. Thanks Sent from my iPhone > On May 17, 2016, at 12:09 PM, Raghavendra Pandey > wrote: > > Can you please send me as well. > > Thanks > Raghav > >> On 12 May 2016 20:02, "Tom Ellis" wrote: >> I would like to also Mich, please send it through, thanks! >> >>>

Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Cesar Flores
Please send it to me too! Thanks!!! Cesar Flores On Tue, May 17, 2016 at 4:55 PM, Femi Anthony wrote: > Please send it to me as well. > > Thanks > > Sent from my iPhone > > On May 17, 2016, at 12:09 PM, Raghavendra Pandey < > raghavendra.pan...@gmail.com> wrote: > > Can you please send m

Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Abi
Please include me too On May 12, 2016 6:08:14 AM EDT, Mich Talebzadeh wrote: >Hi Al,, > > >Following the threads in spark forum, I decided to write up on >configuration of Spark including allocation of resources and >configuration >of driver, executors, threads, execution of Spark apps and gener

Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Chris Fregly
you can use HyperLogLog with Spark Streaming to accomplish this. here is an example from my fluxcapacitor GitHub repo: https://github.com/fluxcapacitor/pipeline/tree/master/myapps/spark/streaming/src/main/scala/com/advancedspark/streaming/rating/approx here's an accompanying SlideShare presentat
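
For a rough idea of the shape of such a solution without an external HLL library, Spark's built-in RDD.countApproxDistinct (HyperLogLog-based) can be applied to a 24-hour window; a minimal sketch, with the batch interval, window sizes, source and checkpoint path all illustrative, and with the caveat that a 24-hour window keeps a lot of raw data around:

  import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(30))
  ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")        // hypothetical path

  // stand-in source; in practice this would be the Kafka (or other) stream of ids
  val userIds = ssc.socketTextStream("localhost", 9999)

  // approximate distinct count over the last 24 hours, refreshed every 5 minutes
  userIds.window(Minutes(24 * 60), Minutes(5)).foreachRDD { rdd =>
    println(s"approx distinct ids in the last 24h: ${rdd.countApproxDistinct(0.01)}")
  }

  ssc.start()
  ssc.awaitTermination()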

Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Mich Talebzadeh
Thanks Chris, In a nutshell I don't think one can do that. So let us see. Here is my program that is looking for share prices > 95.9. It does work. It is pretty simple import org.apache.spark.SparkContext import org.apache.spark.SparkConf import org.apache.spark.sql.Row import org.apache.spark.

Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-17 Thread Ramaswamy, Muthuraman
Thank you for the input. Apparently, I was referring to an incorrect Schema Registry server. Once the correct Schema Registry server IP was used, the serializer worked for me. Thanks again, ~Muthu From: Jan Uyttenhove Reply-To: "j...@insidin.com" mai

Re: What's the best way to find the Nearest Neighbor row of a matrix with 10 billion rows x 300 columns?

2016-05-17 Thread nguyen duc tuan
There's no RowSimilarity method in the RowMatrix class. You have to transpose your matrix to use that method. However, when the number of rows is large, this approach is still very slow. Try to use approximate nearest neighbor (ANN) methods instead, such as LSH. There are several implementations of LSH on
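
A very small sketch of the random-hyperplane (sign-of-projection) flavour of LSH for cosine similarity: rows that share a hash signature land in the same block and only those are compared. The dimensions, plane count and generated input are illustrative only; real data at this scale would need multiple hash tables and more care:

  import scala.util.Random
  import org.apache.spark.rdd.RDD

  val dim = 300
  val numPlanes = 16                                   // signature length, illustrative
  val rnd = new Random(42)
  // random hyperplanes for sign-of-projection hashing, broadcast to the executors
  val planes = sc.broadcast(Array.fill(numPlanes)(Array.fill(dim)(rnd.nextGaussian())))

  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    dot / (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum))
  }

  // hypothetical input: (rowId, features); replace with the real matrix rows
  val rows: RDD[(Long, Array[Double])] =
    sc.parallelize(0L until 10000L, 8).map { id =>
      val r = new Random(id)
      (id, Array.fill(dim)(r.nextGaussian()))
    }

  // rows with the same signature fall into the same block ("bucket")
  val blocks = rows.map { case (id, v) =>
    val sig = planes.value.map(p => if (p.zip(v).map { case (a, b) => a * b }.sum >= 0) '1' else '0').mkString
    (sig, (id, v))
  }.groupByKey()

  // brute-force nearest-neighbour search, but only inside each block
  val nearest = blocks.flatMap { case (_, members) =>
    val m = members.toArray
    m.map { case (id, v) =>
      val candidates = m.filter(_._1 != id).map { case (oid, ov) => (oid, cosine(v, ov)) }
      (id, if (candidates.isEmpty) None else Some(candidates.maxBy(_._2)))
    }
  }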

How to run hive queries in async mode using spark sql

2016-05-17 Thread Raju Bairishetti
I am using Spark SQL for running Hive queries as well. Is there any way to run Hive queries in async mode using Spark SQL? Does it return any Hive handle, and if yes, how do I get the results from the Hive handle using Spark SQL? -- Thanks, Raju Bairishetti, www.lazada.com
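
One common way to get asynchronous behaviour is simply to wrap the blocking action in a Scala Future; a hedged sketch, assuming a spark-shell style sqlContext (or HiveContext) that can see the Hive table, with the query itself a placeholder:

  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.Future

  // sql() only builds the plan; the blocking collect() runs on a background thread
  val resultF: Future[Array[org.apache.spark.sql.Row]] = Future {
    sqlContext.sql("SELECT count(*) FROM some_hive_table").collect()   // hypothetical query
  }

  // register a callback instead of blocking the caller
  resultF.foreach(rows => println(s"query finished, first row: ${rows.headOption}"))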

Why did Spark 1.6.1 workers stop automatically and fail to register with the master: Worker registration failed: Duplicate worker ID?

2016-05-17 Thread sunday2000
Hi, A client worker stopped automatically with the error message below; do you know why this happened? 16/05/18 03:34:37 INFO Worker: Master with url spark://master:7077 requested this worker to reconnect. 16/05/18 03:34:37 INFO Worker: Not spawning another attempt to register with the master, since

Re: Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Todd
Thank you guys for the help. I will try. At 2016-05-18 07:17:08, "Mich Talebzadeh" wrote: Thanks Chris, In a nutshell I don't think one can do that. So let us see. Here is my program that is looking for share prices > 95.9. It does work. It is pretty simple import org.apache.spark.Sp

Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Vinayak Agrawal
Please include me too. Vinayak Agrawal Big Data Analytics IBM "To Strive, To Seek, To Find and Not to Yield!" ~Lord Alfred Tennyson > On May 17, 2016, at 2:15 PM, Mich Talebzadeh > wrote: > > Hi all, > > Many thanks for your tremendous interest in the forthcoming notes. I have had > nearly

How to use Kafka as data source for Structured Streaming

2016-05-17 Thread Todd
Hi, I am wondering whether Structured Streaming supports Kafka as a data source. I briefly went through the source code (mainly related to the DataSourceRegister trait) and didn't find any Kafka data source there. Thanks.

Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Jeff Zhang
I think you can write it in a gitbook and share it on the user mailing list so everyone can comment on it. On Wed, May 18, 2016 at 10:12 AM, Vinayak Agrawal < vinayakagrawa...@gmail.com> wrote: > Please include me too. > > Vinayak Agrawal > Big Data Analytics > IBM > > "To Strive, To Seek, To Find and

RE: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread 谭成灶
Thanks for your sharing! Please include me too. From: Mich Talebzadeh Sent: 2016/5/18 5:16 To: user @spark Subject: Re: My notes on Spark Performance & Tuning Guide Hi all, Many thanks for your tremendou

Re: SparkR query

2016-05-17 Thread Sun Rui
I guess that you are using an old version of Spark, 1.4. please try Spark version 1.5+ > On May 17, 2016, at 18:42, Mike Lewis wrote: > > Thanks, I’m just using RStudio. Running locally is fine, just issue with > having cluster in Linux and workers looking for Windows path, > Which must be bei

Re: How to use Kafka as data source for Structured Streaming

2016-05-17 Thread Saisai Shao
It is not supported now, currently only filestream is supported. Thanks Jerry On Wed, May 18, 2016 at 10:14 AM, Todd wrote: > Hi, > I am wondering whether structured streaming supports Kafka as data source. > I brief the source code(meanly related with DataSourceRegister trait), and > didn't fi

How to change output mode to Update

2016-05-17 Thread Todd
scala> records.groupBy("name").count().write.trigger(ProcessingTime("30 seconds")).option("checkpointLocation", "file:///home/hadoop/jsoncheckpoint").startStream("file:///home/hadoop/jsonresult") org.apache.spark.sql.AnalysisException: Aggregations are not supported on streaming DataFrames/Datas

Re: RE: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Gajanan Satone
Thanks for sharing, please consider me. Thanks, Gajanan On Wed, May 18, 2016 at 8:34 AM, 谭成灶 wrote: > Thanks for your sharing! > Please include me too > -- > From: Mich Talebzadeh > Sent: 2016/5/18 5:16 > To: user @spark > Subject: Re: My notes on Spark Performance

Load Table as DataFrame

2016-05-17 Thread Mohanraj Ragupathiraj
I have created a DataFrame from a HBase Table (PHOENIX) which has 500 million rows. From the DataFrame I created an RDD of JavaBean and use it for joining with data from a file. Map phoenixInfoMap = new HashMap(); phoenixInfoMap.put("table", tableName); phoenixInfoMap.put("zkUrl", zkURL); DataFram

Re: How to change output mode to Update

2016-05-17 Thread Ted Yu
Have you tried adding: .mode(SaveMode.Overwrite) On Tue, May 17, 2016 at 8:55 PM, Todd wrote: > scala> records.groupBy("name").count().write.trigger(ProcessingTime("30 > seconds")).option("checkpointLocation", > "file:///home/hadoop/jsoncheckpoint").startStream("file:///home/hadoop/jsonresu

SPARK - DataFrame for BulkLoad

2016-05-17 Thread Mohanraj Ragupathiraj
I have 100 million records to be inserted into an HBase table (PHOENIX) as the result of a Spark job. I would like to know: if I convert it to a DataFrame and save it, will it do a bulk load, or is that not an efficient way to write data to an HBase table? -- Thanks and Regards Mohan

Can Pyspark access Scala API?

2016-05-17 Thread Abi
Can PySpark access the Scala API? The accumulator in PySpark does not have the local variable available. The Scala API does have it available.

Re: Re: How to change output mode to Update

2016-05-17 Thread Todd
Thanks Ted. I didn't try, but I think SaveMode and OutputMode are different things. Currently, the Spark code contains two output modes, Append and Update. Append is the default mode, but it looks like there is no way to change to Update. Take a look at DataFrameWriter#startQuery. Thanks. At 2

Re: Re: How to change output mode to Update

2016-05-17 Thread Saisai Shao
> .mode(SaveMode.Overwrite) From my understanding, mode is not supported in a continuous query. def mode(saveMode: SaveMode): DataFrameWriter = { // mode() is used for non-continuous queries // outputMode() is used for continuous queries assertNotStreaming("mode() can only be called on non-co

Re: Re: How to change output mode to Update

2016-05-17 Thread Sachin Aggarwal
Hi, there is some code I have added in jira-15146; please have a look at it, I have not finished it. You can use the same code in your example as of now. On 18-May-2016 10:46 AM, "Saisai Shao" wrote: > > .mode(SaveMode.Overwrite) > > From my understanding mode is not supported in continuous query. > >

Re: Accessing Cassandra data from Spark Shell

2016-05-17 Thread Cassa L
Hi, I followed instructions to run SparkShell with Spark-1.6. It works fine. However, I need to use spark-1.5.2 version. With it, it does not work. I keep getting NoSuchMethod Errors. Is there any issue running Spark Shell for Cassandra using older version of Spark? Regards, LCassa On Tue, May 1

Re: Load Table as DataFrame

2016-05-17 Thread Jörn Franke
Do you have the full source code? Why do you convert a data frame to rdd - this does not make sense to me? > On 18 May 2016, at 06:13, Mohanraj Ragupathiraj wrote: > > I have created a DataFrame from a HBase Table (PHOENIX) which has 500 million > rows. From the DataFrame I created an RDD of J
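
Along the lines of that question, a minimal sketch of staying at the DataFrame level: load the Phoenix table and the file both as DataFrames and join them directly, rather than converting to an RDD of JavaBeans (the format string matches the phoenix-spark connector; the table name, zkUrl, input path and join column are placeholders):

  // the Phoenix table as a DataFrame (requires the phoenix-spark connector on the classpath)
  val phoenixDf = sqlContext.read
    .format("org.apache.phoenix.spark")
    .option("table", "MY_TABLE")             // placeholder table name
    .option("zkUrl", "zkhost:2181")          // placeholder ZooKeeper quorum
    .load()

  // the file side as a DataFrame too, then a plain DataFrame join instead of an RDD of JavaBeans
  val fileDf = sqlContext.read.json("hdfs:///data/input.json")   // hypothetical input file
  val joined = phoenixDf.join(fileDf, "ID")                      // assumes a common "ID" column
  joined.show(10)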

Re: duplicate jar problem in yarn-cluster mode

2016-05-17 Thread Saisai Shao
I think it is already fixed if your problem is exactly the same as what mentioned in this JIRA (https://issues.apache.org/jira/browse/SPARK-14423). Thanks Jerry On Wed, May 18, 2016 at 2:46 AM, satish saley wrote: > Hello, > I am executing a simple code with yarn-cluster > > --master > yarn-clu

Re: Re: Re: How to change output mode to Update

2016-05-17 Thread Todd
Hi Sachin, Could you please give the url of jira-15146? Thanks! At 2016-05-18 13:33:47, "Sachin Aggarwal" wrote: Hi, there is some code I have added in jira-15146 please have a look at it, I have not finished it. U can use the same code in ur example as of now On 18-May-2016 10:46 AM,

Re: Re: Re: How to change output mode to Update

2016-05-17 Thread Sachin Aggarwal
Sorry, my mistake, I gave the wrong id; here is the correct one: https://issues.apache.org/jira/browse/SPARK-15183 On Wed, May 18, 2016 at 11:19 AM, Todd wrote: > Hi Sachin, > > Could you please give the url of jira-15146? Thanks! > > > > > > At 2016-05-18 13:33:47, "Sachin Aggarwal" > wrote: > > Hi, ther

Re: Error joining dataframes

2016-05-17 Thread ram kumar
I tried scala> var df_join = df1.join(df2, "Id", "fullouter") :27: error: type mismatch; found : String("Id") required: org.apache.spark.sql.Column var df_join = df1.join(df2, "Id", "fullouter") ^ scala> And I can't see the above method in https://spa

Re: Error joining dataframes

2016-05-17 Thread Takeshi Yamamuro
You can use the api in spark-v1.6+. https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L454 // maropu On Wed, May 18, 2016 at 3:16 PM, ram kumar wrote: > I tried > > scala> var df_join = df1.join(df2, "Id", "fullouter") > :27: error: typ

Re: Error joining dataframes

2016-05-17 Thread ram kumar
If I run it as val rs = s.join(t,"time_id").join(c,"channel_id") it is taken as an inner join. On Wed, May 18, 2016 at 2:31 AM, Mich Talebzadeh wrote: > pretty simple, a similar construct to tables projected as DF > > val c = HiveContext.table("channels").select("CHANNEL_ID","CHANNEL_DESC") > val t = H

Re: Error joining dataframes

2016-05-17 Thread Takeshi Yamamuro
Also, you can pass the query that you'd like to use in spark-v1.6+; val df1 = Seq((1, 0), (2, 0), (3, 0)).toDF("id", "A") val df2 = Seq((1, 0), (2, 0), (3, 0)).toDF("id", "B") df1.join(df2, df1("id") === df2("id"), "outer").show // maropu On Wed, May 18, 2016 at 3:29 PM, ram kumar wrote: > If

Re: SPARK - DataFrame for BulkLoad

2016-05-17 Thread Takeshi Yamamuro
Hi, Have you checked this? http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3ccacyzca3askwd-tujhqi1805bn7sctguaoruhd5xtxcsul1a...@mail.gmail.com%3E // maropu On Wed, May 18, 2016 at 1:14 PM, Mohanraj Ragupathiraj < mohanaug...@gmail.com> wrote: > I have 100 million records to be

Re: Error joining dataframes

2016-05-17 Thread ram kumar
I tried df1.join(df2, df1("id") === df2("id"), "outer").show But there is a duplicate "id" and when I query the "id", I get Error: org.apache.spark.sql.AnalysisException: Reference 'Id' is ambiguous, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0) I am currently using spark 1.5.2. Is
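
If an upgrade to Spark 1.6+ is an option, the ambiguity can be avoided by joining on a Seq of column names (the API referenced earlier in the thread), which keeps a single id column in the output; a small sketch:

  // Spark 1.6+: joining on a Seq of column names keeps the key as a single output column
  import sqlContext.implicits._

  val df1 = Seq((1, "a"), (2, "b")).toDF("id", "A")
  val df2 = Seq((1, "x"), (3, "y")).toDF("id", "B")

  val joined = df1.join(df2, Seq("id"), "outer")
  joined.select("id").show()    // no "ambiguous reference" error: only one id column remains

For a full outer join, it is worth checking how that single id column is populated for rows present on only one side in the Spark version being used.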