Re: pyspark unable to convert dataframe column to a vector: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2016-03-29 Thread Jeff Zhang
According to the stack trace, it seems the HiveContext is not initialized correctly. Do you have any more error messages? On Tue, Mar 29, 2016 at 9:29 AM, Andy Davidson < a...@santacruzintegration.com> wrote: > I am using pyspark spark-1.6.1-bin-hadoop2.6 and python3. I have a data > frame with a

Re: Null pointer exception when using com.databricks.spark.csv

2016-03-29 Thread Hyukjin Kwon
Hi, I guess this is not a CSV-datasource specific problem. Does loading any file (e.g. textFile()) work as well? I think this is related to this thread, http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-running-example-scala-application-using-spark-submit-td10056.html .
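A minimal sketch of that diagnostic, assuming a local file path and a plain SparkContext/SQLContext rather than the poster's actual code: if the plain text read also fails, the problem lies in the file/Hadoop setup rather than in spark-csv.
```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="csv-npe-check")
sqlContext = SQLContext(sc)

# 1) Plain text read: if this throws the same NPE, spark-csv is not the culprit.
print(sc.textFile("file:///path/to/sample.csv").take(3))

# 2) The CSV datasource itself (needs the spark-csv package on the classpath,
#    e.g. a shell started with --packages com.databricks:spark-csv_2.10:1.4.0).
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("file:///path/to/sample.csv"))
df.show(3)
```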

Null pointer exception when using com.databricks.spark.csv

2016-03-29 Thread Selvam Raman
Hi, I am using the Spark 1.6.0 prebuilt Hadoop 2.6.0 version on my Windows machine. I was trying to use the Databricks CSV format to read a CSV file. I used the below command. [image: Inline image 1] I got a null pointer exception. Any help would be greatly appreciated. [image: Inline image 2] --

aggregateByKey on PairRDD

2016-03-29 Thread Suniti Singh
Hi All, I have an RDD having the data in the following form : tempRDD: RDD[(String, (String, String))] (brand , (product, key)) ("amazon",("book1","tech")) ("eBay",("book1","tech")) ("barns",("book","tech")) ("amazon",("book2","tech")) I would like to group the data by Brand and would
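A minimal PySpark sketch of one way to do that grouping with aggregateByKey, assuming the (brand, (product, key)) tuples from the question; it collects the per-brand pairs into a set.
```python
from pyspark import SparkContext

sc = SparkContext(appName="aggregateByKey-example")
tempRDD = sc.parallelize([
    ("amazon", ("book1", "tech")),
    ("eBay",   ("book1", "tech")),
    ("barns",  ("book",  "tech")),
    ("amazon", ("book2", "tech")),
])

grouped = tempRDD.aggregateByKey(
    set(),                          # zero value for each key
    lambda acc, v: acc | {v},       # fold a value into the per-partition set
    lambda a, b: a | b)             # merge sets from different partitions

print(grouped.collect())
# e.g. [('eBay', {('book1', 'tech')}), ('amazon', {('book1', 'tech'), ('book2', 'tech')}), ...]
```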

Re: DataFrame --> JSON objects, instead of un-named array of fields

2016-03-29 Thread 刘虓
Hi, Besides your solution ,yon can use df.write.format('json').save('a.json') 2016-03-29 4:11 GMT+08:00 Russell Jurney : > To answer my own question, DataFrame.toJSON() does this, so there is no > need to map and json.dump(): > > >
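Both approaches in one short sketch (assuming a DataFrame named df; paths are illustrative): toJSON() yields an RDD of JSON strings, while the json writer saves files.
```python
json_strings = df.toJSON().take(2)              # e.g. ['{"id":1,"name":"a"}', ...]
df.write.format('json').save('/tmp/out_json')   # assumed output path
# df.write.json('/tmp/out_json') is the equivalent shorthand.
```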

Re: Spark and N-tier architecture

2016-03-29 Thread Mark Hamstra
Our difference is mostly over whether n-tier means what it meant long ago, or whether it is a malleable concept that can be stretched without breaking to cover newer architectures. As I said before, if n-tier helps you think about Spark, then use it; if it doesn't, don't force it. On Tue, Mar

Re: Spark and N-tier architecture

2016-03-29 Thread Alexander Pivovarov
Here are step-by-step instructions on how to run it on EMR (or other) clusters: https://github.com/spark-jobserver/spark-jobserver/blob/master/doc/EMR.md On Tue, Mar 29, 2016 at 4:44 PM, Gavin Yue wrote: > It is a separate project based on my understanding. I am

Re: Spark and N-tier architecture

2016-03-29 Thread Gavin Yue
N-tiers or layers are mainly about separating a big problem into smaller pieces, so the idea is always valid; it just means different things for different applications. Speaking of offline analytics, or the big-data eco-world, there are numerous ways of slicing the problem into different tiers/layers.

Re: Spark and N-tier architecture

2016-03-29 Thread Alexander Pivovarov
Spark-jobserver was originally created by Ooyala. Now it's an open-source, Apache-licensed project.

Re: Spark and N-tier architecture

2016-03-29 Thread Gavin Yue
It is a separate project based on my understanding. I am currently evaluating it right now. > On Mar 29, 2016, at 16:17, Michael Segel wrote: > > > >> Begin forwarded message: >> >> From: Michael Segel >> Subject: Re: Spark and N-tier

Re: Spark and N-tier architecture

2016-03-29 Thread Mich Talebzadeh
Hi Mark, I beg to differ on the interpretation of N-tier architecture. Agreed that 3-tier, and by extrapolation N-tier, have been around since the days of client-server architecture. However, they are as valid today as 20 years ago. I believe the main recent expansion of n-tier has been on

Re: Spark and N-tier architecture

2016-03-29 Thread Mark Hamstra
Yes and no. The idea of n-tier architecture is about 20 years older than Spark and doesn't really apply to Spark as n-tier was originally conceived. If the n-tier model helps you make sense of some things related to Spark, then use it; but don't get hung up on trying to force a Spark architecture

Fwd: Spark and N-tier architecture

2016-03-29 Thread Michael Segel
> Begin forwarded message: > > From: Michael Segel > Subject: Re: Spark and N-tier architecture > Date: March 29, 2016 at 4:16:44 PM MST > To: Alexander Pivovarov > Cc: Mich Talebzadeh , Ashok Kumar > ,

Re: Spark and N-tier architecture

2016-03-29 Thread Ashok Kumar
Thank you both. So am I correct that Spark fits in within the application tier in N-tier architecture? On Tuesday, 29 March 2016, 23:50, Alexander Pivovarov wrote: Spark is a distributed data processing engine plus distributed in-memory / disk data cache 

Re: Unable to execute query on SAPHANA using SPARK

2016-03-29 Thread Mich Talebzadeh
I concur with Gourav on this. Both SAP HANA and Oracle Exalytics push serial scans into hardware: with HANA, it is by pushing the bitmaps into the L2 cache on the chip, whilst Oracle has special processors on SPARC T5 called D that offload the column bit scan off the CPU and onto separate

Re: Spark and N-tier architecture

2016-03-29 Thread Alexander Pivovarov
Spark is a distributed data processing engine plus distributed in-memory / disk data cache spark-jobserver provides REST API to your spark applications. It allows you to submit jobs to spark and get results in sync or async mode It also can create long running Spark context to cache RDDs in
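A hedged sketch of calling that REST API from Python, based on the jar-upload and job-submission endpoints shown in the spark-jobserver README; host, port, app name and class path below are illustrative assumptions, not the poster's setup.
```python
import requests

base = "http://localhost:8090"   # assumed spark-jobserver address

# Upload the application jar under an app name.
with open("job-server-tests.jar", "rb") as f:
    requests.post(base + "/jars/test", data=f.read())

# Submit a job synchronously and read the result.
resp = requests.post(base + "/jobs",
                     params={"appName": "test",
                             "classPath": "spark.jobserver.WordCountExample",
                             "sync": "true"},
                     data="input.string = a b c a b")
print(resp.json())
```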

Re: Running Spark on Yarn

2016-03-29 Thread Alexander Pivovarov
ok, start EMR-4.3.0 or 4.2.0 cluster and look at how to configure spark on yarn properly

Re: Running Spark on Yarn

2016-03-29 Thread Vineet Mishra
:~/Downloads/package/spark-1.6.1-bin-hadoop2.6$ bin/spark-shell --master yarn-client 16/03/30 03:24:43 DEBUG ipc.Client: IPC Client (111576772) connection to myhost/192.168.1.108:8032 from myhost sending #138 16/03/30 03:24:43 DEBUG ipc.Client: IPC Client (111576772) connection to

Re: Running Spark on Yarn

2016-03-29 Thread Vineet Mishra
Looks like it is still the same, while the other MR application is working fine. On Wed, Mar 30, 2016 at 3:15 AM, Alexander Pivovarov wrote: > for small cluster set the following settings > > yarn-site.xml > > > yarn.scheduler.minimum-allocation-mb > 32 > > > >

Re: Spark and N-tier architecture

2016-03-29 Thread Mich Talebzadeh
Interesting question. The most widely used application of N-tier is the traditional three-tier architecture that has been the backbone of Client-server architecture by having presentation layer, application layer and data layer. This is primarily for performance, scalability and maintenance. The

data frame problem preserving sort order with repartition() and coalesce()

2016-03-29 Thread Andy Davidson
I have a requirement to write my results out into a series of CSV files. No file may have more than 100 rows of data. In the past my data was not sorted, and I was able to use repartition() or coalesce() to ensure the file length requirement. I realize that repartition() causes the data to be
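A hedged sketch (not the poster's code) of one way to keep a global sort order while capping each output file at 100 rows: sort, number the rows, then send row number // 100 to partition i, so partition i holds rows 100*i .. 100*i+99 in order. The DataFrame name, sort column, and output path are assumptions.
```python
import math

MAX_ROWS = 100

indexed = df.orderBy("someColumn").rdd.zipWithIndex()          # (Row, globalIndex)
num_files = max(1, int(math.ceil(indexed.count() / float(MAX_ROWS))))

chunks = (indexed
          .map(lambda ri: (ri[1] // MAX_ROWS, ri[0]))          # key = target file number
          .repartitionAndSortWithinPartitions(
              numPartitions=num_files,
              partitionFunc=lambda k: k)                       # key i -> partition i
          .values())

# Each partition now holds at most 100 rows, still in global order.
chunks.map(lambda row: ",".join(str(c) for c in row)).saveAsTextFile("/tmp/csv_chunks")
```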

Re: Running Spark on Yarn

2016-03-29 Thread Alexander Pivovarov
for small cluster set the following settings yarn-site.xml yarn.scheduler.minimum-allocation-mb 32 capacity-scheduler.xml yarn.scheduler.capacity.maximum-am-resource-percent 0.5 Maximum percent of resources in the cluster which can be used to run application
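The flattened settings above, reconstructed as they would plausibly appear in the two files (standard YARN / CapacityScheduler property names; values and description are those quoted in the message):
```xml
<!-- yarn-site.xml -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>32</value>
</property>

<!-- capacity-scheduler.xml -->
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
  <description>Maximum percent of resources in the cluster which can be used
  to run application masters.</description>
</property>
```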

Re: Running Spark on Yarn

2016-03-29 Thread Vineet Mishra
Yarn seems to be running fine, I have successful MR jobs completed on the same, *Cluster Metrics* *Apps Submitted Apps Pending Apps Running Apps Completed Containers Running Memory Used Memory Total Memory Reserved VCores Used VCores Total VCores Reserved Active Nodes Decommissioned Nodes Lost

Re: Running Spark on Yarn

2016-03-29 Thread Alexander Pivovarov
check the resource manager and node manager logs. Maybe you will find something explaining why 1 app is pending. Do you have any app that ran successfully? *Apps Completed is 0 on the UI* On Tue, Mar 29, 2016 at 2:13 PM, Vineet Mishra wrote: > Hi Alex/Surendra, > > Hadoop is up and

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Eugene Morozov
Joseph, I'm using 1.6.0. -- Be well! Jean Morozov On Tue, Mar 29, 2016 at 10:09 PM, Joseph Bradley wrote: > First thought: 70K features is *a lot* for the MLlib implementation (and > any PLANET-like implementation) > > Using fewer partitions is a good idea. > > Which

Re: Running Spark on Yarn

2016-03-29 Thread Vineet Mishra
Hi Alex/Surendra, Hadoop is up and running fine and I am able to run example on the same. *Cluster Metrics* *Apps Submitted Apps Pending Apps Running Apps Completed Containers Running Memory Used Memory Total Memory Reserved VCores Used VCores Total VCores Reserved Active Nodes Decommissioned

Spark and N-tier architecture

2016-03-29 Thread Ashok Kumar
Experts, One of the terms I hear used within Big Data is N-tier architecture, used for availability, performance, etc. I also hear that Spark, by means of its query engine and in-memory caching, fits into the middle tier (application layer), with HDFS and Hive perhaps providing the data tier. Can

Re: Running Spark on Yarn

2016-03-29 Thread Alexander Pivovarov
check the 8088 UI - how many cores and how much memory are available - how many slaves are active. Run teragen or pi from the Hadoop examples to make sure that YARN works. On Tue, Mar 29, 2016 at 1:25 PM, Surendra , Manchikanti < surendra.manchika...@gmail.com> wrote: > Hi Vineeth, > > Can you please check

Re: Is streaming bisecting k-means possible?

2016-03-29 Thread dustind
The data source is elasticsearch, and I intend to use their module for spark support which provides an RDD, if that matters. -- View this message in context:

Re: Running Spark on Yarn

2016-03-29 Thread Surendra , Manchikanti
Hi Vineeth, Can you please check resource (RAM, cores) availability in your local cluster, and change accordingly. Regards, Surendra M -- Surendra Manchikanti On Tue, Mar 29, 2016 at 1:15 PM, Vineet Mishra wrote: > Hi All, > > While starting Spark on Yarn on local

Using spark.memory.useLegacyMode true does not yield expected behavior

2016-03-29 Thread Tom Hubregtsen
Hi, I am trying to get the same memory behavior in Spark 1.6 as I had in Spark 1.3 with default settings. I set --driver-java-options "--Dspark.memory.useLegacyMode=true -Dspark.shuffle.memoryFraction=0.2 -Dspark.storage.memoryFraction=0.6 -Dspark.storage.unrollFraction=0.2" in Spark 1.6. But
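The preview shows the properties written as "--D…", while Java system properties normally take a single -D. A hedged alternative sketch: the same legacy-memory keys set through SparkConf (shown in PySpark for illustration; values mirror those quoted above).
```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.memory.useLegacyMode", "true")
        .set("spark.shuffle.memoryFraction", "0.2")
        .set("spark.storage.memoryFraction", "0.6")
        .set("spark.storage.unrollFraction", "0.2"))

sc = SparkContext(conf=conf)
```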

Running Spark on Yarn

2016-03-29 Thread Vineet Mishra
Hi All, While starting Spark on Yarn on a local cluster (Single Node Hadoop 2.6 YARN) I am facing some issues. As I try to start the Spark Shell it keeps on iterating in an endless loop while initiating, *16/03/30 01:32:38 DEBUG ipc.Client: IPC Client (1782965120) connection to

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Joseph Bradley
First thought: 70K features is *a lot* for the MLlib implementation (and any PLANET-like implementation) Using fewer partitions is a good idea. Which Spark version was this on? On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov wrote: > The questions I have in mind: >

Vectors.sparse exception: TypeError: indices array must be sorted

2016-03-29 Thread Andy Davidson
I am using pyspark 1.6.1 and python3 Any idea what my bug is? Clearly the indices are being sorted? Could it be the numDimensions = 713912692155621377 and my indices are longs not ints? import numpy as np from pyspark.mllib.linalg import Vectors from pyspark.mllib.linalg import VectorUDT #sv1
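A minimal working sketch of Vectors.sparse for comparison: size and indices are plain Python ints and the indices are strictly increasing. If indices arrive as numpy int64 or unsorted values, converting them with sorted(int(i) for i in indices) is a reasonable first thing to try.
```python
from pyspark.mllib.linalg import Vectors

sv = Vectors.sparse(10, [1, 3, 7], [1.0, 2.0, 3.0])
print(sv)   # (10,[1,3,7],[1.0,2.0,3.0])
```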

Re: looking for an easy to to find the max value of a column in a data frame

2016-03-29 Thread Andy Davidson
Nice From: Alexander Krasnukhin Date: Tuesday, March 29, 2016 at 10:42 AM To: Andrew Davidson Cc: "user @spark" Subject: Re: looking for an easy to to find the max value of a column in a data frame > You can

Re: Unable to execute query on SAPHANA using SPARK

2016-03-29 Thread Gourav Sengupta
Hi Reena, Why would you want to run SPARK off data in SAP HANA? Is not SAP HANA already an in-memory, columnar-storage, SAP bells-and-whistles, super-duper expensive way of doing what poor people do in SPARK sans SAP ERP integration layers? I am just trying to understand the use case here.

Re: looking for an easy to to find the max value of a column in a data frame

2016-03-29 Thread Alexander Krasnukhin
You can even use the fact that pyspark has dynamic properties rows = idDF2.select(max("col[id]").alias("max")).collect() firstRow = rows[0] max = firstRow.max On Tue, Mar 29, 2016 at 7:14 PM, Alexander Krasnukhin wrote: > You should be able to index columns directly

Re: looking for an easy to to find the max value of a column in a data frame

2016-03-29 Thread Alexander Krasnukhin
You should be able to index columns directly either by index or column name i.e. from pyspark.sql.functions import max rows = idDF2.select(max("col[id]")).collect() firstRow = rows[0] # by index max = firstRow[0] # by column name max = firstRow["max(col[id])"] On Tue, Mar 29, 2016 at 6:58 PM,

Fwd: Master options Cluster/Client descrepencies.

2016-03-29 Thread satyajit vegesna
Hi All, I have written a Spark program on my dev box, IDE: IntelliJ, Scala version: 2.11.7, Spark version: 1.6.1. It runs fine from the IDE, by providing proper input and output paths including master. But when I try to deploy the code in my cluster made of the below, Spark

Re: looking for an easy to to find the max value of a column in a data frame

2016-03-29 Thread Andy Davidson
Hi Alexander, Many thanks. I think the key was I needed to import that max function. Turns out you do not need to use col: df.select(max("foo")).show() To get the actual value of max you still need to write more code than I would expect. I wonder if there is an easier way to work with Rows? In
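The thread's answer collected into one short sketch (assuming a DataFrame df with a column named "id"; importing max under an alias avoids shadowing Python's built-in max):
```python
from pyspark.sql.functions import max as sql_max

row = df.select(sql_max("id").alias("max_id")).collect()[0]
max_id = row["max_id"]     # equivalently row[0] or row.max_id
print(max_id)
```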

Re: Strange ML pipeline errors from HashingTF using v1.6.1

2016-03-29 Thread Timothy Potter
FWIW - I synchronized access to the transformer and the problem went away so this looks like some type of concurrent access issue when dealing with UDFs On Tue, Mar 29, 2016 at 9:19 AM, Timothy Potter wrote: > It's a local spark master, no cluster. I'm not sure what you

Re: Sending events to Kafka from spark job

2016-03-29 Thread Andy Davidson
Hi Fanoos, I would be careful about using collect(). You need to make sure your local computer has enough memory to hold your entire data set. Eventually I will need to do something similar; I have not written the code yet. My plan is to load the data into a data frame and then write a UDF that

Re: overriding spark.streaming.blockQueueSize default value

2016-03-29 Thread Spark Newbie
Pinging back. Hope someone else has seen this behavior where spark.streaming.blockQueueSize becomes a bottleneck. Is there a suggestion on how to adjust the queue size? Or any documentation on what the effects would be. It seems to be straightforward. But just trying to learn from others

Re: Stream are not serializable

2016-03-29 Thread hokam chauhan
Hi Crakjie, Did you find the solution for the below problems? Regards, Hokam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Stream-are-not-serializable-tp25185p26630.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Unable to execute query on SAPHANA using SPARK

2016-03-29 Thread Mich Talebzadeh
Sounds like a driver issue, as it can prepare the DF but cannot collect it. In general, as long as you have the correct JDBC drivers it should work. Have you modified spark-defaults.conf and added the driver list there? For example here I have both Oracle and Sybase IQ drivers

Re: Strange ML pipeline errors from HashingTF using v1.6.1

2016-03-29 Thread Timothy Potter
It's a local spark master, no cluster. I'm not sure what you mean about assembly or package? all of the Spark dependencies are on my classpath and this sometimes works. On Mon, Mar 28, 2016 at 11:45 PM, Jacek Laskowski wrote: > Hi, > > How do you run the pipeline? Do you

Re: Unable to execute query on SAPHANA using SPARK

2016-03-29 Thread Ted Yu
As the error said, com.sap.db.jdbc.topology.Host is not serializable. Maybe post question on Sap Hana mailing list (if any) ? On Tue, Mar 29, 2016 at 7:54 AM, reena upadhyay < reena.upadh...@impetus.co.in> wrote: > I am trying to execute query using spark sql on SAP HANA from spark > shell. I

Re: run spark job

2016-03-29 Thread Steve Loughran
On 29 Mar 2016, at 14:30, Fei Hu > wrote: Hi Jeff, Thanks for your info! I am developing a workflow system based on Oozie, but it only supports java and mapreduce now, so I want to run spark job as in local mode by the workflow system first, then

Re: Unable to Limit UI to localhost interface

2016-03-29 Thread David O'Gwynn
/etc/hosts 127.0.0.1 localhost conf/slaves 127.0.0.1 On Mon, Mar 28, 2016 at 5:36 PM, Mich Talebzadeh wrote: > in your /etc/hosts what do you have for localhost > > 127.0.0.1 localhost.localdomain localhost > > conf/slave should have one entry in your case > > cat

Re: run spark job

2016-03-29 Thread Fei Hu
Hi Jeff, Thanks for your info! I am developing a workflow system based on Oozie, but it only supports java and mapreduce now, so I want to run spark job as in local mode by the workflow system first, then extend the workflow system to run spark job on Yarn. Best wishes, Fei > On Mar 29,

Spark streaming spilling all the data to disk even if memory available

2016-03-29 Thread Mayur Mohite
Hi, We are running spark streaming app on a single machine and we have configured spark executor memory to 30G. We noticed that after running the app for 12 hours, spark streaming started spilling ALL the data to disk even though we have configured sufficient memory for spark to use for storage.

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Eugene Morozov
The questions I have in mind: Is it something that one might expect? From the stack trace itself it's not clear where it comes from. Is it an already known bug? Although I haven't found anything like that. Is it possible to configure something to work around / avoid this? I'm not sure it's the

Re: [Spark SQL] Unexpected Behaviour

2016-03-29 Thread Jerry Lam
Hi guys, Another point is that if this is unsupported shouldn't it throw an exception instead of giving the wrong answer? I mean if d1.join(d2, "id").select(d2("label")) should not work at all, the proper behaviour is to throw the analysis exception. It now returns a wrong answer though. As I

Re: [Spark SQL] Unexpected Behaviour

2016-03-29 Thread Jerry Lam
Hi Divya, This is not a self-join. d1 and d2 contain totally different rows. They are derived from the same table. The transformation that are applied to generate d1 and d2 should be able to disambiguate the labels in the question. Best Regards, Jerry On Tue, Mar 29, 2016 at 2:43 AM, Divya
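A hedged sketch of the usual workaround when two DataFrames derived from the same base become ambiguous: alias each side before the join and select through the alias. The base table and filters below are illustrative stand-ins for d1/d2 in the thread.
```python
from pyspark.sql.functions import col

d1 = base.filter(col("label") == 0).alias("d1")
d2 = base.filter(col("label") == 1).alias("d2")

joined = d1.join(d2, col("d1.id") == col("d2.id"))
result = joined.select(col("d1.id"), col("d2.label"))
```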

Re: How to reduce the Executor Computing Time.

2016-03-29 Thread Ted Yu
Can you disclose snippet of your code ? Which Spark release do you use ? Thanks > On Mar 29, 2016, at 3:42 AM, Charan Adabala wrote: > > From the below image how can we reduce the computing time for the stages, at > some stages the Executor Computing Time is less than

Re: Sending events to Kafka from spark job

2016-03-29 Thread fanooos
I think I found a solution, but I have no idea how this affects the execution of the application. At the end of the script I added a sleep statement: import time; time.sleep(1). This solved the problem. -- View this message in context:
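The sleep most likely just gives the asynchronous producer time to deliver before the driver exits. A hedged sketch of the more direct fix, assuming the kafka-python client (the original script's client isn't shown, so broker, file, and topic names are illustrative):
```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")   # assumed broker address
for line in open("tweets.txt"):                                # assumed input file
    producer.send("tweets", line.encode("utf-8"))              # assumed topic name
producer.flush()   # block until buffered records are actually delivered
producer.close()
```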

Re: Do not wrap result of a UDAF in an Struct

2016-03-29 Thread Michał Zieliński
Matthias, You don't need StructType, you can have ArrayType directly def bufferSchema: StructType = StructType(StructField("vals", DataTypes.createArrayType(StringType)) :: Nil) def dataType: DataType = DataTypes.createArrayType(StringType) def evaluate(buffer: Row): Any =

Re: Direct Kafka input stream and window(…) function

2016-03-29 Thread Martin Soch
Hi Cody, thanks for your answer. I have finally managed to create simple sample code. Here it is: import kafka.serializer.StringDecoder; import org.apache.spark.SparkConf; import org.apache.spark.streaming.Durations; import org.apache.spark.streaming.api.java.*; import

How to reduce the Executor Computing Time.

2016-03-29 Thread Charan Adabala
From the below image, how can we reduce the computing time for the stages? At some stages the Executor Computing Time is less than 1 sec and some are consuming more than 10 sec. Can anyone help how to reduce the Executor Computing Time. Thanks in Advance...

hadoop.ParquetOutputCommitter: could not write summary file

2016-03-29 Thread 李铖
An error occurred when writing parquet files to disk. Any advice? I want to know the reason. Thanks. ``` 16/03/29 18:31:48 WARN hadoop.ParquetOutputCommitter: could not write summary file for file:/tmp/goods/2015-6 java.lang.NullPointerException at

Sending events to Kafka from spark job

2016-03-29 Thread fanooos
I am trying to read stored tweets in a file and send it to Kafka using Spark python. The code is very simple but it does not work. The spark job runs correctly but nothing sent to Kafka Here is the code /#!/usr/bin/python # -*- coding: utf-8 -*- from pyspark import SparkContext, SparkConf

Do not wrap result of a UDAF in an Struct

2016-03-29 Thread Matthias Niehoff
Hi, given is a simple DF: root |-- id1: string (nullable = true) |-- id2: string (nullable = true) |-- val: string (nullable = true) I run a UDAF on this DF with groupBy($"id1", $"id2").agg(udaf($"val") as "valsStruct"). The aggregate simply stores all val values in a Set. The result is: root |--

Continuously INFO JobScheduler:59 - Added jobs for time *** ms, in my Spark Standalone Cluster.

2016-03-29 Thread Charan Adabala
We are working with a Spark Standalone Cluster with 8 cores and 32GB RAM, a 3-node cluster with the same configuration. Sometimes a streaming batch completes in less than 1 sec; sometimes it takes more than 10 secs, and at that time the log below appears in the console. 2016-03-29 11:35:25,044 INFO

SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Eugene Morozov
Hi, I have a web service that provides rest api to train random forest algo. I train random forest on a 5 nodes spark cluster with enough memory - everything is cached (~22 GB). On a small datasets up to 100k samples everything is fine, but with the biggest one (400k samples and ~70k features)

Re: StreamCorruptedException during deserialization

2016-03-29 Thread Robert Schmidtke
All the Jars and Java versions are consistent in my setup. In fact, I have Spark sorting 1TB of data using the exact same setup, except with another file system as storage for the data nodes. Could it be that there is actual corruption in the files written? On Tue, Mar 29, 2016 at 12:00 PM, Simon

Re: StreamCorruptedException during deserialization

2016-03-29 Thread Simon Hafner
2016-03-29 11:25 GMT+02:00 Robert Schmidtke : > Is there a meaningful way for me to find out what exactly is going wrong > here? Any help and hints are greatly appreciated! Maybe a version mismatch between the jars on the cluster?

StreamCorruptedException during deserialization

2016-03-29 Thread Robert Schmidtke
Hi everyone, I'm running the Intel HiBench TeraSort (1TB) Spark Scala benchmark on Spark 1.6.0. After some time, I'm seeing one task fail too many times, despite being rescheduled on different nodes with the following stacktrace: 16/03/27 22:25:04 WARN scheduler.TaskSetManager: Lost task 97.0 in

Re: Change TimeZone Setting in Spark 1.5.2

2016-03-29 Thread Mich Talebzadeh
If you start Spark on the host it will pick up the host timezone. If you want to change to a different timezone then, in the OS user for Spark, set the timezone with the TZ parameter in your Spark env file. For example, export TZ=GB and then start-master.sh HTH Dr Mich Talebzadeh LinkedIn *

Re: Does SparkSql has official jdbc/odbc driver?

2016-03-29 Thread Mich Talebzadeh
Many vendors like Progress Direct provide JDBC/ODBC drivers for Hive. As stated before Spark is effectively a query tool not a Data Warehouse like Hive. I am curious to know why you want to use Spark SQL on Hive tables whereas Hive itself provides richer SQL (in other words Spark SQL is a subset

Re: Does SparkSql has official jdbc/odbc driver?

2016-03-29 Thread Jorge Machado
This is for Hive, not for Spark; these updates differ from updates on an RDBMS. What they actually do is append it at the end of your file and then use a compaction process that only keeps the last record. Take a look at: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

Re: Does SparkSql has official jdbc/odbc driver?

2016-03-29 Thread Sage Meng
Hi, according to https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-NewConfigurationParametersforTransactions it seems that insert/delete/update can be supported, but I haven't try it. 2016-03-29 16:23 GMT+08:00 Jorge Machado : > Hi, > > you

Re: Does SparkSql has official jdbc/odbc driver?

2016-03-29 Thread Jorge Machado
Hi, you should know that "Spark" is not a relational database, so updates on data as you are used to in an RDBMS are not possible. Jorge Machado www.jmachado.me > On 29/03/2016, at 10:21, Sage Meng wrote: > > thanks, I found that I can use hive's jdbc driver to connect to

Re: Does SparkSql has official jdbc/odbc driver?

2016-03-29 Thread Sage Meng
thanks, I found that I can use hive's jdbc driver to connect to spark sql. I am curious whether simba's jdbc/odbc drivers to spark sql can support all standard sql statements, since I haven't tried third-party's jdbc/odbc driver and it seems that hive's jdbc driver can't support

Re: run spark job

2016-03-29 Thread Jeff Zhang
Yes you can. But this is actually what spark-submit does for you. Actually spark-submit do more than that. You can refer here https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala What's your purpose for using "java -cp", for local development,

Re: Does SparkSql has official jdbc/odbc driver?

2016-03-29 Thread alexpw
sage wrote > Hi all, > Does SparkSql has official jdbc/odbc driver? I only saw third-party's > jdbc/odbc driver. Hi Sage, Databricks licenses ODBC driver from Simba Technologies. Here's the link to announcement:

Change TimeZone Setting in Spark 1.5.2

2016-03-29 Thread Divya Gehlot
Hi, The Spark set up is on Hadoop cluster. How can I set up the Spark timezone to sync with Server Timezone ? Any idea? Thanks, Divya

Re: [Spark SQL] Unexpected Behaviour

2016-03-29 Thread Jerry Lam
Hi Sunitha, Thank you for the reference Jira. It looks like this is the bug I'm hitting. Most of the bugs related to this seems to associate with dataframes derived from the one dataframe (base in this case). In SQL, this is a self-join and dropping d2.label should not affect d1.label. There are