Re: Add partition support in saveAsParquet

2015-03-28 Thread Michael Armbrust
This is something we are hoping to support in Spark 1.4. We'll post more information to JIRA when there is a design. On Thu, Mar 26, 2015 at 11:22 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, Anyone has similar request? https://issues.apache.org/jira/browse/SPARK-6561 When we

Re: Can spark sql read existing tables created in hive

2015-03-27 Thread Michael Armbrust
Are you running on yarn? - If you are running in yarn-client mode, set HADOOP_CONF_DIR to /etc/hive/conf/ (or the directory where your hive-site.xml is located). - If you are running in yarn-cluster mode, the easiest thing to do is to add --files=/etc/hive/conf/hive-site.xml (or the path for

Re: Spark SQL queries hang forever

2015-03-26 Thread Michael Armbrust
Is it possible to jstack the executors and see where they are hanging? On Thu, Mar 26, 2015 at 2:02 PM, Jon Chase jon.ch...@gmail.com wrote: Spark 1.3.0 on YARN (Amazon EMR), cluster of 10 m3.2xlarge (8cpu, 30GB), executor memory 20GB, driver memory 10GB I'm using Spark SQL, mainly via

Re: Hive Table not from from Spark SQL

2015-03-26 Thread Michael Armbrust
What does show tables return? You can also run SET optionName to make sure that entries from your hive site are being read correctly. On Thu, Mar 26, 2015 at 4:02 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I have tables dw_bid that is created in Hive and has nothing to do with Spark. I have

Re: Missing an output location for shuffle. : (

2015-03-26 Thread Michael Armbrust
I would suggest looking for errors in the logs of your executors. On Thu, Mar 26, 2015 at 3:20 AM, 李铖 lidali...@gmail.com wrote: Again, when I run a Spark SQL query over a larger file, an error occurred. Has anyone got a fix for it? Please help me. Here is the track.

Re: column expression in left outer join for DataFrame

2015-03-25 Thread Michael Armbrust
tried the simpler join ( i.e. df_2.join(df_1) ) and got the same error stated above. I would like to know what is wrong with the join statement above. thanks On Tue, Mar 24, 2015 at 6:08 PM, Michael Armbrust mich...@databricks.com wrote: You need to use `===`, so

Re: OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread Michael Armbrust
You should also try increasing the perm gen size: -XX:MaxPermSize=512m On Wed, Mar 25, 2015 at 2:37 AM, Ted Yu yuzhih...@gmail.com wrote: Can you try giving Spark driver more heap ? Cheers On Mar 25, 2015, at 2:14 AM, Todd Leo sliznmail...@gmail.com wrote: Hi, I am using *Spark SQL*
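
A sketch, not from the original thread, of how these JVM options are commonly supplied; the executor setting below is a standard Spark property, while the driver flag is usually given on the spark-submit command line:

    // assumption: standard Spark configuration keys; values are illustrative
    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=512m")
    // for the driver, pass e.g. --driver-memory 10g --driver-java-options "-XX:MaxPermSize=512m" to spark-submit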

Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Michael Armbrust
The only way to do in using python currently is to use the string based filter API (where you pass us an expression as a string, and we parse it using our SQL parser). from pyspark.sql import Row from pyspark.sql.functions import * df = sc.parallelize([Row(name=test)]).toDF() df.filter(name in

Re: Can a DataFrame be saved to s3 directly using Parquet?

2015-03-25 Thread Michael Armbrust
Until then you can try sql("SET spark.sql.parquet.useDataSourceApi=false") On Wed, Mar 25, 2015 at 12:15 PM, Michael Armbrust mich...@databricks.com wrote: This will be fixed in Spark 1.3.1: https://issues.apache.org/jira/browse/SPARK-6351 and is fixed in master/branch-1.3 if you want

Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Michael Armbrust
way to do this that lines up more naturally with the way things are supposed to be done in SparkSQL? On Wed, Mar 25, 2015 at 2:29 PM, Michael Armbrust mich...@databricks.com wrote: The only way to do in using python currently is to use the string based filter API (where you pass us

Re: Can a DataFrame be saved to s3 directly using Parquet?

2015-03-25 Thread Michael Armbrust
This will be fixed in Spark 1.3.1: https://issues.apache.org/jira/browse/SPARK-6351 and is fixed in master/branch-1.3 if you want to compile from source On Wed, Mar 25, 2015 at 11:59 AM, Stuart Layton stuart.lay...@gmail.com wrote: I'm trying to save a dataframe to s3 as a parquet file but I'm

Re: column expression in left outer join for DataFrame

2015-03-25 Thread Michael Armbrust
. WHERE tab1.country = tab2.country) and had no problems getting the correct result. thanks On Wed, Mar 25, 2015 at 11:05 AM, Michael Armbrust mich...@databricks.com wrote: Unfortunately you are now hitting a bug (that is fixed in master and will be released in 1.3.1 hopefully next week

Re: trouble with jdbc df in python

2015-03-25 Thread Michael Armbrust
Try: db = sqlContext.load(source="jdbc", url="jdbc:postgresql://localhost/xx", dbtables="mstr.d_customer") On Wed, Mar 25, 2015 at 2:19 PM, elliott cordo elliottco...@gmail.com wrote: if i run the following: db = sqlContext.load("jdbc", url="jdbc:postgresql://localhost/xx", dbtables="mstr.d_customer")

Re: trouble with jdbc df in python

2015-03-25 Thread Michael Armbrust
://spark.apache.org/docs/latest/sql-programming-guide.html#dataframe-operations needs to be updated: [image: Inline image 1] On Wed, Mar 25, 2015 at 6:12 PM, Michael Armbrust mich...@databricks.com wrote: Try: db = sqlContext.load(source=jdbc, url=jdbc:postgresql://localhost/xx, dbtables=mstr.d_customer

Re: filter expression in API document for DataFrame

2015-03-25 Thread Michael Armbrust
Yeah sorry, this is already fixed but we need to republish the docs. I'll add that both of the following do work: people.filter("age > 30") people.filter(people("age") > 30) On Tue, Mar 24, 2015 at 7:11 PM, SK skrishna...@gmail.com wrote: The following statement appears in the Scala API example at

Re: 1.3 Hadoop File System problem

2015-03-24 Thread Michael Armbrust
You are probably hitting SPARK-6351 https://issues.apache.org/jira/browse/SPARK-6351, which will be fixed in 1.3.1 (hopefully cutting an RC this week). On Tue, Mar 24, 2015 at 4:55 PM, Jim Carroll jimfcarr...@gmail.com wrote: I have code that works under 1.2.1 but when I upgraded to 1.3.0 it

Re: column expression in left outer join for DataFrame

2015-03-24 Thread Michael Armbrust
You need to use `===`, so that you are constructing a column expression instead of evaluating the standard scala equality method. Calling methods to access columns (i.e. df.country) is only supported in python. val join_df = df1.join(df2, df1("country") === df2("country"), "left_outer") On Tue, Mar

Re: Dataframe groupby custom functions (python)

2015-03-24 Thread Michael Armbrust
The only UDAFs that we support today are those defined using the Hive UDAF API. Otherwise you'll have to drop into Spark operations. I'd suggest opening a JIRA. On Tue, Mar 24, 2015 at 10:49 AM, jamborta jambo...@gmail.com wrote: Hi all, I have been trying out the new dataframe api in 1.3,

Re: Question about Data Sources API

2015-03-24 Thread Michael Armbrust
My question wrt Java/Scala was related to extending the classes to support new custom data sources, so was wondering if those could be written in Java, since our company is a Java shop. Yes, you should be able to extend the required interfaces using Java. The additional push downs I am

Re: SparkSQL UDTs with Ordering

2015-03-24 Thread Michael Armbrust
I'll caution that the UDTs are not a stable public interface yet. We'd like to do this someday, but currently this feature is mostly for MLlib as we have not finalized the API. Having an ordering could be useful, but I'll add that currently UDTs actually exist in serialized form, so the ordering

Re: Question about Data Sources API

2015-03-24 Thread Michael Armbrust
On Tue, Mar 24, 2015 at 12:57 AM, Ashish Mukherjee ashish.mukher...@gmail.com wrote: 1. Is the Data Source API stable as of Spark 1.3.0? It is marked DeveloperApi, but in general we do not plan to change even these APIs unless there is a very compelling reason to. 2. The Data Source API

Re: SchemaRDD/DataFrame result partitioned according to the underlying datasource partitions

2015-03-23 Thread Michael Armbrust
There is not an interface to this at this time, and in general I'm hesitant to open up interfaces where the user could make a mistake where they think something is going to improve performance but will actually impact correctness. Since, as you say, we are picking the partitioner automatically in

Re: Should Spark SQL support retrieve column value from Row by column name?

2015-03-22 Thread Michael Armbrust
Please open a JIRA, we added the info to Row that will allow this to happen, but we need to provide the methods you are asking for. I'll add that this does work today in python (i.e. row.columnName). On Sun, Mar 22, 2015 at 12:40 AM, amghost zhengweita...@gmail.com wrote: I would like to

Re: DataFrame saveAsTable - partitioned tables

2015-03-22 Thread Michael Armbrust
Note you can use HiveQL syntax for creating dynamically partitioned tables though. On Sun, Mar 22, 2015 at 1:29 PM, Michael Armbrust mich...@databricks.com wrote: Not yet. This is on the roadmap for Spark 1.4. On Sun, Mar 22, 2015 at 12:19 AM, deenar.toraskar deenar.toras...@db.com wrote
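
A rough sketch of that HiveQL route (table and column names are made up; assumes a HiveContext):

    // a sketch only: dynamic partitioning needs these Hive settings enabled
    hiveContext.sql("SET hive.exec.dynamic.partition=true")
    hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    hiveContext.sql("CREATE TABLE events (value STRING) PARTITIONED BY (dt STRING)")
    hiveContext.sql("INSERT OVERWRITE TABLE events PARTITION (dt) SELECT value, dt FROM staging_events")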

Re: DataFrame saveAsTable - partitioned tables

2015-03-22 Thread Michael Armbrust
Not yet. This is on the roadmap for Spark 1.4. On Sun, Mar 22, 2015 at 12:19 AM, deenar.toraskar deenar.toras...@db.com wrote: Hi I wanted to store DataFrames as partitioned Hive tables. Is there a way to do this via the saveAsTable call. The set of options does not seem to be documented.

Re: join two DataFrames, same column name

2015-03-22 Thread Michael Armbrust
You can include * and a column alias in the same select clause var df1 = sqlContext.sql("select *, column_id AS table1_id from table1") I'm also hoping to resolve SPARK-6376 https://issues.apache.org/jira/browse/SPARK-6376 before Spark 1.3.1 which will let you do something like: var df1 =

Re: saveAsTable broken in v1.3 DataFrames?

2015-03-21 Thread Michael Armbrust
I believe that you can get what you want by using HiveQL instead of the pure programmatic API. This is a little verbose so perhaps a specialized function would also be useful here. I'm not sure I would call it saveAsExternalTable as there are also external spark sql data source tables that have

Re: Did DataFrames break basic SQLContext?

2015-03-21 Thread Michael Armbrust
Now, I am not able to directly use my RDD object and have it implicitly become a DataFrame. It can be used as a DataFrameHolder, of which I could write: rdd.toDF.registerTempTable("foo") The rationale here was that we added a lot of methods to DataFrame and made the implicits more
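
A minimal sketch of the Spark 1.3-style conversion being discussed (class and table names are illustrative):

    import sqlContext.implicits._
    case class Record(a: Int, b: String)
    val rdd = sc.parallelize(Seq(Record(1, "x")))
    rdd.toDF().registerTempTable("foo")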

Re: Spark SQL UDT Kryo serialization, Unable to find class

2015-03-20 Thread Michael Armbrust
? On Tue, Mar 17, 2015 at 10:19 PM, Michael Armbrust mich...@databricks.com wrote: I'll caution you that this is not a stable public API. That said, it seems that the issue is that you have not copied the jar file containing your class to all of the executors. You should not need to do

Re: Spark SQL UDT Kryo serialization, Unable to find class

2015-03-17 Thread Michael Armbrust
I'll caution you that this is not a stable public API. That said, it seems that the issue is that you have not copied the jar file containing your class to all of the executors. You should not need to do any special configuration of serialization (you can't for SQL, as we hard code it for

Re: Need Advice about reading lots of text files

2015-03-17 Thread Michael Armbrust
/c then they lose all the other goodies we have in HadoopRDD, eg. the metric tracking. I think this encourages Pat's argument that we might actually need better support for this in spark context itself? On Sat, Mar 14, 2015 at 1:11 PM, Michael Armbrust mich...@databricks.com wrote: Here

Re: Should I do spark-sql query on HDFS or apache hive?

2015-03-17 Thread Michael Armbrust
The performance has more to do with the particular format you are using, not where the metadata is coming from. Even Hive tables are usually read from files on HDFS. You probably should use HiveContext as its query language is more powerful than SQLContext's. Also, parquet is usually the faster

Re: Should I do spark-sql query on HDFS or apache hive?

2015-03-17 Thread Michael Armbrust
lidali...@gmail.com wrote: Did you mean that parquet is faster than hive format, and hive format is faster than hdfs, for Spark SQL? : ) 2015-03-18 1:23 GMT+08:00 Michael Armbrust mich...@databricks.com: The performance has more to do with the particular format you are using, not where

Re: sqlContext.parquetFile doesn't work with s3n in version 1.3.0

2015-03-16 Thread Michael Armbrust
We will be including this fix in Spark 1.3.1 which we hope to make in the next week or so. On Mon, Mar 16, 2015 at 12:01 PM, Shuai Zheng szheng.c...@gmail.com wrote: I see, but this is really a… big issue. anyway for me to work around? I try to set the fs.default.name = s3n, but looks like it

Re: Need Advice about reading lots of text files

2015-03-14 Thread Michael Armbrust
Here is how I have dealt with many small text files (on s3 though this should generalize) in the past: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201411.mbox/%3ccaaswr-58p66-es2haxh4i+bu__0rvxd2okewkly0mee8rue...@mail.gmail.com%3E From Michael Armbrust

Re: Spark SQL 1.3 max operation giving wrong results

2015-03-14 Thread Michael Armbrust
Do you have an example that reproduces the issue? On Fri, Mar 13, 2015 at 4:12 PM, gtinside gtins...@gmail.com wrote: Hi , I am playing around with Spark SQL 1.3 and noticed that max function does not give the correct result i.e doesn't give the maximum value. The same query works fine in

Re: spark sql writing in avro

2015-03-13 Thread Michael Armbrust
BTW, I'll add that we are hoping to publish a new version of the Avro library for Spark 1.3 shortly. It should have improved support for writing data both programmatically and from SQL. On Fri, Mar 13, 2015 at 2:01 PM, Kevin Peng kpe...@gmail.com wrote: Markus, Thanks. That makes sense. I

Re: [SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-11 Thread Michael Armbrust
That val is not really your problem. In general, there is a lot of global state throughout the hive codebase that make it unsafe to try and connect to more than one hive installation from the same JVM. On Tue, Mar 10, 2015 at 11:36 PM, Haopu Wang hw...@qilinsoft.com wrote: Hao, thanks for the

Re: ANSI Standard Supported by the Spark-SQL

2015-03-10 Thread Michael Armbrust
Spark SQL supports a subset of HiveQL: http://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive On Mon, Mar 9, 2015 at 11:32 PM, Ravindra ravindra.baj...@gmail.com wrote: From the archives in this user list, It seems that Spark-SQL is yet to achieve SQL 92

Re: Spark 1.3 SQL Type Parser Changes?

2015-03-10 Thread Michael Armbrust
Thanks for reporting. This was a result of a change to our DDL parser that resulted in types becoming reserved words. I've filed a JIRA and will investigate if this is something we can fix. https://issues.apache.org/jira/browse/SPARK-6250 On Tue, Mar 10, 2015 at 1:51 PM, Nitay Joffe

Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
It's not required, but even if you don't have hive installed you probably still want to use the HiveContext. From earlier in that doc: In addition to the basic SQLContext, you can also create a HiveContext, which provides a superset of the functionality provided by the basic SQLContext.
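
A minimal sketch; no Hive installation is needed to construct the context (it creates a local metastore by default):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.sql("SHOW TABLES").collect().foreach(println)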

Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
On Fri, Mar 6, 2015 at 11:58 AM, sandeep vura sandeepv...@gmail.com wrote: Can i get document how to create that setup .i mean i need hive integration on spark http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
Only if you want to configure the connection to an existing hive metastore. On Fri, Mar 6, 2015 at 11:08 AM, sandeep vura sandeepv...@gmail.com wrote: Hi , For creating a Hive table do i need to add hive-site.xml in spark/conf directory. On Fri, Mar 6, 2015 at 11:12 PM, Michael Armbrust

Re: Data Frame types

2015-03-06 Thread Michael Armbrust
No, the UDT API is not a public API as we have not stabilized the implementation. For this reason its only accessible to projects inside of Spark. On Fri, Mar 6, 2015 at 8:25 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi Cesar, Yes, you can define an UDT with the new DataFrame, the same

Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
On Fri, Mar 6, 2015 at 11:56 AM, sandeep vura sandeepv...@gmail.com wrote: Yes i want to link with existing hive metastore. Is that the right way to link to hive metastore . Yes.

Re: External Data Source in Spark

2015-03-05 Thread Michael Armbrust
Currently we have implemented the External Data Source API and are able to push filters and projections. Could you provide some info on how perhaps the joins could be pushed to the original Data Source if both the data sources are from the same database. First a disclaimer: This is an

Re: SparkSQL JSON array support

2015-03-05 Thread Michael Armbrust
You can do what you want with lateral view explode, but what seems to be missing is that jsonRDD converts json objects into structs (fixed keys with a fixed order) and fields in a struct are accessed using a `.` val myJson = sqlContext.jsonRDD(sc.parallelize("""{"foo":[{"bar":1},{"baz":2}]}""" :: Nil))
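
A sketch of how the exploded struct fields might then be queried (assumes a HiveContext so that LATERAL VIEW parses; aliases are illustrative):

    myJson.registerTempTable("json")
    sqlContext.sql("SELECT ex.bar, ex.baz FROM json LATERAL VIEW explode(foo) t AS ex").collect()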

Re: External Data Source in Spark

2015-03-05 Thread Michael Armbrust
One other caveat: While writing up this example I realized that we make SparkPlan private and we are already packaging 1.3-RC3... So you'll need a custom build of Spark for this to run. We'll fix this in the next release. On Thu, Mar 5, 2015 at 5:26 PM, Michael Armbrust mich...@databricks.com

Re: Save and read parquet from the same path

2015-03-04 Thread Michael Armbrust
No, this is not safe to do. On Wed, Mar 4, 2015 at 7:14 AM, Karlson ksonsp...@siberie.de wrote: Hi all, what would happen if I save a RDD via saveAsParquetFile to the same path that RDD is originally read from? Is that a safe thing to do in Pyspark? Thanks!

Re: Spark SQL Static Analysis

2015-03-04 Thread Michael Armbrust
It is somewhat out of data, but here is what we have so far: https://github.com/marmbrus/sql-typed On Wed, Mar 4, 2015 at 12:53 PM, Justin Pihony justin.pih...@gmail.com wrote: I am pretty sure that I saw a presentation where SparkSQL could be executed with static analysis, however I cannot

Re: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: pyspark on yarn

2015-03-03 Thread Michael Armbrust
In Spark 1.2 you'll have to create a partitioned hive table https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AddPartitions in order to read parquet data in this format. In Spark 1.3 the parquet data source will auto discover partitions when they are laid out
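
For illustration, the Hive-style directory layout that the Spark 1.3 parquet data source discovers automatically (paths are made up):

    //   /data/events/date=2015-01-01/part-00000.parquet
    //   /data/events/date=2015-01-02/part-00000.parquet
    val events = sqlContext.parquetFile("/data/events")  // 'date' shows up as a partition column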

Re: LATERAL VIEW explode requests the full schema

2015-03-03 Thread Michael Armbrust
I believe that this has been optimized https://github.com/apache/spark/commit/2a36292534a1e9f7a501e88f69bfc3a09fb62cb3 in Spark 1.3. On Tue, Mar 3, 2015 at 4:36 AM, matthes matthias.diekst...@web.de wrote: I use LATERAL VIEW explode(...) to read data from a parquet-file but the full schema is

Re: Can not query TempTable registered by SQL Context using HiveContext

2015-03-03 Thread Michael Armbrust
As it says in the API docs https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD, tables created with registerTempTable are local to the context that creates them: ... The lifetime of this temporary table is tied to the SQLContext

Re: Dataframe v/s SparkSQL

2015-03-02 Thread Michael Armbrust
They are the same. These are just different ways to construct catalyst logical plans. On Mon, Mar 2, 2015 at 12:50 PM, Manoj Samel manojsamelt...@gmail.com wrote: Is it correct to say that Spark Dataframe APIs are implemented using same execution as SparkSQL ? In other words, while the

Re: Architecture of Apache Spark SQL

2015-03-02 Thread Michael Armbrust
Here is a description of the optimizer: https://docs.google.com/a/databricks.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit On Mon, Mar 2, 2015 at 10:18 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Here's the whole tech stack around it: [image: Inline image 1] For a

Re: Is SparkSQL optimizer aware of the needed data after the query?

2015-03-02 Thread Michael Armbrust
-dev +user No, lambda functions and other code are black-boxes to Spark SQL. If you want those kinds of optimizations you need to express the columns required in either SQL or the DataFrame DSL (coming in 1.3). On Mon, Mar 2, 2015 at 1:55 AM, Wail w.alkowail...@cces-kacst-mit.org wrote:

Re: SparkSQL production readiness

2015-03-02 Thread Michael Armbrust
at 5:17 PM, Michael Armbrust mich...@databricks.com wrote: We are planning to remove the alpha tag in 1.3.0. On Sat, Feb 28, 2015 at 12:30 AM, Wang, Daoyuan daoyuan.w...@intel.com wrote: Hopefully the alpha tag will be remove in 1.4.0, if the community can review code a little bit faster :P

Re: SparkSQL production readiness

2015-02-28 Thread Michael Armbrust
We are planning to remove the alpha tag in 1.3.0. On Sat, Feb 28, 2015 at 12:30 AM, Wang, Daoyuan daoyuan.w...@intel.com wrote: Hopefully the alpha tag will be remove in 1.4.0, if the community can review code a little bit faster :P Thanks, Daoyuan *From:* Ashish Mukherjee

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-02-28 Thread Michael Armbrust
I think its possible that the problem is that the scala compiler is not being loaded by the primordial classloader (but instead by some child classloader) and thus the scala reflection mirror is failing to initialize when it can't find it. Unfortunately, the only solution that I know of is to load

Re: Spark SQL Converting RDD to SchemaRDD without hardcoding a case class in scala

2015-02-27 Thread Michael Armbrust
http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema On Fri, Feb 27, 2015 at 1:39 PM, kpeng1 kpe...@gmail.com wrote: Hi All, I am currently trying to build out a spark job that would basically convert a csv file into parquet. From what I have
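
A sketch of the programmatic-schema approach from that guide (file, column names, and types are assumptions; in Spark 1.3 the types live under org.apache.spark.sql.types, while on 1.2 they are under org.apache.spark.sql and the call is applySchema):

    import org.apache.spark.sql._
    import org.apache.spark.sql.types._
    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("age", IntegerType)))
    val rows = sc.textFile("people.csv").map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
    val people = sqlContext.createDataFrame(rows, schema)
    people.saveAsParquetFile("people.parquet")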

Re: Failed to parse Hive query

2015-02-27 Thread Michael Armbrust
Do you have a hive-site.xml file or a core-site.xml file? Perhaps something is misconfigured there? On Fri, Feb 27, 2015 at 7:17 AM, Anusha Shamanur anushas...@gmail.com wrote: Hi, I am trying to do this in spark-shell: val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc) val

Re: Running spark function on parquet without sql

2015-02-27 Thread Michael Armbrust
From Zhan Zhang's reply, yes I still get the parquet's advantage. You will need to at least use SQL or the DataFrame API (coming in Spark 1.3) to specify the columns that you want in order to get the parquet benefits. The rest of your operations can be standard Spark. My next question is,
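
A sketch of what that might look like (column names are made up):

    val df = sqlContext.parquetFile("/data/events.parquet")
    val projected = df.select("userId", "amount")                       // only these columns are read from Parquet
    val result = projected.map(r => (r.getString(0), r.getDouble(1)))   // then ordinary Spark operations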

Re: group by order by fails

2015-02-26 Thread Michael Armbrust
Assign an alias to the count in the select clause and use that alias in the order by clause. On Wed, Feb 25, 2015 at 11:17 PM, Tridib Samanta tridib.sama...@live.com wrote: Actually I just realized , I am using 1.2.0. Thanks Tridib -- Date: Thu, 26 Feb 2015
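
A sketch of the suggested rewrite (table and column names are illustrative):

    sqlContext.sql("""
      SELECT category, COUNT(*) AS cnt
      FROM events
      GROUP BY category
      ORDER BY cnt DESC
    """)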

Re: Unable to run hive queries inside spark

2015-02-25 Thread Michael Armbrust
It looks like that is getting interpreted as a local path. Are you missing a core-site.xml file to configure hdfs? On Tue, Feb 24, 2015 at 10:40 PM, kundan kumar iitr.kun...@gmail.com wrote: Hi Denny, yes the user has all the rights to HDFS. I am running all the spark operations with this

Re: Spark SQL Where IN support

2015-02-23 Thread Michael Armbrust
Yes. On Mon, Feb 23, 2015 at 1:45 AM, Paolo Platter paolo.plat...@agilelab.it wrote: I was speaking about 1.2 version of spark Paolo *Da:* Paolo Platter paolo.plat...@agilelab.it *Data invio:* ‎lunedì‎ ‎23‎ ‎febbraio‎ ‎2015 ‎10‎:‎41 *A:* user@spark.apache.org Hi guys, Is the

Re: [Spark SQL]: Convert SchemaRDD back to RDD

2015-02-23 Thread Michael Armbrust
This is not currently supported. Right now you can only get RDD[Row] as Ted suggested. On Sun, Feb 22, 2015 at 2:52 PM, Ted Yu yuzhih...@gmail.com wrote: Haven't found the method in http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD The new DataFrame has

Re: Spark 1.3 SQL Programming Guide and sql._ / sql.types._

2015-02-20 Thread Michael Armbrust
Yeah, sorry. The programming guide has not been updated for 1.3. I'm hoping to get to that this weekend / next week. On Fri, Feb 20, 2015 at 9:55 AM, Denny Lee denny.g@gmail.com wrote: Quickly reviewing the latest SQL Programming Guide

Re: SchemaRDD.select

2015-02-19 Thread Michael Armbrust
The trick here is getting the scala compiler to do the implicit conversion from Symbol => Column. In your second example, the compiler doesn't know that you are going to try and use the Seq[Symbol] as a Seq[Column] and so doesn't do the conversion. The following are other ways to provide enough

Re: JsonRDD to parquet -- data loss

2015-02-18 Thread Michael Armbrust
Concurrent inserts into the same table are not supported. I can try to make this clearer in the documentation. On Tue, Feb 17, 2015 at 8:01 PM, Vasu C vasuc.bigd...@gmail.com wrote: Hi, I am running spark batch processing job using spark-submit command. And below is my code snippet.

Re: Spark Streaming and SQL checkpoint error: (java.io.NotSerializableException: org.apache.hadoop.hive.conf.HiveConf)

2015-02-16 Thread Michael Armbrust
You probably want to mark the HiveContext as @transient as its not valid to use it on the slaves anyway. On Mon, Feb 16, 2015 at 1:58 AM, Haopu Wang hw...@qilinsoft.com wrote: I have a streaming application which registered temp table on a HiveContext for each batch duration. The
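
A sketch of the pattern being suggested (field placement is illustrative):

    // keeps the context from being serialized with closures shipped to executors
    @transient lazy val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)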

Re: How to retreive the value from sql.row by column name

2015-02-16 Thread Michael Armbrust
For efficiency the row objects don't contain the schema so you can't get the column by name directly. I usually do a select followed by pattern matching. Something like the following: caper.select('ran_id).map { case Row(ranId: String) => ... } On Mon, Feb 16, 2015 at 8:54 AM, Eric Bell

Re: How to retreive the value from sql.row by column name

2015-02-16 Thread Michael Armbrust
implementation for SchemaRDD does in fact allow for referencing by name and column. Why is this provided in the python implementation but not scala or java implementations? Thanks, --eric On 02/16/2015 10:46 AM, Michael Armbrust wrote: For efficiency the row objects don't contain the schema

Re: SQLContext.applySchema strictness

2015-02-15 Thread Michael Armbrust
2015 at 9:18:59 AM Michael Armbrust mich...@databricks.com wrote: Doing runtime type checking is very expensive, so we only do it when necessary (i.e. you perform an operation like adding two columns together) On Sat, Feb 14, 2015 at 2:19 AM, nitin nitin2go...@gmail.com wrote: AFAIK

Re: New ColumnType For Decimal Caching

2015-02-15 Thread Michael Armbrust
, Feb 9, 2015 at 3:16 PM, Michael Armbrust mich...@databricks.com wrote: You could add a new ColumnType https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala . PRs welcome :) On Mon, Feb 9, 2015 at 3:01 PM, Manoj Samel manojsamelt

Re: SparkSQL and star schema

2015-02-14 Thread Michael Armbrust
Yes. Though for good performance it is usually important to make sure that you have statistics for the smaller dimension tables. Today that can be done by creating them in the hive metastore and running ANALYZE TABLE table COMPUTE STATISTICS noscan. In Spark 1.3 this will happen automatically
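
A sketch of computing those statistics through a HiveContext (the dimension table name is made up):

    hiveContext.sql("ANALYZE TABLE dim_customer COMPUTE STATISTICS noscan")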

Re: SQLContext.applySchema strictness

2015-02-14 Thread Michael Armbrust
Doing runtime type checking is very expensive, so we only do it when necessary (i.e. you perform an operation like adding two columns together) On Sat, Feb 14, 2015 at 2:19 AM, nitin nitin2go...@gmail.com wrote: AFAIK, this is the expected behavior. You have to make sure that the schema

Re: SparkSQL doesn't seem to like $'s in column names

2015-02-13 Thread Michael Armbrust
Try using `backticks` to escape non-standard characters. On Fri, Feb 13, 2015 at 11:30 AM, Corey Nolet cjno...@gmail.com wrote: I don't remember Oracle ever enforcing that I couldn't include a $ in a column name, but I also don't thinking I've ever tried. When using sqlContext.sql(...), I
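
For example (column and table names are illustrative):

    sqlContext.sql("SELECT `price$usd` FROM quotes")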

Re: Columnar-Oriented RDDs

2015-02-13 Thread Michael Armbrust
Shark's in-memory code was ported to Spark SQL and is used by default when you run .cache on a SchemaRDD or CACHE TABLE. I'd also look at parquet which is more efficient and handles nested data better. On Fri, Feb 13, 2015 at 7:36 AM, Night Wolf nightwolf...@gmail.com wrote: Hi all, I'd like
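
A sketch of both forms (table name is illustrative):

    schemaRDD.registerTempTable("events")
    sqlContext.cacheTable("events")              // in-memory columnar cache
    // equivalently: sqlContext.sql("CACHE TABLE events")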

Re: Task not serializable problem in the multi-thread SQL query

2015-02-12 Thread Michael Armbrust
It looks to me like perhaps your SparkContext has shut down due to too many failures. I'd look in the logs of your executors for more information. On Thu, Feb 12, 2015 at 2:34 AM, lihu lihu...@gmail.com wrote: I try to use the multi-thread to use the Spark SQL query. some sample code just

Re: Is it possible to expose SchemaRDD’s from thrift server?

2015-02-12 Thread Michael Armbrust
You can start a JDBC server with an existing context. See my answer here: http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html On Thu, Feb 12, 2015 at 7:24 AM, Todd Nist tsind...@gmail.com wrote: I have a question with regards to accessing
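
The pattern described in that answer looks roughly like this (a sketch; assumes the spark-hive-thriftserver module is on the classpath):

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
    val hiveContext = new HiveContext(sc)
    someDataFrame.registerTempTable("results")       // hypothetical DataFrame to expose
    HiveThriftServer2.startWithContext(hiveContext)  // JDBC clients can now query "results"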

Re: How to do broadcast join in SparkSQL

2015-02-12 Thread Michael Armbrust
In Spark 1.3, parquet tables that are created through the datasources API will automatically calculate the sizeInBytes, which is used to broadcast. On Thu, Feb 12, 2015 at 12:46 PM, Dima Zhiyanov dimazhiya...@hotmail.com wrote: Hello Has Spark implemented computing statistics for Parquet

Re: Using Spark SQL for temporal data

2015-02-12 Thread Michael Armbrust
Hi Corey, I would not recommend using the CatalystScan for this. It's lower level, and not stable across releases. You should be able to do what you want with PrunedFilteredScan
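
A rough outline of a relation using PrunedFilteredScan (Spark 1.3 sources API; the schema and body are placeholders, not a working data source):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
    import org.apache.spark.sql.types._

    class TemporalRelation(@transient val sqlContext: SQLContext)
        extends BaseRelation with PrunedFilteredScan {

      override def schema: StructType =
        StructType(Seq(StructField("ts", TimestampType), StructField("value", StringType)))

      // Spark hands down the referenced columns and simple predicates (e.g. GreaterThan("ts", ...))
      override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
        // restrict the underlying scan using `filters`, then emit Rows containing `requiredColumns`
        sqlContext.sparkContext.emptyRDD[Row]  // placeholder body
      }
    }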

Re: Extract hour from Timestamp in Spark SQL

2015-02-12 Thread Michael Armbrust
) at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:60) ... 73 more 2015-02-13 7:05 GMT+08:00 Michael Armbrust mich...@databricks.com: Can you post the whole stacktrace? On Wed, Feb 11, 2015 at 10:23 PM, Wush Wu w...@bridgewell.com wrote: Dear

Re: Using Spark SQL for temporal data

2015-02-12 Thread Michael Armbrust
I haven't been paying close attention to the JIRA tickets for PrunedFilteredScan but I noticed some weird behavior around the filters being applied when OR expressions were used in the WHERE clause. From what I was seeing, it looks like it could be possible that the start and end ranges you

Re: feeding DataFrames into predictive algorithms

2015-02-11 Thread Michael Armbrust
It sounds like you probably want to do a standard Spark map, that results in a tuple with the structure you are looking for. You can then just assign names to turn it back into a dataframe. Assuming the first column is your label and the rest are features you can do something like this: val df

Re: can we insert and update with spark sql

2015-02-10 Thread Michael Armbrust
distributed cache using Spark SQL ? If not what do you suggest we should use for such operations... Thanks. Deb On Fri, Jul 18, 2014 at 1:00 PM, Michael Armbrust mich...@databricks.com wrote: You can do insert into. As with other SQL on HDFS systems there is no updating of data. On Jul 17

Re: spark sql registerFunction with 1.2.1

2015-02-10 Thread Michael Armbrust
The simple SQL parser doesn't yet support UDFs. Try using a HiveContext. On Tue, Feb 10, 2015 at 1:44 PM, Mohnish Kodnani mohnish.kodn...@gmail.com wrote: Hi, I am trying a very simple registerFunction and it is giving me errors. I have a parquet file which I register as temp table. Then I
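
A sketch of the HiveContext route (function and table names are illustrative; on 1.2.x the call is registerFunction, which became udf.register in 1.3):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.registerFunction("toUpper", (s: String) => s.toUpperCase)
    hiveContext.sql("SELECT toUpper(name) FROM people")  // assumes a registered table named "people"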

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Michael Armbrust
of types String, Int and couple of decimal(14,4) On Mon, Feb 9, 2015 at 1:58 PM, Michael Armbrust mich...@databricks.com wrote: Is this nested data or flat data? On Mon, Feb 9, 2015 at 1:53 PM, Manoj Samel manojsamelt...@gmail.com wrote: Hi Michael, The storage tab shows the RDD resides

Re: SparkSQL DateTime

2015-02-09 Thread Michael Armbrust
The standard way to add timestamps is java.sql.Timestamp. On Mon, Feb 9, 2015 at 3:23 PM, jay vyas jayunit100.apa...@gmail.com wrote: Hi spark ! We are working on the bigpetstore-spark implementation in apache bigtop, and want to implement idiomatic date/time usage for SparkSQL. It appears

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Michael Armbrust
types are optimized in the in-memory storage and how are they optimized ? On Mon, Feb 9, 2015 at 2:33 PM, Michael Armbrust mich...@databricks.com wrote: You'll probably only get good compression for strings when dictionary encoding works. We don't optimize decimals in the in-memory columnar

Re: Spark SQL group by

2015-02-06 Thread Michael Armbrust
You can't use columns (timestamp) that aren't in the GROUP BY clause. Spark 1.2+ give you a better error message for this case. On Fri, Feb 6, 2015 at 3:12 PM, Mohnish Kodnani mohnish.kodn...@gmail.com wrote: Hi, i am trying to issue a sql query against a parquet file and am getting errors

Re: SQL group by on Parquet table slower when table cached

2015-02-06 Thread Michael Armbrust
Check the storage tab. Does the table actually fit in memory? Otherwise you are rebuilding column buffers in addition to reading the data off of the disk. On Fri, Feb 6, 2015 at 4:39 PM, Manoj Samel manojsamelt...@gmail.com wrote: Spark 1.2 Data stored in parquet table (large number of rows)

Re: How to get Hive table schema using Spark SQL or otherwise

2015-02-04 Thread Michael Armbrust
sqlContext.table(tableName).schema() On Wed, Feb 4, 2015 at 1:07 PM, Ayoub benali.ayoub.i...@gmail.com wrote: Given a hive context you could execute: hiveContext.sql(describe TABLE_NAME) you would get the name of the fields and their types 2015-02-04 21:47 GMT+01:00 nitinkak001 [hidden

Re: Setting maxPrintString in Spark Repl to view SQL query plans

2015-02-03 Thread Michael Armbrust
I'll add I usually just do println(query.queryExecution) On Tue, Feb 3, 2015 at 11:34 AM, Michael Armbrust mich...@databricks.com wrote: You should be able to do something like: sbt -Dscala.repl.maxprintstring=64000 hive/console Here's an overview of catalyst: https://docs.google.com

Re: Setting maxPrintString in Spark Repl to view SQL query plans

2015-02-03 Thread Michael Armbrust
You should be able to do something like: sbt -Dscala.repl.maxprintstring=64000 hive/console Here's an overview of catalyst: https://docs.google.com/a/databricks.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit#heading=h.vp2tej73rtm2 On Tue, Feb 3, 2015 at 1:37 AM, Mick Davies

Re: Why is DecimalType separate from DataType ?

2015-01-30 Thread Michael Armbrust
You are grabbing the singleton, not the class. You need to specify the precision (i.e. DecimalType.Unlimited or DecimalType(precision, scale)) On Fri, Jan 30, 2015 at 2:23 PM, Manoj Samel manojsamelt...@gmail.com wrote: Spark 1.2 While building schemaRDD using StructType xxx = new
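
For illustration, both forms mentioned (field names are made up; on Spark 1.2 the types live under org.apache.spark.sql rather than org.apache.spark.sql.types):

    import org.apache.spark.sql.types._
    val schema = StructType(Seq(
      StructField("amount", DecimalType(14, 4)),
      StructField("total", DecimalType.Unlimited)))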

Re: [hive context] Unable to query array once saved as parquet

2015-01-30 Thread Michael Armbrust
Is it possible that your schema contains duplicate columns or column with spaces in the name? The parquet library will often give confusing error messages in this case. On Fri, Jan 30, 2015 at 10:33 AM, Ayoub benali.ayoub.i...@gmail.com wrote: Hello, I have a problem when querying, with a

Re: SQL query over (Long, JSON string) tuples

2015-01-29 Thread Michael Armbrust
Eventually it would be nice for us to have some sort of function to do the conversion you are talking about on a single column, but for now I usually hack it as you suggested: val withId = origRDD.map { case (id, str) => s"""{"id": $id, ${str.trim.drop(1)}""" } val table = sqlContext.jsonRDD(withId) On

Re: Hive on Spark vs. SparkSQL using Hive ?

2015-01-29 Thread Michael Armbrust
I would characterize the difference as follows: Spark SQL http://spark.apache.org/docs/latest/sql-programming-guide.html is the native engine for processing structured data using Spark. In contrast to Shark or Hive on Spark, it has its own optimizer that was designed for the RDD model. It is

Re: schemaRDD.saveAsParquetFile creates large number of small parquet files ...

2015-01-29 Thread Michael Armbrust
You can use coalesce or repartition to control the number of file output by any Spark operation. On Thu, Jan 29, 2015 at 9:27 AM, Manoj Samel manojsamelt...@gmail.com wrote: Spark 1.2 on Hadoop 2.3 Read one big csv file, create a schemaRDD on it and saveAsParquetFile. It creates a large
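
A sketch of the suggested fix (the target partition count and path are illustrative):

    schemaRDD.coalesce(16).saveAsParquetFile("/data/out.parquet")
    // or repartition(n) if the data also needs to be redistributed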
