Announcing Spark SQL

2014-03-26 Thread Michael Armbrust
Hey Everyone, This already went out to the dev list, but I wanted to put a pointer here as well to a new feature we are pretty excited about for Spark 1.0. http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html Michael

Re: Announcing Spark SQL

2014-03-26 Thread Michael Armbrust
Any plans to make the SQL type-safe using something like Slick (http://slick.typesafe.com/)? I would really like to do something like that, and maybe we will in a couple of months. However, in the near term, I think the top priorities are going to be performance and stability. Michael

Re: Announcing Spark SQL

2014-03-29 Thread Michael Armbrust
On Fri, Mar 28, 2014 at 9:53 PM, Rohit Rai ro...@tuplejump.com wrote: Upon discussion with a couple of our clients, it seems the reason they would prefer using Hive is that they have already invested a lot in it, mostly in UDFs and HiveQL. 1. Are there any plans to develop the SQL Parser to

Re: Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?

2014-03-31 Thread Michael Armbrust
* unionAll preserves duplicates vs. union that does not — This is true; if you want to eliminate duplicate items you should follow the union with a distinct().
* SQL union and unionAll result in the same output format (i.e. another SQL result) vs. different RDD types here.
* Understand the existing union
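
A minimal sketch of the distinct-after-union pattern the reply describes (table names are hypothetical; assumes a Spark 1.0-era SQLContext with import sqlContext._):

    val a = sql("SELECT * FROM table_a")
    val b = sql("SELECT * FROM table_b")

    // unionAll keeps duplicate rows; chaining distinct() gives SQL UNION semantics.
    val deduped = a.unionAll(b).distinct()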

Re: groupBy RDD does not have grouping column ?

2014-03-31 Thread Michael Armbrust
This is similar to how SQL works: items in the GROUP BY clause are not included in the output by default. You will need to include 'a in the second parameter list (which is similar to the SELECT clause) as well if you want it included in the output. On Sun, Mar 30, 2014 at 9:52 PM, Manoj Samel
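
For example (a hedged sketch; the table and columns are hypothetical):

    // The grouping column must be listed in the SELECT clause to appear
    // in the output, just like in standard SQL.
    val grouped = sql("SELECT a, COUNT(b) FROM records GROUP BY a")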

Re: Error in SparkSQL Example

2014-03-31 Thread Michael Armbrust
`val people: RDD[Person] // An RDD of case class objects, from the first example.` is just a placeholder to avoid cluttering up each example with the same code for creating an RDD. The `: RDD[Person]` is just there to let you know the expected type of the variable 'people'. Perhaps there is a
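
A minimal sketch of the code that placeholder stands for, following the pattern in the programming guide (file path and fields are hypothetical; assumes an existing SparkContext sc):

    case class Person(name: String, age: Int)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._  // brings in the implicit RDD -> SchemaRDD conversion

    val people = sc.textFile("people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))

    people.registerAsTable("people")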

Re: Spark SQL transformations, narrow vs. wide

2014-04-03 Thread Michael Armbrust
I'm sorry, but I don't really understand what you mean when you say "wide" in this context. For a HashJoin, the only dependencies of the produced RDD are the two input RDDs. For BroadcastNestedLoopJoin the only dependency will be on the streamed RDD. The other RDD will be distributed to all

Re: Example of creating expressions for SchemaRDD methods

2014-04-04 Thread Michael Armbrust
In such construct, each operator builds on the previous one, including any materialized results etc. If I use a SQL for each of them, I suspect the later SQLs will not leverage the earlier SQLs by any means - hence these will be inefficient to first approach. Let me know if this is not

Re: Example of creating expressions for SchemaRDD methods

2014-04-04 Thread Michael Armbrust
Minor typo in the example. The first SELECT statement should actually be: sql("SELECT * FROM src") Where `src` is a Hive table with schema (key INT, value STRING). On Fri, Apr 4, 2014 at 11:35 AM, Michael Armbrust mich...@databricks.com wrote: In such construct, each operator builds

Re: Best way to turn an RDD back into a SchemaRDD

2014-04-09 Thread Michael Armbrust
Good question. This is something we wanted to fix, but unfortunately I'm not sure how to do it without changing the API to RDD, which is undesirable now that the 1.0 branch has been cut. We should figure something out though for 1.1. I've created https://issues.apache.org/jira/browse/SPARK-1460

Re: How do I access the SPARK SQL

2014-04-24 Thread Michael Armbrust
You shouldn't need to set SPARK_HIVE=true unless you want to use the JavaHiveContext. You should be able to access org.apache.spark.sql.api.java.JavaSQLContext with the default build. How are you building your application? Michael On Thu, Apr 24, 2014 at 9:17 AM, Andrew Or

Re: How do I access the SPARK SQL

2014-04-24 Thread Michael Armbrust
Oh, and you'll also need to add a dependency on spark-sql_2.10. On Thu, Apr 24, 2014 at 10:13 AM, Michael Armbrust mich...@databricks.com wrote: Yeah, you'll need to run `sbt publish-local` to push the jars to your local maven repository (~/.m2) and then depend on version 1.0.0-SNAPSHOT
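
A hedged build.sbt sketch for the setup described above (the project name is hypothetical; the SNAPSHOT version assumes the locally published 1.0 build):

    name := "my-spark-sql-app"

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT",
      "org.apache.spark" %% "spark-sql"  % "1.0.0-SNAPSHOT"
    )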

Re: Using Spark in IntelliJ Scala Console

2014-04-26 Thread Michael Armbrust
The spark-shell is a special version of the Scala REPL that serves the classes created for each line over HTTP. Do you know if the IntelliJ Spark console is just the normal Scala REPL in a GUI wrapper, or if it is something else entirely? If it's the former, perhaps it might be possible to tell

Re: Question about Transforming huge files from Local to HDFS

2014-04-26 Thread Michael Armbrust
1) When I tried to read a huge file from local and used Avro + Parquet to transform it into Parquet format and stored them to HDFS using the API saveAsNewAPIHadoopFile, the JVM would be out of memory, because the file is too large to be contained by memory. How much memory are you giving the

Re: Using Spark in IntelliJ Scala Console

2014-04-26 Thread Michael Armbrust
using sbt console it didn't work either. It only worked in spark project's bin/spark-shell Is there a way to customize the SBT console of a project listing spark as a dependency? Thx, Jon On Sat, Apr 26, 2014 at 9:42 PM, Michael Armbrust mich...@databricks.com wrote: The spark

Re: Using Spark in IntelliJ Scala Console

2014-04-26 Thread Michael Armbrust
You'll also need: libraryDependencies += "org.apache.spark" %% "spark-repl" % "<spark version>" On Sat, Apr 26, 2014 at 3:32 PM, Michael Armbrust mich...@databricks.com wrote: This is a little bit of a hack, but might work for you. You'll need to be on sbt 0.13.2. connectInput in run := true

Re: sbt/sbt run command returns a JVM problem

2014-05-03 Thread Michael Armbrust
The problem is probably not with the JVM running sbt but with the one that sbt is forking to run your program. See here for the relevant option: https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L186 You might try starting sbt with no arguments (to bring up the sbt console).

Re: Schema view of HadoopRDD

2014-05-16 Thread Michael Armbrust
Here is a link with more info: http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html On Wed, May 7, 2014 at 10:09 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, For each line that we read as textLine from HDFS, we have a schema..if there is an API that takes the

Re: Using Spark to analyze complex JSON

2014-05-24 Thread Michael Armbrust
But going back to your presented pattern, I have a question. Say your data does have a fixed structure, but some of the JSON values are lists. How would you map that to a SchemaRDD? (I didn't notice any list values in the CandyCrush example.) Take the likes field from my original example:
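
One way to model a JSON list value at this point is a case class with a Seq field (a hedged sketch; the class and field names are hypothetical, and the JSON parsing step is assumed to happen before this):

    // A record like {"user": "bob", "likes": ["spark", "sql"]} maps to:
    case class Tweet(user: String, likes: Seq[String])

    val tweets = sc.parallelize(Seq(
      Tweet("bob", Seq("spark", "sql")),
      Tweet("amy", Seq("streaming"))
    ))
    tweets.registerAsTable("tweets")  // requires import sqlContext._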

Re: Using Spark to analyze complex JSON

2014-05-25 Thread Michael Armbrust
On Sat, May 24, 2014 at 11:47 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: Is the in-memory columnar store planned as part of Spark SQL? This has already been ported from Shark, and is used when you run cacheTable. Also, will both HiveQL and the SQL parser be kept updated? Yeah, we need to

Re: Re: spark table to hive table

2014-05-28 Thread Michael Armbrust
On Tue, May 27, 2014 at 6:08 PM, JaeBoo Jung itsjb.j...@samsung.com wrote: I already tried HiveContext as well as SQLContext, but it seems that Spark's HiveContext is not completely the same as Apache Hive. For example, SQL like 'SELECT RANK() OVER(ORDER BY VAL1 ASC) FROM TEST LIMIT 10' works

Re: Spark SQL JDBC Connectivity

2014-05-29 Thread Michael Armbrust
On Wed, May 28, 2014 at 11:39 PM, Venkat Subramanian vsubr...@gmail.com wrote: We are planning to use the latest Spark SQL on RDDs. If a third party application wants to connect to Spark via JDBC, does Spark SQL have support? (We want to avoid going through Shark/Hive JDBC layer as we need good

Re: Re: how to construct a ClassTag object as a method parameter in Java

2014-06-03 Thread Michael Armbrust
What version of Spark are you using? Also

Re: SchemaRDD's saveAsParquetFile() throws java.lang.IncompatibleClassChangeError

2014-06-03 Thread Michael Armbrust
This thread seems to be about the same issue: https://www.mail-archive.com/user@spark.apache.org/msg04403.html On Tue, Jun 3, 2014 at 12:25 PM, k.tham kevins...@gmail.com wrote: I'm trying to save an RDD as a parquet file through the saveAsParquetFile() API, with code that looks something

Re: Is Spark-1.0.0 not backward compatible with Shark-0.9.1 ?

2014-06-06 Thread Michael Armbrust
There is not an official updated version of Shark for Spark 1.0 (though you might check out the untested spark-1.0 branch on GitHub). You can also check out the preview release of Shark that runs on Spark SQL: https://github.com/amplab/shark/tree/sparkSql Michael On Fri, Jun 6, 2014 at

Re: cache spark sql parquet file in memory?

2014-06-07 Thread Michael Armbrust
Not a stupid question! I would like to be able to do this. For now, you might try writing the data to Tachyon (http://tachyon-project.org/) instead of HDFS. This is untested though, so please report any issues you run into. Michael On Fri, Jun 6, 2014 at 8:13 PM, Xu (Simon) Chen xche...@gmail.com

Re: Spark SQL JDBC Connectivity and more

2014-06-09 Thread Michael Armbrust
[Venkat] Are you saying - pull in the SharkServer2 code in my standalone spark application (as a part of the standalone application process), pass in the spark context of the standalone app to the SharkServer2 SparkContext at startup and voila, we get SQL/JDBC interfaces for the RDDs of the

Re: Spark SQL standalone application compile error

2014-06-09 Thread Michael Armbrust
You need to add the following to your sbt file: libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.0.0" On Mon, Jun 9, 2014 at 9:25 PM, shlee0605 shlee0...@gmail.com wrote: I am having some trouble with compiling a Spark standalone application that uses the new Spark SQL feature. I have used

Re: Spark SQL incorrect result on GROUP BY query

2014-06-11 Thread Michael Armbrust
I'd try rerunning with master. It is likely you are running into SPARK-1994 https://issues.apache.org/jira/browse/SPARK-1994. Michael On Wed, Jun 11, 2014 at 3:01 AM, Pei-Lun Lee pl...@appier.com wrote: Hi, I am using spark 1.0.0 and found in spark sql some queries use GROUP BY give weird

Re: Spark SQL incorrect result on GROUP BY query

2014-06-12 Thread Michael Armbrust
Thanks for verifying! On Thu, Jun 12, 2014 at 12:28 AM, Pei-Lun Lee pl...@appier.com wrote: I reran with master and looks like it is fixed. 2014-06-12 1:26 GMT+08:00 Michael Armbrust mich...@databricks.com: I'd try rerunning with master. It is likely you are running into SPARK-1994

Re: Spark SQL - input command in web ui/event log

2014-06-12 Thread Michael Armbrust
Yeah, we should probably add that. Feel free to file a JIRA. You can get it manually by calling sc.setJobDescription with the query text before running the query. Michael On Thu, Jun 12, 2014 at 5:49 PM, shlee0605 shlee0...@gmail.com wrote: In shark, the input SQL string was shown at the
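
A minimal sketch of the workaround (the query is hypothetical; assumes an existing SparkContext sc and SQLContext sqlContext):

    val query = "SELECT COUNT(*) FROM logs"
    sc.setJobDescription(query)  // shown in the web UI for the jobs this query runs
    sqlContext.sql(query).collect()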

Re: SparkSQL registerAsTable - No TypeTag available Error

2014-06-14 Thread Michael Armbrust
Can you maybe attach the full scala file? On Sat, Jun 14, 2014 at 5:03 AM, premdass premdas...@yahoo.co.in wrote: Hi, I am trying to run the spark sql example provided on the example https://spark.apache.org/docs/latest/sql-programming-guide.html as a standalone program. When i try to

Re: SparkSQL registerAsTable - No TypeTag available Error

2014-06-14 Thread Michael Armbrust
Actually, are you defining Person as an inner class? You might be running into this: http://stackoverflow.com/questions/18866866/why-there-is-no-typetag-available-in-nested-instantiations-when-interpreted-by On Sat, Jun 14, 2014 at 1:51 PM, Michael Armbrust mich...@databricks.com wrote: Can
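
A hedged sketch of the fix: define the case class at the top level of the file, not nested inside another class or object, so the compiler can generate a TypeTag for it (all names here are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Top-level definition: a TypeTag can be generated for this class.
    case class Person(name: String, age: Int)

    object TypeTagExample {
      def main(args: Array[String]) {
        val sc = new SparkContext(new SparkConf().setAppName("typetag-example"))
        val sqlContext = new SQLContext(sc)
        import sqlContext._

        // If Person were defined inside this object instead, the implicit
        // conversion behind registerAsTable would fail with "No TypeTag available".
        val people = sc.parallelize(Seq(Person("Alice", 30)))
        people.registerAsTable("people")
        sql("SELECT name FROM people").collect().foreach(println)
      }
    }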

Re: How to add jar with SparkSQL HiveContext?

2014-06-17 Thread Michael Armbrust
Can you try this in master? You are likely running into SPARK-2128 https://issues.apache.org/jira/browse/SPARK-2128. Michael On Mon, Jun 16, 2014 at 11:41 PM, Earthson earthson...@gmail.com wrote: I have a problem with the add jar command: hql("add jar /.../xxx.jar") Error: Exception in

Re: Spark sql unable to connect to db2 hive metastore

2014-06-17 Thread Michael Armbrust
First a clarification: Spark SQL does not talk to HiveServer2, as that JDBC interface is for retrieving results from queries that are executed using Hive. Instead Spark SQL will execute queries itself by directly accessing your data using Spark. Spark SQL's Hive module can use JDBC to connect

Re: Spark streaming RDDs to Parquet records

2014-06-17 Thread Michael Armbrust
If you convert the data to a SchemaRDD you can save it as Parquet: http://spark.apache.org/docs/latest/sql-programming-guide.html#using-parquet On Tue, Jun 17, 2014 at 11:47 PM, Padmanabhan, Mahesh (contractor) mahesh.padmanab...@twc-contractor.com wrote: Thanks Krishna. Seems like you have
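
A hedged sketch of that conversion in a streaming job (the Event class, source, and paths are hypothetical; assumes sc and sqlContext already exist):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    case class Event(id: Long, body: String)

    val ssc = new StreamingContext(sc, Seconds(60))
    import sqlContext._  // implicit RDD[Event] -> SchemaRDD conversion

    ssc.socketTextStream("localhost", 9999)
      .map(_.split("\t"))
      .map(f => Event(f(0).toLong, f(1)))
      .foreachRDD { (rdd, time) =>
        // One Parquet directory per batch: Parquet writes fail if the path exists.
        createSchemaRDD(rdd).saveAsParquetFile(s"hdfs:///events/batch-${time.milliseconds}")
      }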

Re: Spark SQL: No function to evaluate expression

2014-06-17 Thread Michael Armbrust
Yeah, sorry that error message is not very intuitive. There is already a JIRA open to make it better: SPARK-2059 https://issues.apache.org/jira/browse/SPARK-2059 Also, a bug has been fixed in master regarding attributes that contain _. So if you are running 1.0 you might try upgrading. On

Re: get schema from SchemaRDD

2014-06-18 Thread Michael Armbrust
We just merged a feature into master that lets you print the schema or view it as a string (printSchema() and schemaTreeString on SchemaRDD). There is also this JIRA targeting 1.1 for presenting a nice programmatic API for this information: https://issues.apache.org/jira/browse/SPARK-2179 On

Re: Performance problems on SQL JOIN

2014-06-21 Thread Michael Armbrust
It's probably because our LEFT JOIN performance isn't super great ATM, since we'll use a nested loop join. Sorry! We are aware of the problem and there is a JIRA to let us do this with a HashJoin instead. If you are feeling brave you might try pulling in the related PR.

Re: Where Can I find the full documentation for Spark SQL?

2014-06-26 Thread Michael Armbrust
The programming guide is part of the standard documentation: http://spark.apache.org/docs/latest/sql-programming-guide.html Regarding specifics about SQL syntax and functions, I'd recommend using a HiveContext and the HQL method currently, as that is much more complete than the basic SQL parser

Re: SparkSQL- Nested CaseClass Parquet failure

2014-06-26 Thread Michael Armbrust
Nested parquet is not supported in 1.0, but is part of the upcoming 1.0.1 release. On Thu, Jun 26, 2014 at 3:03 PM, anthonyjschu...@gmail.com anthonyjschu...@gmail.com wrote: Hello all: I am attempting to persist a parquet file comprised of a SchemaRDD of nested case classes... Creating

Re: LIMIT with offset in SQL queries

2014-07-03 Thread Michael Armbrust
Doing an offset is actually pretty expensive in a distributed query engine, so in many cases it probably makes sense to just collect and then perform the offset as you are doing now. This is unless the offset is very large. Another limitation here is that HiveQL does not support OFFSET. That
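
A minimal sketch of the collect-then-drop approach (table and ordering column are hypothetical):

    // Emulate OFFSET 100 LIMIT 10 on the driver: over-fetch, then drop.
    // Reasonable for small offsets; wasteful when the offset is very large.
    val page = sql("SELECT * FROM logs ORDER BY ts LIMIT 110").collect().drop(100)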

Re: Which version of Hive support Spark Shark

2014-07-03 Thread Michael Armbrust
Spark SQL is based on Hive 0.12.0. On Thu, Jul 3, 2014 at 2:29 AM, Ravi Prasad raviprasa...@gmail.com wrote: Hi, Can anyone please help me understand which version of Hive supports Spark and Shark -- -- Regards, RAVI PRASAD. T

Re: Spark SQL user defined functions

2014-07-04 Thread Michael Armbrust
On Fri, Jul 4, 2014 at 1:59 AM, Martin Gammelsæter martingammelsae...@gmail.com wrote: is there any way to write user defined functions for Spark SQL? This is coming in Spark 1.1. There is a work in progress PR here: https://github.com/apache/spark/pull/1063 If you have a hive context, you
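
A hedged sketch of the API that PR adds for 1.1 (the UDF name and table are hypothetical, and the shape could still change before release):

    sqlContext.registerFunction("strLen", (s: String) => s.length)
    sqlContext.sql("SELECT strLen(name) FROM people")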

Re: Spark SQL user defined functions

2014-07-04 Thread Michael Armbrust
Sweet. Any idea about when this will be merged into master? It is probably going to be a couple of weeks. There is a fair amount of cleanup that needs to be done. It works though, and we used it in most of the demos at the Spark Summit. Mostly I just need to add tests and move it out of

Re: SQL FIlter of tweets (json) running on Disk

2014-07-04 Thread Michael Armbrust
sqlContext.jsonFile("data.json") Is this already available in the master branch??? Yes, and it will be available in the soon to come 1.0.1 release. But the question about using a combination of resources (memory processing + disk processing) still remains. This code should work
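
A minimal sketch of jsonFile once on 1.0.1 or master (the path and fields are hypothetical):

    val tweets = sqlContext.jsonFile("hdfs:///data/tweets.json")  // schema is inferred
    tweets.registerAsTable("tweets")
    val english = sqlContext.sql("SELECT text FROM tweets WHERE lang = 'en'")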

Re: LIMIT with offset in SQL queries

2014-07-04 Thread Michael Armbrust
well, reduces the iteration. I think an offset solution based on windowing directly would be useful. Mayur Rustagi On Fri, Jul 4, 2014 at 2:00 AM, Michael Armbrust mich...@databricks.com

Re: SparkSQL with sequence file RDDs

2014-07-07 Thread Michael Armbrust
I haven't heard any reports of this yet, but I don't see any reason why it wouldn't work. You'll need to manually convert the objects that come out of the sequence file into something where SparkSQL can detect the schema (i.e. scala case classes or java beans) before you can register the RDD as a

Re: Data loading to Parquet using spark

2014-07-07 Thread Michael Armbrust
SchemaRDDs, provided by Spark SQL, have a saveAsParquetFile command. You can turn a normal RDD into a SchemaRDD using the techniques described here: http://spark.apache.org/docs/latest/sql-programming-guide.html This should work with Impala, but if you run into any issues please let me know.
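
A minimal round-trip sketch (paths, table, and columns are hypothetical; people is a SchemaRDD):

    // Write a SchemaRDD out as Parquet, read it back, and query it.
    people.saveAsParquetFile("hdfs:///warehouse/people.parquet")

    val parquetPeople = sqlContext.parquetFile("hdfs:///warehouse/people.parquet")
    parquetPeople.registerAsTable("parquet_people")
    sqlContext.sql("SELECT name FROM parquet_people WHERE age > 21")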

Re: SparkSQL with sequence file RDDs

2014-07-07 Thread Michael Armbrust
We know Scala 2.11 has removed the limitation on the number of parameters, but Spark 1.0 is not compatible with it. So now we are considering using java beans instead of Scala case classes. You can also manually create a class that implements scala's Product interface. Finally, SPARK-2179

Re: Spark SQL user defined functions

2014-07-07 Thread Michael Armbrust
the 1.1 release. On Mon, Jul 7, 2014 at 12:25 AM, Martin Gammelsæter martingammelsae...@gmail.com wrote: Hi again, and thanks for your reply! On Fri, Jul 4, 2014 at 8:45 PM, Michael Armbrust mich...@databricks.com wrote: Sweet. Any idea about when this will be merged into master

Re: SparkSQL with sequence file RDDs

2014-07-07 Thread Michael Armbrust
Here is a simple example of registering an RDD of Products as a table. It is important that all of the fields are defined as vals in the constructor and that you implement canEqual, productArity and productElement. class Record(val x1: String) extends Product with Serializable { def canEqual(that:
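
A complete minimal version of the Product pattern described above might look like this (a sketch with a single field, not the original code):

    class Record(val x1: String) extends Product with Serializable {
      def canEqual(that: Any): Boolean = that.isInstanceOf[Record]
      def productArity: Int = 1
      def productElement(n: Int): Any = n match {
        case 0 => x1
        case _ => throw new IndexOutOfBoundsException(n.toString)
      }
    }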

Re: SparkSQL - Partitioned Parquet

2014-07-07 Thread Michael Armbrust
The only partitioning that is currently supported is through Hive partitioned tables. Supporting this for parquet as well is on our radar, but probably won't happen for 1.1. On Sun, Jul 6, 2014 at 10:00 PM, Raffael Marty ra...@pixlcloud.com wrote: Does SparkSQL support partitioned parquet

Re: Spark SQL registerAsTable requires a Java Class?

2014-07-08 Thread Michael Armbrust
This is on the roadmap for the next release (1.1) JIRA: SPARK-2179 https://issues.apache.org/jira/browse/SPARK-2179 On Mon, Jul 7, 2014 at 11:48 PM, Ionized ioni...@gmail.com wrote: The Java API requires a Java Class to register as table. // Apply a schema to an RDD of JavaBeans and

Re: Spark SQL registerAsTable requires a Java Class?

2014-07-08 Thread Michael Armbrust
you have an estimate on when some will be available?) On Tue, Jul 8, 2014 at 12:24 AM, Michael Armbrust mich...@databricks.com wrote: This is on the roadmap for the next release (1.1) JIRA: SPARK-2179 https://issues.apache.org/jira/browse/SPARK-2179 On Mon, Jul 7, 2014 at 11:48 PM

Re: [Spark SQL]: Convert SchemaRDD back to RDD

2014-07-08 Thread Michael Armbrust
On Tue, Jul 8, 2014 at 12:43 PM, Pierre B pierre.borckm...@realimpactanalytics.com wrote: 1/ Is there a way to convert a SchemaRDD (for instance loaded from a parquet file) back to a RDD of a given case class? There may be someday, but doing so will either require a lot of reflection or a

Re: Spark SQL - java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$

2014-07-09 Thread Michael Armbrust
At first glance that looks like an error with the class shipping in the spark shell. (i.e. the lines that you type into the spark shell are compiled into classes and then shipped to the executors where they run). Are you able to run other spark examples with closures in the same shell? Michael

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Michael Armbrust
On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam chiling...@gmail.com wrote: For the curious mind, the dataset is about 200-300GB and we are using 10 machines for this benchmark. Given the env is equal between the two experiments, why is pure Spark faster than Spark SQL? There is going to be some

Re: Potential bugs in SparkSQL

2014-07-10 Thread Michael Armbrust
Hi Jerry, Thanks for reporting this. It would be helpful if you could provide the output of the following command: println(hql("select s.id from m join s on (s.id=m_id)").queryExecution) Michael On Thu, Jul 10, 2014 at 8:15 AM, Jerry Lam chiling...@gmail.com wrote: Hi Spark developers, I

Re: SparkSQL - Language Integrated query - OR clause and IN clause

2014-07-10 Thread Michael Armbrust
I'll add that the SQL parser is very limited right now, and that you'll get much wider coverage using hql inside of HiveContext. We are working on bringing sql() much closer to SQL-92 though in the future. On Thu, Jul 10, 2014 at 7:28 AM, premdass premdas...@yahoo.co.in wrote: Thanks Takuya .

Re: EC2 Cluster script. Shark install fails

2014-07-10 Thread Michael Armbrust
There is no version of Shark that is compatible with Spark 1.0, however, Spark SQL does come included automatically. More information here: http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Michael Armbrust
SerDes overhead, then there must be something additional that SparkSQL adds to the overall overheads that Hive doesn't have. Best Regards, Jerry On Thu, Jul 10, 2014 at 7:11 PM, Michael Armbrust mich...@databricks.com wrote: On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam chiling

Re: Potential bugs in SparkSQL

2014-07-10 Thread Michael Armbrust
, Jerry On Thu, Jul 10, 2014 at 7:16 PM, Michael Armbrust mich...@databricks.com wrote: Hi Jerry, Thanks for reporting this. It would be helpful if you could provide the output of the following command: println(hql("select s.id from m join s on (s.id=m_id)").queryExecution) Michael

Re: SparkSql newbie problems with nested selects

2014-07-13 Thread Michael Armbrust
Hi Andy, The SQL parser is pretty basic (we plan to improve this for the 1.2 release). In this case I think part of the problem is that one of your variables is count, which is a reserved word. Unfortunately, we don't have the ability to escape identifiers at this point. However, I did manage

Re: Supported SQL syntax in Spark SQL

2014-07-13 Thread Michael Armbrust
Are you sure the code running on the cluster has been updated? We recently optimized the execution of LIKE queries that can be evaluated without using full regular expressions. So it's possible this error is due to missing functionality on the executors. How can I trace this down for a bug

Re: Supported SQL syntax in Spark SQL

2014-07-14 Thread Michael Armbrust
You can find the parser here: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala In general the hive parser provided by HQL is much more complete at the moment. Long term we will likely stop using parser combinators and either

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-14 Thread Michael Armbrust
This is not supported yet, but there is a PR open to fix it: https://issues.apache.org/jira/browse/SPARK-2446 On Mon, Jul 14, 2014 at 4:17 AM, Pei-Lun Lee pl...@appier.com wrote: Hi, I am using spark-sql 1.0.1 to load parquet files generated from method described in:

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Michael Armbrust
Yeah, sadly this dependency was introduced when someone consolidated the logging infrastructure. However, the dependency should be very small and thus easy to remove, and I would like catalyst to be usable outside of Spark. A pull request to make this possible would be welcome. Ideally, we'd

Re: Nested Query With Spark SQL(1.0.1)

2014-07-14 Thread Michael Armbrust
What sort of nested query are you talking about? Right now we only support nested queries in the FROM clause. I'd like to add support for other cases in the future. On Sun, Jul 13, 2014 at 4:11 AM, anyweil wei...@gmail.com wrote: Or is it supported? I know I could doing it myself with

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-14 Thread Michael Armbrust
Handling of complex types is somewhat limited in SQL at the moment. It'll be more complete if you use HiveQL. That said, the problem here is you are calling .name on an array. You need to pick an item from the array (using [..]) or use something like a lateral view explode. On Sat, Jul 12,
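
A hedged HiveQL sketch of both options (table and columns are hypothetical; assumes a HiveContext):

    // Pick a single element out of the array...
    hql("SELECT name, likes[0] FROM people")

    // ...or flatten the array with a lateral view explode.
    hql("SELECT name, likeItem FROM people LATERAL VIEW explode(likes) t AS likeItem")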

Change when loading/storing String data using Parquet

2014-07-14 Thread Michael Armbrust
I just wanted to send out a quick note about a change in the handling of strings when loading / storing data using parquet and Spark SQL. Before, Spark SQL did not support binary data in Parquet, so all binary blobs were implicitly treated as Strings. 9fe693

Re: jsonRDD: NoSuchMethodError

2014-07-14 Thread Michael Armbrust
Have you upgraded the cluster where you are running this to 1.0.1 as well? A NoSuchMethodError almost always means that the class files available at runtime are different from those that were there when you compiled your program. On Mon, Jul 14, 2014 at 7:06 PM, SK skrishna...@gmail.com wrote:

Re: Spark SQL throws ClassCastException on first try; works on second

2014-07-15 Thread Michael Armbrust
You might be hitting SPARK-1994 https://issues.apache.org/jira/browse/SPARK-1994, which is fixed in 1.0.1. On Mon, Jul 14, 2014 at 11:16 PM, Nick Chammas nicholas.cham...@gmail.com wrote: I’m running this query against RDD[Tweet], where Tweet is a simple case class with 4 fields.

Re: Nested Query With Spark SQL(1.0.1)

2014-07-15 Thread Michael Armbrust
In general this should be supported using [] to access array data and . to access nested fields. Is there something you are trying that isn't working? On Mon, Jul 14, 2014 at 11:25 PM, anyweil wei...@gmail.com wrote: I mean the query on the nested data such as JSON, not the nested query,

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Michael Armbrust
Sorry for the trouble. There are two issues here:
- Parsing of repeated nested fields (i.e. something[0].field) is not supported in the plain SQL parser. SPARK-2096 https://issues.apache.org/jira/browse/SPARK-2096
- Resolution is broken in the HiveQL parser. SPARK-2483

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-15 Thread Michael Armbrust
https://issues.apache.org/jira/browse/SPARK-2446? 2014-07-15 3:54 GMT+08:00 Michael Armbrust mich...@databricks.com: This is not supported yet, but there is a PR open to fix it: https://issues.apache.org/jira/browse/SPARK-2446 On Mon, Jul 14, 2014 at 4:17 AM, Pei-Lun Lee pl...@appier.com wrote

Re: Store one to many relation ship in parquet file with spark sql

2014-07-15 Thread Michael Armbrust
Make the Array a Seq. On Tue, Jul 15, 2014 at 7:12 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, How should I store a one to many relationship using spark sql and parquet format. For example I have the following case class case class Person(key: String, name: String, friends:
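
A hedged sketch of the suggested change, completing the truncated case class with Seq in place of Array (names and path are hypothetical):

    case class Person(key: String, name: String, friends: Seq[String])

    val people = sc.parallelize(Seq(Person("1", "Alice", Seq("Bob", "Carol"))))
    createSchemaRDD(people).saveAsParquetFile("people.parquet")  // requires import sqlContext._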

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Michael Armbrust
Are you registering multiple RDDs of case classes as tables concurrently? You are possibly hitting SPARK-2178 https://issues.apache.org/jira/browse/SPARK-2178 which is caused by SI-6240 https://issues.scala-lang.org/browse/SI-6240. On Tue, Jul 15, 2014 at 10:49 AM, Keith Simmons

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Michael Armbrust
, Jul 15, 2014 at 11:14 AM, Michael Armbrust mich...@databricks.com wrote: Are you registering multiple RDDs of case classes as tables concurrently? You are possibly hitting SPARK-2178 which is caused by SI-6240. On Tue, Jul 15, 2014 at 10:49 AM, Keith Simmons keith.simm...@gmail.com

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Michael Armbrust
powerful SQL support borrowed from Hive. Can you shed some lights on this when you get a minute? Thanks, Jerry On Tue, Jul 15, 2014 at 4:32 PM, Michael Armbrust mich...@databricks.com wrote: No, that is why I included the link to SPARK-2096 https://issues.apache.org/jira/browse/SPARK

Re: Read all the columns from a file in spark sql

2014-07-16 Thread Michael Armbrust
I think what you might be looking for is the ability to programmatically specify the schema, which is coming in 1.1. Here's the JIRA: SPARK-2179 https://issues.apache.org/jira/browse/SPARK-2179 On Wed, Jul 16, 2014 at 8:24 AM, pandees waran pande...@gmail.com wrote: Hi, I am newbie to spark

Re: Ambiguous references to id : what does it mean ?

2014-07-16 Thread Michael Armbrust
Yes, but if both tagCollection and selectedVideos have a column named id then Spark SQL does not know which one you are referring to in the where clause. Here's an example with aliases:

    val x = testData2.as('x)
    val y = testData2.as('y)
    val join = x.join(y, Inner, Some(x.a.attr ===

Re: Simple record matching using Spark SQL

2014-07-16 Thread Michael Armbrust
What if you just run something like: sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv").count() On Wed, Jul 16, 2014 at 10:37 AM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: Yes Soumya, I did it. First I tried with the example available in the documentation

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-16 Thread Michael Armbrust
the logical plan, it is executed in spark regardless of dialect although the execution might be different for the same query. Best Regards, Jerry On Tue, Jul 15, 2014 at 6:22 PM, Michael Armbrust mich...@databricks.com wrote: hql and sql are just two different dialects for interacting

Re: ClassNotFoundException: $line11.$read$ when loading an HDFS text file with SparkQL in spark-shell

2014-07-16 Thread Michael Armbrust
Note that running a simple map+reduce job on the same hdfs files with the same installation works fine: Did you call collect() on the totalLength? Otherwise nothing has actually executed.

Re: ClassNotFoundException: $line11.$read$ when loading an HDFS text file with SparkQL in spark-shell

2014-07-16 Thread Michael Armbrust
Oh, I'm sorry... reduce is also an operation On Wed, Jul 16, 2014 at 3:37 PM, Michael Armbrust mich...@databricks.com wrote: Note that running a simple map+reduce job on the same hdfs files with the same installation works fine: Did you call collect() on the totalLength? Otherwise

Re: ClassNotFoundException: $line11.$read$ when loading an HDFS text file with SparkQL in spark-shell

2014-07-16 Thread Michael Armbrust
Hmmm, it could be some weirdness with classloaders / Mesos / spark sql? I'm curious whether you would hit an error if there were no lambda functions involved. Perhaps if you load the data using jsonFile or parquetFile. Either way, I'd file a JIRA. Thanks! On Jul 16, 2014 6:48 PM, Svend

Re: Release date for new pyspark

2014-07-16 Thread Michael Armbrust
You should try cleaning and then building. We have recently hit a bug in the scala compiler that sometimes causes non-clean builds to fail. On Wed, Jul 16, 2014 at 7:56 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yeah, we try to have a regular 3 month release cycle; see

Re: Simple record matching using Spark SQL

2014-07-17 Thread Michael Armbrust
$CLASSPATH $CONFIG_OPTS test.Test4 spark://master:7077 /usr/local/spark-1.0.1-bin-hadoop1 hdfs://master:54310/user/hduser/file1.csv hdfs://master:54310/user/hduser/file2.csv* ~Sarath On Wed, Jul 16, 2014 at 8:14 PM, Michael Armbrust mich...@databricks.com wrote: What if you just run

Re: class after join

2014-07-17 Thread Michael Armbrust
If you intern the string it will be more efficient, but still significantly more expensive than the class based approach. ** VERY EXPERIMENTAL ** We are working with EPFL on a lightweight syntax for naming the results of spark transformations in scala (and are going to make it interoperate with

Re: Apache kafka + spark + Parquet

2014-07-17 Thread Michael Armbrust
We don't have support for partitioned parquet yet. There is a JIRA here: https://issues.apache.org/jira/browse/SPARK-2406 On Thu, Jul 17, 2014 at 5:00 PM, Tathagata Das tathagata.das1...@gmail.com wrote: val kafkaStream = KafkaUtils.createStream(... ) // see the example in my previous post

Re: incompatible local class serialVersionUID with spark Shark

2014-07-18 Thread Michael Armbrust
There is no version of shark that works with spark 1.0. More details about the path forward here: http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html On Jul 18, 2014 4:53 AM, Megane1994 leumenilari...@yahoo.fr wrote: Hello, I want to run

Re: TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: id#0 on GROUP BY

2014-07-18 Thread Michael Armbrust
Sorry for the non-obvious error message. It is not valid SQL to include attributes in the select clause unless they are also in the group by clause or are inside of an aggregate function. On Jul 18, 2014 5:12 AM, Martin Gammelsæter martingammelsae...@gmail.com wrote: Hi again! I am having

Re: Need help on Spark UDF (Join) Performance tuning .

2014-07-18 Thread Michael Armbrust
It's likely that, since your UDF is a black box to Hive's query optimizer, it must choose a less efficient join algorithm that passes all possible matches to your function for comparison. This will happen any time your UDF touches attributes from both sides of the join. In general you can

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-18 Thread Michael Armbrust
Can you tell us more about your environment? Specifically, are you also running on Mesos? On Jul 18, 2014 12:39 AM, Victor Sheng victorsheng...@gmail.com wrote: when I run a query to a hadoop file. mobile.registerAsTable(mobile) val count = sqlContext.sql(select count(1) from mobile) res5:

Re: Cannot connect to hive metastore

2014-07-18 Thread Michael Armbrust
See the section on advanced dependency management: http://spark.apache.org/docs/latest/submitting-applications.html On Jul 17, 2014 10:53 PM, linkpatrickliu linkpatrick...@live.com wrote: Seems like the mysql connector jar is not included in the classpath. Where can I set the jar to the

Re: can we insert and update with spark sql

2014-07-18 Thread Michael Armbrust
You can do INSERT INTO. As with other SQL-on-HDFS systems there is no updating of data. On Jul 17, 2014 1:26 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Is this what you are looking for? https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/sql/parquet/InsertIntoParquetTable.html

Re: spark sql left join gives KryoException: Buffer overflow

2014-07-18 Thread Michael Armbrust
Unfortunately, this is a query where we just don't have an efficient implementation yet. You might try switching the table order. Here is the JIRA for doing something more efficient: https://issues.apache.org/jira/browse/SPARK-2212 On Fri, Jul 18, 2014 at 7:05 AM, Pei-Lun Lee

Re: registerAsTable can't be compiled

2014-07-19 Thread Michael Armbrust
Can you provide the code? Is Record a case class? And is it defined as a top level object? Also, have you done import sqlContext._? On Sat, Jul 19, 2014 at 3:39 AM, junius junius.z...@gmail.com wrote: Hello, I write code to practice Spark SQL based on the latest Spark version. But I get

Re: spark sql left join gives KryoException: Buffer overflow

2014-07-21 Thread Michael Armbrust
When SPARK-2211 is done, will spark sql automatically choose join algorithms? Is there some way to manually hint the optimizer? Ideally we will select the best algorithm for you. We are also considering ways to allow the user to hint.
