Re: Spark SQL check if query is completed (pyspark)

2014-09-07 Thread Michael Armbrust
Sometimes the underlying Hive code will also print exceptions during successful execution (for example CREATE TABLE IF NOT EXISTS). If there is actually a problem Spark SQL should throw an exception. What is the command you are running and what is the error you are seeing? On Sat, Sep 6, 2014

Re: SchemaRDD - Parquet - insertInto makes many files

2014-09-04 Thread Michael Armbrust
It depends on the RDD in question exactly where the work will be done. I believe that if you do a repartition(1) instead of a coalesce it will force a shuffle, so the work will be done in a distributed fashion and then a single node will read that shuffled data and write it out. If you want to write to a
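A minimal Scala sketch of the difference, assuming a SchemaRDD named results and a hypothetical output path:

    // coalesce(1) can pull the whole upstream computation onto one node;
    // repartition(1) forces a shuffle, so the upstream work stays
    // distributed and only the final write happens on a single node.
    val single = results.repartition(1)              // RDD with one partition
    single.saveAsTextFile("hdfs://host:9000/out")    // hypothetical path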

Re: SparkSQL TPC-H query 3 joining multiple tables

2014-09-03 Thread Michael Armbrust
Are you using SQLContext or HiveContext? The default sql dialect in HiveContext (HiveQL) is a little more complete and might be a better place to start. On Wed, Sep 3, 2014 at 2:12 AM, Samay smilingsa...@gmail.com wrote: Hi, I am trying to run query 3 from the TPC-H benchmark using

Re: flattening a list in spark sql

2014-09-02 Thread Michael Armbrust
Check out LATERAL VIEW explode: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView On Tue, Sep 2, 2014 at 1:26 PM, gtinside gtins...@gmail.com wrote: Hi , I am using jsonRDD in spark sql and having trouble iterating through array inside the json object. Please refer
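A sketch of the pattern inside a HiveContext, assuming a hypothetical table people with an array column orders:

    // explode() emits one output row per array element; LATERAL VIEW
    // joins those rows back to the enclosing row.
    val flattened = hiveContext.hql("""
      SELECT name, o
      FROM people
      LATERAL VIEW explode(orders) ordersTable AS o
    """)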

Re: flattening a list in spark sql

2014-09-02 Thread Michael Armbrust
Yes you can. HiveContext's functionality is a strict superset of SQLContext. On Tue, Sep 2, 2014 at 6:35 PM, gtinside gtins...@gmail.com wrote: Thanks. I am not using hive context. I am loading data from Cassandra and then converting it into json and then querying it through SQL context.

Re: Spark and Shark

2014-09-01 Thread Michael Armbrust
I don't believe that Shark works with Spark 1.0. Have you considered trying Spark SQL? On Mon, Sep 1, 2014 at 8:21 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I have installed Spark 1.0.2 and Shark 0.9.2 on Hadoop 2.4.1 (by compiling from source). spark: 1.0.2

Re: Spark SQL : how to find element where a field is in a given set

2014-08-29 Thread Michael Armbrust
:24 AM, Michael Armbrust mich...@databricks.com wrote: You don't need the Seq, as "in" is a variadic function. personTable.where('name in ("foo", "bar")) On Thu, Aug 28, 2014 at 3:09 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, What is the expression that I should use with spark sql

Re: Spark SQL : how to find element where a field is in a given set

2014-08-29 Thread Michael Armbrust
What version are you using? On Fri, Aug 29, 2014 at 2:22 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Still not working for me. I got a compilation error : *value in is not a member of Symbol.* Any ideas ? On Fri, Aug 29, 2014 at 9:46 AM, Michael Armbrust mich...@databricks.com wrote

Re: Spark Hive max key length is 767 bytes

2014-08-29 Thread Michael Armbrust
Spark SQL is based on Hive 0.12. They must have changed the maximum key size between 0.12 and 0.13. On Fri, Aug 29, 2014 at 4:38 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, Tried the same thing in HIVE directly without issue: HIVE: hive create table test_datatype2

Re: Spark SQL : how to find element where a field is in a given set

2014-08-29 Thread Michael Armbrust
This feature was not part of that version. It will be in 1.1. On Fri, Aug 29, 2014 at 12:33 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: 1.0.2 On Friday, August 29, 2014, Michael Armbrust mich...@databricks.com wrote: What version are you using? On Fri, Aug 29, 2014 at 2:22 AM

Re: Change delimiter when collecting SchemaRDD

2014-08-28 Thread Michael Armbrust
The comma is just the way the default toString works for Row objects. Since SchemaRDDs are also RDDs, you can do arbitrary transformations on the Row objects that are returned. For example, if you'd rather the delimiter was '|': sql("SELECT * FROM src").map(_.mkString("|")).collect() On Thu, Aug
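The same idea works for writing files, not just collecting; a sketch assuming a registered table named src and import sqlContext._:

    // Row behaves like a Seq, so mkString can render it with any delimiter.
    sql("SELECT * FROM src")
      .map(_.mkString("|"))
      .saveAsTextFile("hdfs://host:9000/pipe-delimited")  // hypothetical path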

Re: Does HiveContext support Parquet?

2014-08-27 Thread Michael Armbrust
I'll note the parquet jars are included by default in 1.1 On Wed, Aug 27, 2014 at 11:53 AM, lyc yanchen@huawei.com wrote: Thanks a lot. Finally, I can create parquet table using your command --driver-class-path. I am using hadoop 2.3. Now, I will try to load data into the tables.

Re: How to get prerelease thriftserver working?

2014-08-27 Thread Michael Armbrust
I would expect that to work. What exactly is the error? On Wed, Aug 27, 2014 at 6:02 AM, Matt Chu m...@kabam.com wrote: (apologies for sending this twice, first via nabble; didn't realize it wouldn't get forwarded) Hey, I know it's not officially released yet, but I'm trying to understand

Re: hive on spark yarn

2014-08-27 Thread Michael Armbrust
You need to have the datanucleus jars on your classpath. It is not okay to merge them into an uber jar. On Wed, Aug 27, 2014 at 1:44 AM, centerqi hu cente...@gmail.com wrote: Hi all When I run a simple SQL, encountered the following error. hive:0.12(metastore in mysql) hadoop 2.4.1

Re: SparkSQL returns ArrayBuffer for fields of type Array

2014-08-27 Thread Michael Armbrust
Arrays in the JVM are also mutable. However, you should not be relying on the exact type here. The only promise is that you will get back something of type Seq[_]. On Wed, Aug 27, 2014 at 4:27 PM, Du Li l...@yahoo-inc.com wrote: Hi, Michael. I used HiveContext to create a table with a

Re: SparkSQL returns ArrayBuffer for fields of type Array

2014-08-27 Thread Michael Armbrust
? From: Michael Armbrust mich...@databricks.com Date: Wednesday, August 27, 2014 at 5:21 PM To: Du Li l...@yahoo-inc.com Cc: user@spark.apache.org user@spark.apache.org Subject: Re: SparkSQL returns ArrayBuffer for fields of type Array Arrays in the JVM are also mutable. However, you should

Re: Spark QL and protobuf schema

2014-08-25 Thread Michael Armbrust
wrote: ok i'll try. happen to do that a lot to other tools. So I am guessing you are saying if i wanted to do it now, i'd start against https://github.com/apache/spark/tree/branch-1.1 and PR against it? On Thu, Aug 21, 2014 at 12:28 AM, Michael Armbrust mich...@databricks.com wrote: I do

Re: HiveContext ouput log file

2014-08-25 Thread Michael Armbrust
Just like with normal Spark Jobs, that command returns an RDD that contains the lineage for computing the answer but does not actually compute the answer. You'll need to run collect() on the RDD in order to get the result. On Mon, Aug 25, 2014 at 11:46 AM, S Malligarjunan

Re: SPARK Hive Context UDF Class Not Found Exception,

2014-08-25 Thread Michael Armbrust
Which version of Spark SQL are you using? Several issues with custom hive UDFs have been fixed in 1.1. On Mon, Aug 25, 2014 at 9:57 AM, S Malligarjunan smalligarju...@yahoo.com.invalid wrote: Hello All, I have added a jar from S3 instance into classpath, i have tried following options 1.

Re: [Spark SQL] How to select first row in each GROUP BY group?

2014-08-25 Thread Michael Armbrust
In our case, the ROW has about 80 columns which exceeds the case class limit. Starting with Spark 1.1 you'll be able to also use the applySchema API https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L126 .
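A minimal sketch of that API, with hypothetical field names; rows are built by hand, so the 22-field case class limit does not apply:

    import org.apache.spark.sql._

    val schema = StructType(
      StructField("name", StringType, nullable = true) ::
      StructField("age", IntegerType, nullable = true) :: Nil)

    // Build Rows directly instead of relying on a case class.
    val rowRDD = sc.textFile("people.txt")   // hypothetical input
      .map(_.split(","))
      .map(p => Row(p(0), p(1).trim.toInt))

    val people = sqlContext.applySchema(rowRDD, schema)
    people.registerTempTable("people")       // registerAsTable pre-1.1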

Re: Spark SQL: Caching nested structures extremely slow

2014-08-25 Thread Michael Armbrust
One useful thing to do when you run into unexpected slowness is to run 'jstack' a few times on the driver and executors and see if there is any particular hotspot in the Spark SQL code. Also, it seems like a better option here might be to use the new applySchema API

Re: Writeup on Spark SQL with GDELT

2014-08-25 Thread Michael Armbrust
Thanks for this very thorough write-up and for continuing to update it as you progress! As I said in the other thread it would be great to do a little profiling to see if we can get to the heart of the slowness with nested case classes (very little optimization has been done in this code path).

Re: Merging two Spark SQL tables?

2014-08-25 Thread Michael Armbrust
So I tried the above (why doesn't union or ++ have the same behavior btw?) I don't think there is a good reason for this. I'd open a JIRA. and it works, but is slow because the original RDDs are not cached and files must be read from disk. I also discovered you can recover the

Re: Spark QL and protobuf schema

2014-08-25 Thread Michael Armbrust
is approved for contribution, obviously PR process will be followed. On Mon, Aug 25, 2014 at 11:57 AM, Michael Armbrust mich...@databricks.com wrote: In general all PRs should be made against master. When necessary, we can back port them to the 1.1 branch as well. However, since we

Re: Storage Handlers in Spark SQL

2014-08-25 Thread Michael Armbrust
- dev list + user list You should be able to query Spark SQL using JDBC, starting with the 1.1 release. There is some documentation in the repo https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#running-the-thrift-jdbc-server, and we'll update the official docs once the

Re: Merging two Spark SQL tables?

2014-08-21 Thread Michael Armbrust
I believe this should work if you run srdd1.unionAll(srdd2). Both RDDs must have the same schema. On Wed, Aug 20, 2014 at 11:30 PM, Evan Chan velvia.git...@gmail.com wrote: Is it possible to merge two cached Spark SQL tables into a single table so it can queried with one SQL statement? ie,
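A sketch, assuming two SchemaRDDs with matching schemas and import sqlContext._:

    val merged = srdd1.unionAll(srdd2)   // schemas must line up
    merged.registerAsTable("merged")     // registerTempTable in Spark 1.1
    sql("SELECT count(*) FROM merged").collect()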

Re: Spark QL and protobuf schema

2014-08-21 Thread Michael Armbrust
I do not know of any existing way to do this. It should be possible using the new public API for applying schema (will be available in 1.1) to an RDD. Basically you'll need to convert the protobuf records into rows, and also create a StructType that represents the schema. With these two things

Re: Is Spark SQL Thrift Server part of the 1.0.2 release

2014-08-20 Thread Michael Armbrust
will be invoked from a middle tier webapp. I am thinking to use the Hive JDBC driver. Thanks, Ken *From:* Michael Armbrust [mailto:mich...@databricks.com] *Sent:* Wednesday, August 20, 2014 9:38 AM *To:* Tam, Ken K *Cc:* user@spark.apache.org *Subject:* Re: Is Spark SQL Thrift Server

Re: Does HiveContext support Parquet?

2014-08-16 Thread Michael Armbrust
Hi to all, sorry for not being fully on topic but I have 2 quick questions about Parquet tables registered in Hive/Spark: Using HiveQL to CREATE TABLE will add a table to the metastore / warehouse exactly as it would in hive. Registering is a purely temporary operation that lives with the

Re: SparkSQL Hive partitioning support

2014-08-13 Thread Michael Armbrust
This is not supported at the moment. There are no concrete plans to support it through the programmatic API, but it should work using SQL as you suggested. On Wed, Aug 13, 2014 at 8:22 AM, Silvio Fiorito silvio.fior...@granturing.com wrote: Using the SchemaRDD *insertInto*

Re: Support for ORC Table in Shark/Spark

2014-08-13 Thread Michael Armbrust
I would expect this to work with Spark SQL (available in 1.0+) but there is a JIRA open to confirm this works SPARK-2883 https://issues.apache.org/jira/browse/SPARK-2883. On Mon, Aug 11, 2014 at 10:23 PM, vinay.kash...@socialinfra.net wrote: Hi all, Is it possible to use table with ORC

Re: How to direct insert vaules into SparkSQL tables?

2014-08-13 Thread Michael Armbrust
I do not believe this is true. If you are using a hive context you should be able to register an RDD as a temporary table and then use INSERT INTO https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueriesto add data to a hive
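A sketch of the pattern, with hypothetical names; requires a HiveContext and its implicits:

    // assumes: val hiveContext = new HiveContext(sc); import hiveContext._
    case class Record(key: Int, value: String)
    val newData = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
    newData.registerAsTable("new_records")   // temporary table
    hql("INSERT INTO TABLE existing_hive_table SELECT * FROM new_records")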

Re: Spark SQL JDBC

2014-08-12 Thread Michael Armbrust
Hive pulls in a ton of dependencies that we were afraid would break existing spark applications. For this reason all hive submodules are optional. On Tue, Aug 12, 2014 at 7:43 AM, John Omernik j...@omernik.com wrote: Yin helped me with that, and I appreciate the onlist followup. A few

Re: CDH5, HiveContext, Parquet

2014-08-10 Thread Michael Armbrust
I imagine it's not the only instance of this kind of problem people will ever encounter. Can you rebuild Spark with this particular release of Hive? Unfortunately the Hive APIs that we use change too much from release to release to make this possible. There is a JIRA for compiling Spark SQL

Re: Spark SQL JSON dataset query nested datastructures

2014-08-10 Thread Michael Armbrust
Sounds like you need to use lateral view with explode https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView, which is supported in Spark SQL's HiveContext. On Sat, Aug 9, 2014 at 6:43 PM, Sathish Kumaran Vairavelu vsathishkuma...@gmail.com wrote: I have a simple JSON

Re: Trying to make sense of the actual executed code

2014-08-06 Thread Michael Armbrust
This is maybe not exactly what you are asking for, but you might consider looking at the queryExecution (a developer API that shows how the query is analyzed / executed) sql(...).queryExecution On Wed, Aug 6, 2014 at 3:55 PM, Tom thubregt...@gmail.com wrote: Hi, I am trying to look at for

Re: Spark SQL Thrift Server

2014-08-05 Thread Michael Armbrust
We are working on an overhaul of the docs before the 1.1 release. In the mean time try: CACHE TABLE tableName. On Tue, Aug 5, 2014 at 9:02 AM, John Omernik j...@omernik.com wrote: I gave things working on my cluster with the sparksql thrift server. (Thank you Yin Huai at Databricks!) That

Re: spark sql left join gives KryoException: Buffer overflow

2014-08-05 Thread Michael Armbrust
For outer joins I'd recommend upgrading to master or waiting for a 1.1 release candidate (which should be out this week). On Tue, Aug 5, 2014 at 7:38 AM, Dima Zhiyanov dimazhiya...@hotmail.com wrote: I am also experiencing this kryo buffer problem. My join is left outer with under 40mb on the

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Michael Armbrust
Is this on 1.0.1? I'd suggest running this on master or the 1.1-RC which should be coming out this week. Pyspark did not have good support for nested data previously. If you still encounter issues using a more recent version, please file a JIRA. Thanks! On Tue, Aug 5, 2014 at 11:55 AM, Brad

Re: [PySpark] [SQL] Going from RDD[dict] to SchemaRDD

2014-08-05 Thread Michael Armbrust
Maybe; I’m not sure just yet. Basically, I’m looking for something functionally equivalent to this: sqlContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x))) In other words, given an RDD of JSON-serializable Python dictionaries, I want to be able to infer a schema that is guaranteed to

Re: SQLCtx cacheTable

2014-08-04 Thread Michael Armbrust
when it knows the exact size of the data) There is a discussion here about trying to improve this: https://issues.apache.org/jira/browse/SPARK-2650 On Sun, Aug 3, 2014 at 11:35 PM, Gurvinder Singh gurvinder.si...@uninett.no wrote: On 08/03/2014 02:33 AM, Michael Armbrust wrote: I am

Re: Substring in Spark SQL

2014-08-04 Thread Michael Armbrust
Yeah, there will likely be a community preview build soon for the 1.1 release. Benchmarking that will both give you better performance and help QA the release. Bonus points if you turn on codegen for Spark SQL (experimental feature) when benchmarking and report bugs: SET spark.sql.codegen=true

Re: SQLCtx cacheTable

2014-08-02 Thread Michael Armbrust
for caching/storing. So I am wondering how the memory is handled in the cacheTable case. Does it reserve the memory storage and other parts run out of their memory. I also tried to change the spark.storage.memoryFraction but that did not help. - Gurvinder On 08/01/2014 08:42 AM, Michael Armbrust wrote

Re: Spark-sql with Tachyon cache

2014-08-02 Thread Michael Armbrust
We are investigating various ways to integrate with Tachyon. I'll note that you can already use saveAsParquetFile and parquetFile(...).registerAsTable(tableName) (soon to be registerTempTable in Spark 1.1) to store data into tachyon and query it with Spark SQL. On Fri, Aug 1, 2014 at 1:42 AM,
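A sketch of that workaround (Tachyon URI and names hypothetical):

    // Persist the data as Parquet inside Tachyon...
    events.saveAsParquetFile("tachyon://master:19998/events.parquet")
    // ...then map it back in and query it through Spark SQL.
    sqlContext.parquetFile("tachyon://master:19998/events.parquet")
      .registerAsTable("events")   // registerTempTable in Spark 1.1
    sqlContext.sql("SELECT count(*) FROM events").collect()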

Re: Spark SQL Query Plan optimization

2014-08-02 Thread Michael Armbrust
The number of partitions (which decides the number of tasks) is fixed after any shuffle and can be configured using 'spark.sql.shuffle.partitions' through SQLConf (i.e. sqlContext.set(...) or SET spark.sql.shuffle.partitions=... in sql) It is possible we will auto select this based on statistics
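Both forms mentioned above, sketched with a hypothetical partition count:

    sqlContext.set("spark.sql.shuffle.partitions", "50")  // via SQLConf, as above
    sql("SET spark.sql.shuffle.partitions=50")            // or inside SQL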

Re: Spark SQL, Parquet and Impala

2014-08-01 Thread Michael Armbrust
So is the only issue that Impala does not see changes until you refresh the table? This sounds like a configuration that needs to be changed on the Impala side. On Fri, Aug 1, 2014 at 7:20 AM, Patrick McGloin mcgloin.patr...@gmail.com wrote: Sorry, sent early, wasn't finished typing. CREATE

Re: SchemaRDD select expression

2014-07-31 Thread Michael Armbrust
The performance should be the same using the DSL or SQL strings. On Thu, Jul 31, 2014 at 2:36 PM, Buntu Dev buntu...@gmail.com wrote: I was not sure if registerAsTable() and then query against that table have additional performance impact and if DSL eliminates that. On Thu, Jul 31, 2014 at

Re: SQLCtx cacheTable

2014-07-31 Thread Michael Armbrust
cacheTable uses a special columnar caching technique that is optimized for SchemaRDDs. It is something similar to MEMORY_ONLY_SER but not quite. You can specify the persistence level on the SchemaRDD itself and register that as a temporary table, however it is likely you will not get as good
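The two options, sketched with a hypothetical table name and import sqlContext._:

    // Preferred: the columnar in-memory format, built for SchemaRDDs.
    sqlContext.cacheTable("people")

    // Alternative: choose a storage level yourself, trading away the
    // compact columnar representation.
    import org.apache.spark.storage.StorageLevel
    val srdd = sql("SELECT * FROM people").persist(StorageLevel.MEMORY_ONLY_SER)
    srdd.registerAsTable("people_cached")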

Re: Spark SQL JDBC Connectivity

2014-07-30 Thread Michael Armbrust
Very cool. Glad you found a solution that works. On Wed, Jul 30, 2014 at 1:04 PM, Venkat Subramanian vsubr...@gmail.com wrote: For the time being, we decided to take a different route. We created a Rest API layer in our app and allowed SQL query passing via the Rest. Internally we pass that

Re: HiveContext is creating metastore warehouse locally instead of in hdfs

2014-07-29 Thread Michael Armbrust
The warehouse and the metastore directories are two different things. The metastore holds the schema information about the tables and will by default be a local directory. With javax.jdo.option.ConnectionURL you can configure it to be something like mysql. The warehouse directory is the default

Re: SparkSQL extensions

2014-07-27 Thread Michael Armbrust
/adam/rdd/RegionJoin.scala I was thinking to provide an improved version of method partitionAndJoin from the ADAM implementation above On Sat, Jul 26, 2014 at 12:37 PM, Michael Armbrust mich...@databricks.com wrote: A very simple example of adding a new operator to Spark SQL: https

Re: Convert raw data files to Parquet format

2014-07-23 Thread Michael Armbrust
Take a look at the programming guide for spark sql: http://spark.apache.org/docs/latest/sql-programming-guide.html On Wed, Jul 23, 2014 at 11:09 AM, buntu buntu...@gmail.com wrote: I wanted to experiment with using Parquet data with SparkSQL. I got some tab-delimited files and wanted to know

Re: spark sql left join gives KryoException: Buffer overflow

2014-07-21 Thread Michael Armbrust
When SPARK-2211 is done, will spark sql automatically choose join algorithms? Is there some way to manually hint the optimizer? Ideally we will select the best algorithm for you. We are also considering ways to allow the user to hint.

Re: registerAsTable can't be compiled

2014-07-19 Thread Michael Armbrust
Can you provide the code? Is Record a case class? and is it defined as a top level object? Also have you done import sqlContext._? On Sat, Jul 19, 2014 at 3:39 AM, junius junius.z...@gmail.com wrote: Hello, I write code to practice Spark Sql based on latest Spark version. But I get

Re: incompatible local class serialVersionUID with spark Shark

2014-07-18 Thread Michael Armbrust
There is no version of shark that works with spark 1.0. More details about the path forward here: http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html On Jul 18, 2014 4:53 AM, Megane1994 leumenilari...@yahoo.fr wrote: Hello, I want to run

Re: TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: id#0 on GROUP BY

2014-07-18 Thread Michael Armbrust
Sorry for the non-obvious error message. It is not valid SQL to include attributes in the select clause unless they are also in the group by clause or are inside of an aggregate function. On Jul 18, 2014 5:12 AM, Martin Gammelsæter martingammelsae...@gmail.com wrote: Hi again! I am having

Re: Need help on Spark UDF (Join) Performance tuning .

2014-07-18 Thread Michael Armbrust
It's likely that, since your UDF is a black box to Hive's query optimizer, it must choose a less efficient join algorithm that passes all possible matches to your function for comparison. This will happen any time your UDF touches attributes from both sides of the join. In general you can
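One common rewrite, sketched with hypothetical UDFs my_udf and normalize: if the UDF can be computed per side, do so in subqueries and join on plain equality, so an equi-join algorithm can be used.

    // Slow: the UDF touches both sides, so every candidate pair is checked.
    hql("SELECT * FROM a CROSS JOIN b WHERE my_udf(a.key, b.key)")

    // Often faster: precompute a per-side value, then equi-join on it.
    hql("""
      SELECT * FROM
        (SELECT *, normalize(key) AS nkey FROM a) a2
      JOIN
        (SELECT *, normalize(key) AS nkey FROM b) b2
      ON (a2.nkey = b2.nkey)
    """)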

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-18 Thread Michael Armbrust
Can you tell us more about your environment? Specifically, are you also running on Mesos? On Jul 18, 2014 12:39 AM, Victor Sheng victorsheng...@gmail.com wrote: when I run a query to a hadoop file. mobile.registerAsTable("mobile") val count = sqlContext.sql("select count(1) from mobile") res5:

Re: Cannot connect to hive metastore

2014-07-18 Thread Michael Armbrust
See the section on advanced dependency management: http://spark.apache.org/docs/latest/submitting-applications.html On Jul 17, 2014 10:53 PM, linkpatrickliu linkpatrick...@live.com wrote: Seems like the mysql connector jar is not included in the classpath. Where can I set the jar to the

Re: can we insert and update with spark sql

2014-07-18 Thread Michael Armbrust
You can do insert into. As with other SQL on HDFS systems there is no updating of data. On Jul 17, 2014 1:26 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Is this what you are looking for? https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/sql/parquet/InsertIntoParquetTable.html

Re: spark sql left join gives KryoException: Buffer overflow

2014-07-18 Thread Michael Armbrust
Unfortunately, this is a query where we just don't have an efficient implementation yet. You might try switching the table order. Here is the JIRA for doing something more efficient: https://issues.apache.org/jira/browse/SPARK-2212 On Fri, Jul 18, 2014 at 7:05 AM, Pei-Lun Lee

Re: Simple record matching using Spark SQL

2014-07-17 Thread Michael Armbrust
$CLASSPATH $CONFIG_OPTS test.Test4 spark://master:7077 /usr/local/spark-1.0.1-bin-hadoop1 hdfs://master:54310/user/hduser/file1.csv hdfs://master:54310/user/hduser/file2.csv ~Sarath On Wed, Jul 16, 2014 at 8:14 PM, Michael Armbrust mich...@databricks.com wrote: What if you just run

Re: class after join

2014-07-17 Thread Michael Armbrust
If you intern the string it will be more efficient, but still significantly more expensive than the class based approach. ** VERY EXPERIMENTAL ** We are working with EPFL on a lightweight syntax for naming the results of spark transformations in scala (and are going to make it interoperate with

Re: Apache kafka + spark + Parquet

2014-07-17 Thread Michael Armbrust
We don't have support for partitioned parquet yet. There is a JIRA here: https://issues.apache.org/jira/browse/SPARK-2406 On Thu, Jul 17, 2014 at 5:00 PM, Tathagata Das tathagata.das1...@gmail.com wrote: val kafkaStream = KafkaUtils.createStream(... ) // see the example in my previous post

Re: Read all the columns from a file in spark sql

2014-07-16 Thread Michael Armbrust
I think what you might be looking for is the ability to programmatically specify the schema, which is coming in 1.1. Here's the JIRA: SPARK-2179 https://issues.apache.org/jira/browse/SPARK-2179 On Wed, Jul 16, 2014 at 8:24 AM, pandees waran pande...@gmail.com wrote: Hi, I am newbie to spark

Re: Ambiguous references to id : what does it mean ?

2014-07-16 Thread Michael Armbrust
Yes, but if both tagCollection and selectedVideos have a column named id then Spark SQL does not know which one you are referring to in the where clause. Here's an example with aliases: val x = testData2.as('x) val y = testData2.as('y) val join = x.join(y, Inner, Some("x.a".attr ===
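The example above is cut off; the complete pattern, mirroring Spark's own test suite (assumes the catalyst DSL implicits are in scope, e.g. via import sqlContext._), looks like:

    // Inner is org.apache.spark.sql.catalyst.plans.Inner
    val x = testData2.as('x)
    val y = testData2.as('y)
    // Qualifying each side resolves the ambiguity over column 'a'.
    val join = x.join(y, Inner, Some("x.a".attr === "y.a".attr))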

Re: Simple record matching using Spark SQL

2014-07-16 Thread Michael Armbrust
What if you just run something like: sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv").count() On Wed, Jul 16, 2014 at 10:37 AM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: Yes Soumya, I did it. First I tried with the example available in the documentation

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-16 Thread Michael Armbrust
the logical plan, it is executed in spark regardless of dialect although the execution might be different for the same query. Best Regards, Jerry On Tue, Jul 15, 2014 at 6:22 PM, Michael Armbrust mich...@databricks.com wrote: hql and sql are just two different dialects for interacting

Re: ClassNotFoundException: $line11.$read$ when loading an HDFS text file with SparkQL in spark-shell

2014-07-16 Thread Michael Armbrust
Note that running a simple map+reduce job on the same hdfs files with the same installation works fine: Did you call collect() on the totalLength? Otherwise nothing has actually executed.

Re: ClassNotFoundException: $line11.$read$ when loading an HDFS text file with SparkQL in spark-shell

2014-07-16 Thread Michael Armbrust
Oh, I'm sorry... reduce is also an operation On Wed, Jul 16, 2014 at 3:37 PM, Michael Armbrust mich...@databricks.com wrote: Note that runnning a simple map+reduce job on the same hdfs files with the same installation works fine: Did you call collect() on the totalLength? Otherwise
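The underlying distinction, sketched with a hypothetical path:

    // Transformations are lazy: they only record lineage.
    val lengths = sc.textFile("hdfs://host:9000/data.txt").map(_.length)
    // Actions such as reduce, collect, and count actually run the job.
    val totalLength = lengths.reduce(_ + _)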

Re: ClassNotFoundException: $line11.$read$ when loading an HDFS text file with SparkQL in spark-shell

2014-07-16 Thread Michael Armbrust
Hmmm, it could be some weirdness with classloaders / Mesos / spark sql? I'm curious if you would hit an error if there were no lambda functions involved. Perhaps if you load the data using jsonFile or parquetFile. Either way, I'd file a JIRA. Thanks! On Jul 16, 2014 6:48 PM, Svend

Re: Release date for new pyspark

2014-07-16 Thread Michael Armbrust
You should try cleaning and then building. We have recently hit a bug in the scala compiler that sometimes causes non-clean builds to fail. On Wed, Jul 16, 2014 at 7:56 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yeah, we try to have a regular 3 month release cycle; see

Re: Spark SQL throws ClassCastException on first try; works on second

2014-07-15 Thread Michael Armbrust
You might be hitting SPARK-1994 https://issues.apache.org/jira/browse/SPARK-1994, which is fixed in 1.0.1. On Mon, Jul 14, 2014 at 11:16 PM, Nick Chammas nicholas.cham...@gmail.com wrote: I’m running this query against RDD[Tweet], where Tweet is a simple case class with 4 fields.

Re: Nested Query With Spark SQL(1.0.1)

2014-07-15 Thread Michael Armbrust
In general this should be supported using [] to access array data and . to access nested fields. Is there something you are trying that isn't working? On Mon, Jul 14, 2014 at 11:25 PM, anyweil wei...@gmail.com wrote: I mean the query on the nested data such as JSON, not the nested query,

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Michael Armbrust
Sorry for the trouble. There are two issues here: - Parsing of repeated nested fields (i.e. something[0].field) is not supported in the plain SQL parser. SPARK-2096 https://issues.apache.org/jira/browse/SPARK-2096 - Resolution is broken in the HiveQL parser. SPARK-2483

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-15 Thread Michael Armbrust
https://issues.apache.org/jira/browse/SPARK-2446? 2014-07-15 3:54 GMT+08:00 Michael Armbrust mich...@databricks.com: This is not supported yet, but there is a PR open to fix it: https://issues.apache.org/jira/browse/SPARK-2446 On Mon, Jul 14, 2014 at 4:17 AM, Pei-Lun Lee pl...@appier.com wrote

Re: Store one-to-many relationship in parquet file with spark sql

2014-07-15 Thread Michael Armbrust
Make the Array a Seq. On Tue, Jul 15, 2014 at 7:12 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, How should I store a one-to-many relationship using spark sql and parquet format. For example, I have the following case class case class Person(key: String, name: String, friends:
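i.e., a sketch of the fix, with hypothetical data (requires import sqlContext._ for the implicit conversion to SchemaRDD):

    // Parquet support handles Seq-typed fields where Array is not accepted.
    case class Person(key: String, name: String, friends: Seq[String])

    val people = sc.parallelize(Seq(
      Person("k1", "Alice", Seq("Bob", "Carol"))))
    people.saveAsParquetFile("people.parquet")   // hypothetical path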

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Michael Armbrust
Are you registering multiple RDDs of case classes as tables concurrently? You are possibly hitting SPARK-2178 https://issues.apache.org/jira/browse/SPARK-2178 which is caused by SI-6240 https://issues.scala-lang.org/browse/SI-6240. On Tue, Jul 15, 2014 at 10:49 AM, Keith Simmons

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Michael Armbrust
, Jul 15, 2014 at 11:14 AM, Michael Armbrust mich...@databricks.com wrote: Are you registering multiple RDDs of case classes as tables concurrently? You are possibly hitting SPARK-2178 which is caused by SI-6240. On Tue, Jul 15, 2014 at 10:49 AM, Keith Simmons keith.simm...@gmail.com

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Michael Armbrust
powerful SQL support borrowed from Hive. Can you shed some lights on this when you get a minute? Thanks, Jerry On Tue, Jul 15, 2014 at 4:32 PM, Michael Armbrust mich...@databricks.com wrote: No, that is why I included the link to SPARK-2096 https://issues.apache.org/jira/browse/SPARK

Re: Supported SQL syntax in Spark SQL

2014-07-14 Thread Michael Armbrust
You can find the parser here: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala In general the hive parser provided by HQL is much more complete at the moment. Long term we will likely stop using parser combinators and either

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-14 Thread Michael Armbrust
This is not supported yet, but there is a PR open to fix it: https://issues.apache.org/jira/browse/SPARK-2446 On Mon, Jul 14, 2014 at 4:17 AM, Pei-Lun Lee pl...@appier.com wrote: Hi, I am using spark-sql 1.0.1 to load parquet files generated from method described in:

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Michael Armbrust
Yeah, sadly this dependency was introduced when someone consolidated the logging infrastructure. However, the dependency should be very small and thus easy to remove, and I would like catalyst to be usable outside of Spark. A pull request to make this possible would be welcome. Ideally, we'd

Re: Nested Query With Spark SQL(1.0.1)

2014-07-14 Thread Michael Armbrust
What sort of nested query are you talking about? Right now we only support nested queries in the FROM clause. I'd like to add support for other cases in the future. On Sun, Jul 13, 2014 at 4:11 AM, anyweil wei...@gmail.com wrote: Or is it supported? I know I could doing it myself with

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-14 Thread Michael Armbrust
Handling of complex types is somewhat limited in SQL at the moment. It'll be more complete if you use HiveQL. That said, the problem here is you are calling .name on an array. You need to pick an item from the array (using [..]) or use something like a lateral view explode. On Sat, Jul 12,

Change when loading/storing String data using Parquet

2014-07-14 Thread Michael Armbrust
I just wanted to send out a quick note about a change in the handling of strings when loading / storing data using parquet and Spark SQL. Before, Spark SQL did not support binary data in Parquet, so all binary blobs were implicitly treated as Strings. 9fe693

Re: jsonRDD: NoSuchMethodError

2014-07-14 Thread Michael Armbrust
Have you upgraded the cluster where you are running this to 1.0.1 as well? A NoSuchMethodError almost always means that the class files available at runtime are different from those that were there when you compiled your program. On Mon, Jul 14, 2014 at 7:06 PM, SK skrishna...@gmail.com wrote:

Re: SparkSql newbie problems with nested selects

2014-07-13 Thread Michael Armbrust
Hi Andy, The SQL parser is pretty basic (we plan to improve this for the 1.2 release). In this case I think part of the problem is that one of your variables is count, which is a reserved word. Unfortunately, we don't have the ability to escape identifiers at this point. However, I did manage

Re: Supported SQL syntax in Spark SQL

2014-07-13 Thread Michael Armbrust
Are you sure the code running on the cluster has been updated? We recently optimized the execution of LIKE queries that can be evaluated without using full regular expressions. So it's possible this error is due to missing functionality on the executors. How can I trace this down for a bug

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Michael Armbrust
On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam chiling...@gmail.com wrote: For the curious mind, the dataset is about 200-300GB and we are using 10 machines for this benchmark. Given the env is equal between the two experiments, why is pure Spark faster than Spark SQL? There is going to be some

Re: Potential bugs in SparkSQL

2014-07-10 Thread Michael Armbrust
Hi Jerry, Thanks for reporting this. It would be helpful if you could provide the output of the following command: println(hql("select s.id from m join s on (s.id=m_id)").queryExecution) Michael On Thu, Jul 10, 2014 at 8:15 AM, Jerry Lam chiling...@gmail.com wrote: Hi Spark developers, I

Re: SparkSQL - Language Integrated query - OR clause and IN clause

2014-07-10 Thread Michael Armbrust
I'll add that the SQL parser is very limited right now, and that you'll get much wider coverage using hql inside of HiveContext. We are working on bringing sql() much closer to SQL-92 though in the future. On Thu, Jul 10, 2014 at 7:28 AM, premdass premdas...@yahoo.co.in wrote: Thanks Takuya .

Re: EC2 Cluster script. Shark install fails

2014-07-10 Thread Michael Armbrust
There is no version of Shark that is compatible with Spark 1.0, however, Spark SQL does come included automatically. More information here: http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Michael Armbrust
SerDes overhead, then there must be something additional that SparkSQL adds to the overall overheads that Hive doesn't have. Best Regards, Jerry On Thu, Jul 10, 2014 at 7:11 PM, Michael Armbrust mich...@databricks.com wrote: On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam chiling

Re: Potential bugs in SparkSQL

2014-07-10 Thread Michael Armbrust
, Jerry On Thu, Jul 10, 2014 at 7:16 PM, Michael Armbrust mich...@databricks.com wrote: Hi Jerry, Thanks for reporting this. It would be helpful if you could provide the output of the following command: println(hql("select s.id from m join s on (s.id=m_id)").queryExecution) Michael

Re: Spark SQL - java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$

2014-07-09 Thread Michael Armbrust
At first glance that looks like an error with the class shipping in the spark shell. (i.e. the lines that you type into the spark shell are compiled into classes and then shipped to the executors where they run). Are you able to run other spark examples with closures in the same shell? Michael

Re: Spark SQL registerAsTable requires a Java Class?

2014-07-08 Thread Michael Armbrust
This is on the roadmap for the next release (1.1) JIRA: SPARK-2179 https://issues.apache.org/jira/browse/SPARK-2179 On Mon, Jul 7, 2014 at 11:48 PM, Ionized ioni...@gmail.com wrote: The Java API requires a Java Class to register as table. // Apply a schema to an RDD of JavaBeans and

Re: Spark SQL registerAsTable requires a Java Class?

2014-07-08 Thread Michael Armbrust
you have an estimate on when some will be available?) On Tue, Jul 8, 2014 at 12:24 AM, Michael Armbrust mich...@databricks.com wrote: This is on the roadmap for the next release (1.1) JIRA: SPARK-2179 https://issues.apache.org/jira/browse/SPARK-2179 On Mon, Jul 7, 2014 at 11:48 PM

Re: [Spark SQL]: Convert SchemaRDD back to RDD

2014-07-08 Thread Michael Armbrust
On Tue, Jul 8, 2014 at 12:43 PM, Pierre B pierre.borckm...@realimpactanalytics.com wrote: 1/ Is there a way to convert a SchemaRDD (for instance loaded from a parquet file) back to a RDD of a given case class? There may be someday, but doing so will either require a lot of reflection or a

Re: SparkSQL with sequence file RDDs

2014-07-07 Thread Michael Armbrust
I haven't heard any reports of this yet, but I don't see any reason why it wouldn't work. You'll need to manually convert the objects that come out of the sequence file into something where SparkSQL can detect the schema (i.e. scala case classes or java beans) before you can register the RDD as a
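A sketch of the conversion described, with hypothetical Writable types and path:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.SparkContext._   // Writable converters

    case class Event(id: Long, payload: String)

    // Turn raw sequence-file records into case classes so Spark SQL can
    // infer a schema, then register the result as a table.
    val events = sc.sequenceFile[LongWritable, Text]("hdfs://host:9000/events.seq")
      .map { case (k, v) => Event(k.get, v.toString) }
    events.registerAsTable("events")   // requires import sqlContext._
    sql("SELECT count(*) FROM events").collect()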
