Re: Reading nested JSON data with Spark SQL

2014-11-19 Thread Michael Armbrust
You can extract the nested fields in sql: SELECT field.nestedField ... If you don't do that then nested fields are represented as rows within rows and can be retrieved as follows: t.getAs[Row](0).getInt(0) Also, I would write t.getAs[Buffer[CharSequence]](12) as t.getAs[Seq[String]](12) since
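For illustration, a minimal sketch of both approaches (hypothetical table and column names; assumes an existing SparkContext `sc`):

```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Extract the nested field directly in SQL:
val zips = sqlContext.sql("SELECT address.zip FROM people")

// Or unpack by hand: nested structs come back as Rows within Rows.
val zips2 = sqlContext.sql("SELECT address FROM people")
  .map(t => t.getAs[org.apache.spark.sql.Row](0).getInt(0))
```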

Re: NEW to spark and sparksql

2014-11-19 Thread Michael Armbrust
RDD before performing analytics on it. Thank you for your time and help on this. P.S. I am using python if that makes a difference. On Wed, Nov 19, 2014 at 4:45 PM, Michael Armbrust mich...@databricks.com wrote: In general you should be able to read full directories of files as a single

Re: PairRDDFunctions with Tuple2 subclasses

2014-11-19 Thread Michael Armbrust
I think you should also be able to get away with casting it back and forth in this case using .asInstanceOf. On Wed, Nov 19, 2014 at 4:39 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: I have a class which is a subclass of Tuple2, and I want to use it with PairRDDFunctions. However, I
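A sketch of the suggested cast, with a hypothetical `MyPair` subclass standing in for the poster's class:

```scala
import org.apache.spark.SparkContext._   // pulls in the PairRDDFunctions implicits
import org.apache.spark.rdd.RDD

// Hypothetical Tuple2 subclass, as in the question:
class MyPair(k: String, v: Int) extends Tuple2[String, Int](k, v)

def totals(pairs: RDD[MyPair]): RDD[(String, Int)] =
  // Cast to the plain tuple type so the pair implicits apply; the reverse
  // cast works the same way if the subclass type is needed afterwards.
  pairs.asInstanceOf[RDD[(String, Int)]].reduceByKey(_ + _)
```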

Re: SparkSQL exception on spark.sql.codegen

2014-11-18 Thread Michael Armbrust
) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) On Tue, Nov 18, 2014 at 11:41 AM, Michael Armbrust mich...@databricks.com wrote: Interesting, I believe we have

Re: independent user sessions with a multi-user spark sql thriftserver (Spark 1.1)

2014-11-17 Thread Michael Armbrust
This is an unfortunate/known issue that we are hoping to address in the next release: https://issues.apache.org/jira/browse/SPARK-2087 I'm not sure how straightforward a fix would be, but it would involve keeping / setting the SessionState for each connection to the server. It would be great if

Re: Exception in spark sql when running a group by query

2014-11-17 Thread Michael Armbrust
You are perhaps hitting an issue that was fixed by #3248 https://github.com/apache/spark/pull/3248? On Mon, Nov 17, 2014 at 9:58 AM, Sadhan Sood sadhan.s...@gmail.com wrote: While testing sparkSQL, we were running this group by with expression query and got an exception. The same query worked

Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Michael Armbrust
What version of Spark SQL? On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen zhpeng...@gmail.com wrote: Hi all, We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, we got exceptions as below, has anyone else saw these before? java.lang.ExceptionInInitializerError at

Re: How do I turn off Parquet logging in a worker?

2014-11-14 Thread Michael Armbrust
Anyone want a PR? Yes please.

Re: Sourcing data from RedShift

2014-11-14 Thread Michael Armbrust
I'd guess that it's an s3n://key:secret_key@bucket/path from the UNLOAD command used to produce the data. Xiangrui can correct me if I'm wrong though. On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf malouf.g...@gmail.com wrote: We have a bunch of data in RedShift tables that we'd like to pull in

Re: filtering a SchemaRDD

2014-11-14 Thread Michael Armbrust
If I use row[6] instead of row[text] I get what I am looking for. However, finding the right numeric index could be a pain. Can I access the fields in a Row of a SchemaRDD by name, so that I can map, filter, etc. without a trial and error process of finding the right int for the
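A sketch of resolving the index from the schema instead of by trial and error (assumes a SchemaRDD `schemaRDD` with a string column named text):

```scala
// Look the index up once from the schema, then use it in ordinary RDD code:
val textIdx = schemaRDD.schema.fields.indexWhere(_.name == "text")
schemaRDD.filter(row => row.getString(textIdx).contains("spark"))
```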

Re: Spark SQL Lazy Schema Evaluation

2014-11-12 Thread Michael Armbrust
There are a few things you can do here: - Infer the schema on a subset of the data, pass that inferred schema (schemaRDD.schema) as the second argument of jsonRDD. - Hand construct a schema and pass it as the second argument including the fields you are interested in. - Instead load the data
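A sketch of the first two suggestions, assuming `json` is an RDD[String] of JSON documents:

```scala
import org.apache.spark.sql._

// 1. Infer the schema from a small sample, then apply it to the full data:
val inferred = sqlContext.jsonRDD(json.sample(false, 0.01)).schema
val full = sqlContext.jsonRDD(json, inferred)

// 2. Or hand-construct a schema containing only the fields of interest:
val manual = StructType(StructField("name", StringType, nullable = true) :: Nil)
val projected = sqlContext.jsonRDD(json, manual)
```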

Re: scala.MatchError

2014-11-11 Thread Michael Armbrust
Xiangrui is correct that it must be a Java bean; also, nested classes are not yet supported in Java. On Tue, Nov 11, 2014 at 10:11 AM, Xiangrui Meng men...@gmail.com wrote: I think you need a Java bean class instead of a normal class. See example here:

Re: Mapping SchemaRDD/Row to JSON

2014-11-10 Thread Michael Armbrust
There is a JIRA for adding this: https://issues.apache.org/jira/browse/SPARK-4228 Your described approach sounds reasonable. On Mon, Nov 10, 2014 at 5:10 PM, Tobias Pfeiffer t...@preferred.jp wrote: Akshat On Tue, Nov 11, 2014 at 4:12 AM, Akshat Aranya aara...@gmail.com wrote: Does there

Re: Dynamically InferSchema From Hive and Create parquet file

2014-11-07 Thread Michael Armbrust
, November 06, 2014 12:28 PM To: Michael Armbrust Cc: u...@spark.incubator.apache.org Subject: RE: Dynamically InferSchema From Hive and Create parquet file When I create Hive table with Parquet format, it does not create any metadata until data in inserted. So data needs to be there before I infer

Re: loading, querying schemaRDD using SparkSQL

2014-11-06 Thread Michael Armbrust
It can, but currently that method uses the default hive serde which is not very robust (does not deal well with \n in strings) and probably is not super fast. You'll also need to be using a HiveContext for it to work. On Tue, Nov 4, 2014 at 8:20 PM, vdiwakar.malladi vdiwakar.mall...@gmail.com

Re: Dynamically InferSchema From Hive and Create parquet file

2014-11-05 Thread Michael Armbrust
That method is for creating a new directory to hold parquet data when there is no hive metastore available, thus you have to specify the schema. If you've already created the table in the metastore you can just query it using the sql method: javaHiveContext.sql("SELECT * FROM parquetTable"); You

Re: SparkSQL - No support for subqueries in 1.2-snapshot?

2014-11-04 Thread Michael Armbrust
This is not supported yet. It would be great if you could open a JIRA (though I think apache JIRA is down ATM). On Tue, Nov 4, 2014 at 9:40 AM, Terry Siu terry@smartfocus.com wrote: I’m trying to execute a subquery inside an IN clause and am encountering an unsupported language feature

Re: loading, querying schemaRDD using SparkSQL

2014-11-04 Thread Michael Armbrust
Temporary tables are local to the context that creates them (just like RDDs). I'd recommend saving the data out as Parquet to share it between contexts. On Tue, Nov 4, 2014 at 3:18 AM, vdiwakar.malladi vdiwakar.mall...@gmail.com wrote: Hi, There is a need in my application to query the
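A sketch of that handoff (hypothetical SchemaRDD `results` and path):

```scala
// In the context that produced the data:
results.saveAsParquetFile("/shared/results.parquet")

// In any other context, even in a different application:
sqlContext.parquetFile("/shared/results.parquet").registerTempTable("results")
```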

Re: StructField of StructType

2014-11-04 Thread Michael Armbrust
Structs are Rows nested in other rows. This might also be helpful: http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema On Tue, Nov 4, 2014 at 12:21 PM, tridib tridib.sama...@live.com wrote: How do I create a StructField of StructType? I need to
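A sketch of nesting one struct inside another (hypothetical field names):

```scala
import org.apache.spark.sql._

val address = StructType(
  StructField("street", StringType, nullable = true) ::
  StructField("city", StringType, nullable = true) :: Nil)

// A StructField whose dataType is itself a StructType:
val schema = StructType(StructField("address", address, nullable = true) :: Nil)
```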

Re: Using SQL statements vs. SchemaRDD methods

2014-11-04 Thread Michael Armbrust
They both compile down to the same logical plans so the performance of running the query should be the same. The Scala DSL uses a lot of Scala magic and thus is experimental, whereas HiveQL is pretty set in stone. On Tue, Nov 4, 2014 at 5:22 PM, SK skrishna...@gmail.com wrote: SchemaRDD
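For example, these two queries should optimize to the same plan (a sketch; assumes a registered table `people` that is also bound to a SchemaRDD of the same name):

```scala
import sqlContext._   // DSL implicits

val viaSql = sqlContext.sql("SELECT name FROM people WHERE age > 21")
val viaDsl = people.where('age > 21).select('name)
```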

Re: ParquetFilters and StringType support for GT, GTE, LT, LTE

2014-11-03 Thread Michael Armbrust
That sounds like a regression. Could you open a JIRA with steps to reproduce (https://issues.apache.org/jira/browse/SPARK)? We'll want to fix this before the 1.2 release. On Mon, Nov 3, 2014 at 11:04 AM, Terry Siu terry@smartfocus.com wrote: Is there any reason why StringType is not a

Re: SQL COUNT DISTINCT

2014-11-03 Thread Michael Armbrust
On Mon, Nov 3, 2014 at 12:45 AM, Bojan Kostic blood9ra...@gmail.com wrote: But will this improvement also affect when you want to count distinct on 2 or more fields: SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) FROM parquetFile Unfortunately I think this

Re: NoClassDefFoundError encountered in Spark 1.2-snapshot build with hive-0.13.1 profile

2014-11-03 Thread Michael Armbrust
It is merged! On Mon, Nov 3, 2014 at 12:06 PM, Terry Siu terry@smartfocus.com wrote: Thanks, Kousuke. I’ll wait till this pull request makes it into the master branch. -Terry From: Kousuke Saruta saru...@oss.nttdata.co.jp Date: Monday, November 3, 2014 at 11:11 AM To: Terry Siu

Re: Creating a SchemaRDD from RDD of thrift classes

2014-10-30 Thread Michael Armbrust
That should be possible, although I'm not super familiar with thrift. You'll probably need access to the generated metadata http://people.apache.org/~thejas/thrift-0.9/javadoc/org/apache/thrift/meta_data/package-frame.html . Shameless plug: If you find yourself reading a lot of thrift data you

Re: SparkSQL + Hive Cached Table Exception

2014-10-30 Thread Michael Armbrust
Hmmm, this looks like a bug. Can you file a JIRA? On Thu, Oct 30, 2014 at 4:04 PM, Jean-Pascal Billaud j...@tellapart.com wrote: Hi, While testing SparkSQL on top of our Hive metastore, I am getting some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD table.

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-29 Thread Michael Armbrust
LATERAL VIEW explode(locations) l AS location JOIN locationNames ln ON location.number = ln.streetNumber WHERE location.number = '2300').collect() On Tue, Oct 28, 2014 at 10:19 PM, Michael Armbrust mich...@databricks.com wrote: On Tue, Oct 28, 2014 at 6:56 PM, Corey Nolet cjno...@gmail.com

Re: Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-28 Thread Michael Armbrust
DISTRIBUTE BY only promises that data will be collocated, but does not create a partition for each value. You are probably looking for Dynamic Partitions https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions, which was recently merged into HiveContext. On Tue, Oct 28, 2014 at 11:49

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Michael Armbrust
Try: address.city.attr On Tue, Oct 28, 2014 at 8:30 AM, Brett Antonides banto...@gmail.com wrote: Hello, Given the following example customers.json file: { name: Sherlock Holmes, customerNumber: 12345, address: { street: 221b Baker Street, city: London, zipcode: NW1 6XE, country:

Re: Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-28 Thread Michael Armbrust
This feature is not in 1.1 and is not going to promise one file per unique value of the data. The only way to do that would be to write your own partitioner http://stackoverflow.com/questions/23127329/how-to-define-custom-partitioner-for-spark-rdds-of-equally-sized-partition-where . On Tue, Oct

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Michael Armbrust
On Tue, Oct 28, 2014 at 2:19 PM, Corey Nolet cjno...@gmail.com wrote: Is it possible to select if, say, there was an addresses field that had a json array? You can get the Nth item by address.getItem(0). If you want to walk through the whole array look at LATERAL VIEW EXPLODE in HiveQL
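A sketch of the getItem form (hypothetical SchemaRDD and field name; assumes the context's implicits are imported):

```scala
// First element of the addresses array via the language-integrated DSL:
people.select('addresses.getItem(0))
```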

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Michael Armbrust
: [{ street:Rodeo Dr, number:2300 }]} And query all people who have a location with number = 2300? On Tue, Oct 28, 2014 at 5:30 PM, Michael Armbrust mich...@databricks.com wrote: On Tue, Oct 28, 2014 at 2:19 PM, Corey Nolet cjno...@gmail.com wrote: Is it possible to select if, say

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Michael Armbrust
On Tue, Oct 28, 2014 at 6:56 PM, Corey Nolet cjno...@gmail.com wrote: Am I able to do a join on an exploded field? Like if I have another object: { streetNumber:2300, locationName:The Big Building} and I want to join with the previous json by the locations[].number field- is that possible?

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Michael Armbrust
JOIN locationNames ln ON location.number = ln.streetNumber WHERE location.number = '2300').collect() On Tue, Oct 28, 2014 at 10:19 PM, Michael Armbrust mich...@databricks.com wrote: On Tue, Oct 28, 2014 at 6:56 PM, Corey Nolet cjno...@gmail.com wrote: Am I able to do a join

Re: Is Spark 1.1.0 incompatible with Hive?

2014-10-27 Thread Michael Armbrust
A NoSuchMethodError almost always means you are mixing different versions of the same library on the classpath. In this case it looks like you have more than one version of Guava. Have you added anything to the classpath? On Mon, Oct 27, 2014 at 8:36 AM, nitinkak001 nitinkak...@gmail.com

Re: Spark as Relational Database

2014-10-27 Thread Michael Armbrust
I'd suggest checking out the Spark SQL programming guide to answer this type of query: http://spark.apache.org/docs/latest/sql-programming-guide.html You could also perform it using the raw Spark RDD API http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.rdd.RDD, but it's

Re: Is Spark 1.1.0 incompatible with Hive?

2014-10-27 Thread Michael Armbrust
:331)* *at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)* *at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)* *Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf* On Mon, Oct 27, 2014 at 1:57 PM, Michael Armbrust mich

Re: Spark to eliminate full-table scan latency

2014-10-27 Thread Michael Armbrust
You can access cached data in spark through the JDBC server: http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server On Mon, Oct 27, 2014 at 1:47 PM, Ron Ayoub ronalday...@live.com wrote: We have a table containing 25 features per item id along with

Re: Subquery in having-clause (Spark 1.1.0)

2014-10-27 Thread Michael Armbrust
Yeah, sorry for being unclear. Subquery expressions are not supported. That particular error was coming from the Hive parser. On Mon, Oct 27, 2014 at 4:03 PM, Daniel Klinger d...@web-computing.de wrote: So it dosen't matter which dialect im using? Caus i set spark.sql.dialect to sql. --

Re: How do you use the thrift-server to get data from a Spark program?

2014-10-26 Thread Michael Armbrust
This is very experimental and mostly unsupported, but you can start the JDBC server from within your own programs https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L45 by passing it the HiveContext. On
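A sketch of the entry point linked above (startWithContext at the time of writing; experimental, so the signature may change):

```scala
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)
hiveContext.sql("CACHE TABLE events")           // hypothetical cached state to expose
HiveThriftServer2.startWithContext(hiveContext)
```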

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Michael Armbrust
It does have support for caching using either CACHE TABLE tablename or CACHE TABLE tablename AS SELECT On Fri, Oct 24, 2014 at 1:05 AM, ankits ankitso...@gmail.com wrote: I want to set up spark SQL to allow ad hoc querying over the last X days of processed data, where the data is
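A sketch of both forms (hypothetical table names):

```scala
sqlContext.sql("CACHE TABLE events")
sqlContext.sql("CACHE TABLE recent AS SELECT * FROM events WHERE day >= '2014-10-17'")
```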

Re: [Spark SQL] Setting variables

2014-10-24 Thread Michael Armbrust
You might be hitting: https://issues.apache.org/jira/browse/SPARK-4037 On Fri, Oct 24, 2014 at 11:32 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi all, I'm trying to set a pool for a JDBC session. I'm connecting to the thrift server via JDBC client. My installation appears to be

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Michael Armbrust
tables ? On Fri, Oct 24, 2014 at 2:35 PM, Michael Armbrust mich...@databricks.com wrote: It does have support for caching using either CACHE TABLE tablename or CACHE TABLE tablename AS SELECT On Fri, Oct 24, 2014 at 1:05 AM, ankits ankitso...@gmail.com wrote: I want to set up spark SQL

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Michael Armbrust
This is very experimental and mostly unsupported, but you can start the JDBC server from within your own programs https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L45 by passing it the HiveContext. On

Re: Spark: Order by Failed, java.lang.NullPointerException

2014-10-24 Thread Michael Armbrust
Usually when the SparkContext throws an NPE it means that it has been shut down due to some earlier failure. On Wed, Oct 22, 2014 at 5:29 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I got java.lang.NullPointerException. Please help! sqlContext.sql(select l_orderkey,

Re: Spark 1.1.0 and Hive 0.12.0 Compatibility Issue

2014-10-23 Thread Michael Armbrust
Can you show the DDL for the table? It looks like the SerDe might be saying it will produce a decimal type but is actually producing a string. On Thu, Oct 23, 2014 at 3:17 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi My Spark is 1.1.0 and Hive is 0.12, I tried to run the

Re: Does SQLSpark support Hive built in functions?

2014-10-22 Thread Michael Armbrust
Yes, when using a HiveContext. On Wed, Oct 22, 2014 at 2:20 PM, shahab shahab.mok...@gmail.com wrote: Hi, I just wonder if SparkSQL supports Hive built-in functions (e.g. from_unixtime) or any of the functions pointed out here : ( https://cwiki.apache.org/confluence/display/Hive/Tutorial)

Re: Sharing spark context across multiple spark sql cli initializations

2014-10-22 Thread Michael Armbrust
The JDBC server is what you are looking for: http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server On Wed, Oct 22, 2014 at 11:10 AM, Sadhan Sood sadhan.s...@gmail.com wrote: We want to run multiple instances of spark sql cli on our yarn cluster. Each

Re: [SQL] Is RANK function supposed to work in SparkSQL 1.1.0?

2014-10-21 Thread Michael Armbrust
No, analytic and window functions do not work yet. On Tue, Oct 21, 2014 at 3:00 AM, Pierre B pierre.borckm...@realimpactanalytics.com wrote: Hi! The RANK function is available in hive since version 0.11. When trying to use it in SparkSQL, I'm getting the following exception (full

Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread Michael Armbrust
Hmm... I thought HiveContext will only work if Hive is present. I am curious to know when to use HiveContext and when to use SQLContext. http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started TLDR; Always use HiveContext if your application does not have a dependency
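In short (a sketch per the linked guide):

```scala
import org.apache.spark.sql.hive.HiveContext

// HiveContext extends SQLContext, so existing SQLContext code keeps working;
// an actual Hive installation is only needed when Hive tables are queried.
val sqlContext = new HiveContext(sc)
```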

Re: SchemaRDD.where clause error

2014-10-21 Thread Michael Armbrust
You need to import sqlCtx._ to get access to the implicit conversion. On Tue, Oct 21, 2014 at 2:40 PM, Kevin Paul kevinpaulap...@gmail.com wrote: Hi all, I tried to use the function SchemaRDD.where() but got some error: val people = sqlCtx.sql(select * from people) people.where('age ===
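A sketch of the fix applied to the snippet above:

```scala
val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx._   // brings the Symbol-to-attribute implicits into scope

val people = sqlCtx.sql("select * from people")
people.where('age === 30).collect()
```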

Re: SparkSQL - TreeNodeException for unresolved attributes

2014-10-20 Thread Michael Armbrust
Have you tried this on master? There were several problems with resolution of complex queries that were registered as tables in the 1.1.0 release. On Mon, Oct 20, 2014 at 10:33 AM, Terry Siu terry@smartfocus.com wrote: Hi all, I’m getting a TreeNodeException for unresolved attributes

Re: spark sql: timestamp in json - fails

2014-10-20 Thread Michael Armbrust
I think you are running into a bug that will be fixed by this PR: https://github.com/apache/spark/pull/2850 On Mon, Oct 20, 2014 at 4:34 PM, tridib tridib.sama...@live.com wrote: Hello Experts, After repeated attempt I am unable to run query on map json date string. I tried two approaches:

Re: Help required on exercise Data Exploratin using Spark SQL

2014-10-17 Thread Michael Armbrust
Looks like this data was encoded with an old version of Spark SQL. You'll need to set the flag to interpret binary data as a string. More info on configuration can be found here: http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration sqlContext.sql(set
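The truncated command is presumably the flag from that configuration section; a sketch:

```scala
sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")
```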

Re: How to write data into Hive partitioned Parquet table?

2014-10-16 Thread Michael Armbrust
Support for dynamic partitioning is available in master and will be part of Spark 1.2 On Thu, Oct 16, 2014 at 1:08 AM, Banias H banias4sp...@gmail.com wrote: I got tipped by an expert that the error of Unsupported language features in query that I had was due to the fact that SparkSQL does not
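A sketch of what that enables, in HiveQL (hypothetical tables):

```scala
// Hive requires these settings before a fully dynamic-partition insert:
hiveContext.sql("SET hive.exec.dynamic.partition=true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
hiveContext.sql(
  "INSERT OVERWRITE TABLE events PARTITION (day) SELECT name, value, day FROM staging")
```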

Re: SparkSQL: StringType for numeric comparison

2014-10-14 Thread Michael Armbrust
It's much more efficient to store and compute on numeric types than string types. On Tue, Oct 14, 2014 at 1:25 AM, invkrh inv...@gmail.com wrote: Thank you, Michael. In Spark SQL DataType, we have a lot of types, for example, ByteType, ShortType, StringType, etc. These types are used to

Re: Spark SQL HiveContext Projection Pushdown

2014-10-13 Thread Michael Armbrust
Is there any plan to support windowing queries? I know that Shark supported it in its last release and expected it to be already included. Someone from Red Hat is working on this. Unclear if it will make the 1.2 release.

Re: Spark SQL - custom aggregation function (UDAF)

2014-10-13 Thread Michael Armbrust
It's not on the roadmap for 1.2. I'd suggest opening a JIRA. On Mon, Oct 13, 2014 at 4:28 AM, Pierre B pierre.borckm...@realimpactanalytics.com wrote: Is it planned in a near future ? -- View this message in context:

Re: SparkSQL: StringType for numeric comparison

2014-10-13 Thread Michael Armbrust
This conversion is done implicitly anytime you use a string column in an operation with a numeric column. If you run explain on your query you should see the cast that is inserted. This is intentional and based on the type semantics of Apache Hive. On Mon, Oct 13, 2014 at 9:03 AM, invkrh

Re: persist table schema in spark-sql

2014-10-13 Thread Michael Armbrust
If you are running version 1.1 you can create external parquet tables. I'd recommend setting spark.sql.hive.convertMetastoreParquet=true. Here's a helper function to do it automatically: /** * Sugar for creating a Hive external table from a parquet path. */ def createParquetTable(name:
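The helper itself is truncated above; a hypothetical reconstruction (not the original: the caller supplies the column list, and the serde/format class names depend on the Hive version bundled with your build):

```scala
def createParquetTable(name: String, columns: String, path: String): Unit = {
  hiveContext.sql(
    s"""CREATE EXTERNAL TABLE IF NOT EXISTS $name ($columns)
       |ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
       |STORED AS
       |  INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
       |  OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
       |LOCATION '$path'""".stripMargin)
  // per the recommendation above:
  hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
}

// e.g. createParquetTable("logs", "ts BIGINT, msg STRING", "/data/logs")
```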

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-13 Thread Michael Armbrust
There are some known bugs with the parquet serde in Spark 1.1. You can try setting spark.sql.hive.convertMetastoreParquet=true to cause Spark SQL to use the built-in parquet support when the serde looks like parquet. On Mon, Oct 13, 2014 at 2:57 PM, Terry Siu terry@smartfocus.com wrote: I am

Re: Spark SQL Percentile UDAF

2014-10-09 Thread Michael Armbrust
Please file a JIRA: https://issues.apache.org/jira/browse/SPARK/ On Thu, Oct 9, 2014 at 6:48 PM, Anand Mohan chinn...@gmail.com wrote: Hi, I just noticed the

Re: Spark SQL HiveContext Projection Pushdown

2014-10-08 Thread Michael Armbrust
to query in SQL and apply scala functions as UDFs in the SQL is extremely convenient. Project pushdown works flawlessly, not much sure about predicate pushdown (we have 90% optional fields in our dataset and I remember Michael Armbrust telling me that this is a bug in Parquet in that it doesnt allow

Re: How to do broadcast join in SparkSQL

2014-10-08 Thread Michael Armbrust
Thanks for the input. We purposefully made sure that the config option did not make it into a release as it is not something that we are willing to support long term. That said we'll try and make this easier in the future either through hints or better support for statistics. In this particular

Re: Support for Parquet V2 in ParquetTableSupport?

2014-10-08 Thread Michael Armbrust
That's a good question; I'm not sure if that will work. I will note that we are hoping to do some upgrades of our parquet support in the near future. On Tue, Oct 7, 2014 at 10:33 PM, Michael Allman mich...@videoamp.com wrote: Hello, I was interested in testing Parquet V2 with Spark SQL, but

Re: Spark-SQL: SchemaRDD - ClassCastException

2014-10-08 Thread Michael Armbrust
Using SUM on a string should automatically cast the column. Also you can use CAST to change the datatype https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions . What version of Spark are you running? This could be
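A sketch of the explicit cast (hypothetical table and column):

```scala
sqlContext.sql("SELECT SUM(CAST(price AS DOUBLE)) FROM sales")
```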

Re: Spark-SQL: SchemaRDD - ClassCastException

2014-10-08 Thread Michael Armbrust
, Michael Armbrust mich...@databricks.com wrote: Using SUM on a string should automatically cast the column. Also you can use CAST to change the datatype https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions . What version of Spark are you

Re: Spark SQL - custom aggregation function (UDAF)

2014-10-06 Thread Michael Armbrust
No, not yet. Only Hive UDAFs are supported. On Mon, Oct 6, 2014 at 2:18 AM, Pei-Lun Lee pl...@appier.com wrote: Hi, Does spark sql currently support user-defined custom aggregation function in scala like the way UDF defined with sqlContext.registerFunction? (not hive UDAF) Thanks, --
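For reference, a sketch of registering a Hive UDAF instead (hypothetical class and table):

```scala
hiveContext.sql("CREATE TEMPORARY FUNCTION my_agg AS 'com.example.MyUDAF'")
hiveContext.sql("SELECT my_agg(value) FROM events")
```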

Re: SparkSQL on Hive error

2014-10-03 Thread Michael Armbrust
Are you running master? There was briefly a regression here that is hopefully fixed by spark#2635 https://github.com/apache/spark/pull/2635. On Fri, Oct 3, 2014 at 1:43 AM, Kevin Paul kevinpaulap...@gmail.com wrote: Hi all, I tried to launch my application with spark-submit, the command I use

Re: How to make ./bin/spark-sql work with hive?

2014-10-03 Thread Michael Armbrust
Often java.lang.NoSuchMethodError means that you have more than one version of a library on your classpath, in this case it looks like hive. On Thu, Oct 2, 2014 at 8:44 PM, Li HM hmx...@gmail.com wrote: I have rebuild package with -Phive Copied hive-site.xml to conf (I am using hive-0.12)

Re: [SparkSQL] Function parity with Shark?

2014-10-03 Thread Michael Armbrust
(DelegatingMethodAccessorImpl.java:43) Let me know if any of these warrant a JIRA, thanks On Thu, Oct 2, 2014 at 2:00 PM, Michael Armbrust mich...@databricks.com wrote: What are the errors you are seeing? All of those functions should work. On Thu, Oct 2, 2014 at 6:56 AM, Yana

Re: How to make ./bin/spark-sql work with hive?

2014-10-03 Thread Michael Armbrust
by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hdfs.server.namenode.ha.IPFailoverProxyProvider not found at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1801) at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1893) ... 57 more On Fri, Oct 3, 2014 at 1:55 AM, Michael

Re: Spark SQL: ArrayIndexOutofBoundsException

2014-10-02 Thread Michael Armbrust
The bug is likely in your data. Do you have lines in your input file that do not contain the \t character? If so, .split will only return a single element and p(1) from the .map() is going to throw java.lang.ArrayIndexOutOfBoundsException: 1 On Thu, Oct 2, 2014 at 3:35 PM, SK
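A sketch of guarding the parse against such lines (hypothetical input path):

```scala
val pairs = sc.textFile("data.tsv")
  .map(_.split("\t"))
  .filter(_.length >= 2)        // drop malformed lines instead of throwing
  .map(p => (p(0), p(1)))
```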

Re: Load multiple parquet file as single RDD

2014-10-02 Thread Michael Armbrust
parquetFile accepts a comma separated list of files. Also, unionAll does not write to disk. However, unless you are running a recent version (compiled from master since this was added https://github.com/apache/spark/commit/f858f466862541c3faad76a1fa2391f1c17ec9dd) it's missing an optimization and
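A sketch of both points (hypothetical paths):

```scala
// A comma separated list loads as a single SchemaRDD:
val both = sqlContext.parquetFile("/data/day1.parquet,/data/day2.parquet")

// unionAll only combines query plans; nothing is written to disk:
val unioned = sqlContext.parquetFile("/data/day1.parquet")
  .unionAll(sqlContext.parquetFile("/data/day2.parquet"))
```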

Re: Getting table info from HiveContext

2014-10-02 Thread Michael Armbrust
We actually leave all the DDL commands up to hive, so there is no programmatic way to access the things you are looking for. On Thu, Oct 2, 2014 at 5:17 PM, Banias calvi...@yahoo.com.invalid wrote: Hi, Would anybody know how to get the following information from HiveContext given a Hive table

Re: Fwd: Spark SQL: ArrayIndexOutofBoundsException

2014-10-02 Thread Michael Armbrust
This is hard to do in general, but you can get what you are asking for by putting the following class in scope. implicit class BetterRDD[A: scala.reflect.ClassTag](rdd: org.apache.spark.rdd.RDD[A]) { def dropOne = rdd.mapPartitionsWithIndex((i, iter) => if (i == 0 && iter.hasNext) { iter.next; iter

Re: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

2014-10-01 Thread Michael Armbrust
You are likely running into SPARK-3708 https://issues.apache.org/jira/browse/SPARK-3708, which was fixed by #2594 https://github.com/apache/spark/pull/2594 this morning. On Wed, Oct 1, 2014 at 8:09 AM, tonsat ton...@gmail.com wrote: We have a configuration CDH5.0,Spark1.1.0(stand alone) and

Re: Spark Language Integrated SQL for join on expression

2014-09-29 Thread Michael Armbrust
I'll note that the DSL is pretty experimental. That said you should be able to do something like user.id.attr On Mon, Sep 29, 2014 at 3:39 PM, Benyi Wang bewang.t...@gmail.com wrote: scala user res19: org.apache.spark.sql.SchemaRDD = SchemaRDD[0] at RDD at SchemaRDD.scala:98 == Query Plan

Re: view not supported in spark thrift server?

2014-09-28 Thread Michael Armbrust
Views are not supported yet. It's not currently on the near-term roadmap, but that can change if there is sufficient demand or someone in the community is interested in implementing them. I do not think it would be very hard. Michael On Sun, Sep 28, 2014 at 11:59 AM, Du Li

Re: Spark SQL question: how to control the storage level of cached SchemaRDD?

2014-09-28 Thread Michael Armbrust
This is not possible until https://github.com/apache/spark/pull/2501 is merged. On Sun, Sep 28, 2014 at 6:39 PM, Haopu Wang hw...@qilinsoft.com wrote: Thanks for the response. From Spark Web-UI's Storage tab, I do see cached RDD there. But the storage level is Memory Deserialized 1x

Re: Spark SQL question: how to control the storage level of cached SchemaRDD?

2014-09-28 Thread Michael Armbrust
You might consider instead storing the data using saveAsParquetFile and then querying that after running sqlContext.parquetFile(...).registerTempTable(...). On Sun, Sep 28, 2014 at 6:43 PM, Michael Armbrust mich...@databricks.com wrote: This is not possible until https://github.com/apache/spark

Re: Is it possible to use Parquet with Dremel encoding

2014-09-27 Thread Michael Armbrust
Based on your first example it looks like what you want is actually run length encoding (which parquet does support https://github.com/Parquet/parquet-format/blob/master/Encodings.md). Repetition and definition levels are used to reconstruct nested or repeated (arrays) data that has been shredded

Re: parquetFile and wilcards

2014-09-24 Thread Michael Armbrust
This behavior is inherited from the parquet input format that we use. You could list the files manually and pass them as a comma separated list. On Wed, Sep 24, 2014 at 7:46 AM, Marius Soutier mps@gmail.com wrote: Hello, sc.textFile and so on support wildcards in their path, but

Re: parquetFile and wilcards

2014-09-24 Thread Michael Armbrust
outside of Spark's control? Nick On Wed, Sep 24, 2014 at 1:01 PM, Michael Armbrust mich...@databricks.com wrote: This behavior is inherited from the parquet input format that we use. You could list the files manually and pass them as a comma separated list. On Wed, Sep 24, 2014 at 7:46 AM

Re: Exception with SparkSql and Avro

2014-09-23 Thread Michael Armbrust
Can you show me the DDL you are using? Here is an example of a way I got the avro serde to work: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala#L246 Also, this isn't ready for primetime yet, but a quick plug for some ongoing work:

Re: Spark SQL CLI

2014-09-23 Thread Michael Armbrust
You can't directly query JSON tables from the CLI or JDBC server since temporary tables only live for the life of the Spark Context. This PR will eventually (targeted for 1.2) let you do what you want in pure SQL: https://github.com/apache/spark/pull/2475 On Mon, Sep 22, 2014 at 4:52 PM, Yin

Re: Spark SQL CLI

2014-09-23 Thread Michael Armbrust
AM, Michael Armbrust mich...@databricks.com wrote: You can't directly query JSON tables from the CLI or JDBC server since temporary tables only live for the life of the Spark Context. This PR will eventually (targeted for 1.2) let you do what you want in pure SQL: https://github.com/apache

Re: Spark SQL 1.1.0 - large insert into parquet runs out of memory

2014-09-23 Thread Michael Armbrust
I would hope that things should work for this kind of workflow. I'm curious if you have tried using saveAsParquetFile instead of inserting directly into a hive table (you could still register this as an external table afterwards). Right now inserting into Hive tables is going through their

Re: SQL status code to indicate success or failure of query

2014-09-23 Thread Michael Armbrust
An exception should be thrown in the case of failure for DDL commands. On Tue, Sep 23, 2014 at 4:55 PM, Du Li l...@yahoo-inc.com.invalid wrote: Hi, After executing sql() in SQLContext or HiveContext, is there a way to tell whether the query/command succeeded or failed? Method sql()

Re: ParquetRecordReader warnings: counter initialization

2014-09-22 Thread Michael Armbrust
These are coming from the parquet library and as far as I know can be safely ignored. On Mon, Sep 22, 2014 at 3:27 AM, Andrew Ash and...@andrewash.com wrote: Hi All, I'm seeing the below WARNINGs in stdout using Spark SQL in Spark 1.1.0 -- is this warning a known issue? I don't see any open

Re: Shuffle size difference - operations on RDD vs. operations on SchemaRDD

2014-09-21 Thread Michael Armbrust
Spark SQL always uses a custom configuration of Kryo under the hood to improve shuffle performance: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlSerializer.scala Michael On Sun, Sep 21, 2014 at 9:04 AM, Grega Kešpret gr...@celtra.com

Re: SQL shell for Spark SQL?

2014-09-18 Thread Michael Armbrust
Check out the Spark SQL cli https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-spark-sql-cli . On Wed, Sep 17, 2014 at 10:50 PM, David Rosenstrauch dar...@darose.net wrote: Is there a shell available for Spark SQL, similar to the way the Shark or Hive shells work?

Re: schema for schema

2014-09-18 Thread Michael Armbrust
This looks like a bug, we are investigating. On Thu, Sep 18, 2014 at 8:49 AM, Eric Friedman eric.d.fried...@gmail.com wrote: I have a SchemaRDD which I've gotten from a parquetFile. Did some transforms on it and now want to save it back out as parquet again. Getting a SchemaRDD proves

Re: problem with HiveContext inside Actor

2014-09-17 Thread Michael Armbrust
- dev Is it possible that you are constructing more than one HiveContext in a single JVM? Due to global state in Hive code this is not allowed. Michael On Wed, Sep 17, 2014 at 7:21 PM, Cheng, Hao hao.ch...@intel.com wrote: Hi, Du I am not sure what you mean “triggers the HiveContext to

Re: SparkSQL hang due to PERFLOG method=acquireReadWriteLocks

2014-09-12 Thread Michael Armbrust
What is in your hive-site.xml? On Thu, Sep 11, 2014 at 11:04 PM, linkpatrickliu linkpatrick...@live.com wrote: I am running Spark Standalone mode with Spark 1.1 I started SparkSQL thrift server as follows: ./sbin/start-thriftserver.sh Then I use beeline to connect to it. Now, I can

Re: Spark SQL Thrift JDBC server deployment for production

2014-09-12 Thread Michael Armbrust
Something like the following should let you launch the thrift server on yarn. HADOOP_CONF_DIR=/etc/hadoop/conf HIVE_SERVER2_THRIFT_PORT=12345 MASTER=yarn-client ./sbin/start-thriftserver.sh On Thu, Sep 11, 2014 at 8:30 PM, Denny Lee denny.g@gmail.com wrote: Could you provide some

Re: SchemaRDD saveToCassandra

2014-09-11 Thread Michael Armbrust
This might be a better question to ask on the cassandra mailing list as I believe that is where the exception is coming from. On Thu, Sep 11, 2014 at 2:37 AM, lmk lakshmi.muralikrish...@gmail.com wrote: Hi, My requirement is to extract certain fields from json files, run queries on them and

Re: Spark SQL -- more than two tables for join

2014-09-10 Thread Michael Armbrust
What version of Spark SQL are you running here? I think a lot of your concerns have likely been addressed in more recent versions of the code / documentation. (Spark 1.1 should be published in the next few days) In particular, for serious applications you should use a HiveContext and HiveQL as

Re: Spark HiveQL support plan

2014-09-10 Thread Michael Armbrust
HiveQL is the default language for the JDBC server which will be available as part of the 1.1 release (coming very soon!). Adding support for calling MLlib and other spark libraries is on the roadmap, but not possible at this moment. On Tue, Sep 9, 2014 at 1:45 PM, XUE, Xiaohui

Re: Spark SQL check if query is completed (pyspark)

2014-09-08 Thread Michael Armbrust
You are probably not getting an error because the exception is happening inside of Hive. I'd still consider this a bug if you'd like to open a JIRA. On Mon, Sep 8, 2014 at 3:02 AM, jamborta jambo...@gmail.com wrote: thank you for the replies. I am running an insert on a join (INSERT

Re: Spark SQL on Cassandra

2014-09-08 Thread Michael Armbrust
I believe DataStax is working on better integration here, but until that is ready you can use the applySchema API. Basically you will convert the CassandraTable into an RDD of Row objects using a .map() and then you can call applySchema (provided by SQLContext) to get a SchemaRDD. More details
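A sketch of the applySchema route (cassandraTable is a hypothetical RDD of connector rows with getString/getInt accessors):

```scala
import org.apache.spark.sql._

val rowRDD = cassandraTable.map(r => Row(r.getString("id"), r.getInt("count")))
val schema = StructType(
  StructField("id", StringType, nullable = false) ::
  StructField("count", IntegerType, nullable = true) :: Nil)

sqlContext.applySchema(rowRDD, schema).registerTempTable("events")
```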
