Re: Spark group by sub column

2015-06-19 Thread Michael Armbrust
You are probably looking to do .select(explode($"to"), ...) first, which will produce a new row for each value in the input array. On Fri, Jun 19, 2015 at 12:02 AM, Suraj Shetiya surajshet...@gmail.com wrote: Hi, I wanted to obtain a grouped by frame from a dataframe. A snippet of the column
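A minimal sketch of the suggested pattern, assuming a DataFrame `df` with a hypothetical key column "id" and array column "tags":

```scala
import org.apache.spark.sql.functions._

// explode turns each element of the array column into its own row,
// which can then be grouped like any other column.
val exploded = df.select(df("id"), explode(df("tags")).as("tag"))
val counts = exploded.groupBy("tag").count()
```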

Re: [SparkSQL]. MissingRequirementError when creating dataframe from RDD (new error in 1.4)

2015-06-18 Thread Michael Armbrust
Thanks for reporting. Filed as: https://issues.apache.org/jira/browse/SPARK-8470 On Thu, Jun 18, 2015 at 5:35 PM, Adam Lewandowski adam.lewandow...@gmail.com wrote: Since upgrading to Spark 1.4, I'm getting a scala.reflect.internal.MissingRequirementError when creating a DataFrame from an

Re: confusing ScalaReflectionException with DataFrames in 1.4

2015-06-18 Thread Michael Armbrust
com.rr.data.visits.VisitSequencerRunner ./mvt-master-SNAPSHOT-jar-with-dependencies.jar --- Our jar contains both com.rr.data.visits.orc.OrcReadWrite (which you can see in the stack trace) and the unfound com.rr.data.Visit. I'll open a Jira ticket On Thu, Jun 18, 2015 at 3:26 PM Michael Armbrust

Re: Spark-sql versus Impala versus Hive

2015-06-18 Thread Michael Armbrust
I would also love to see a more recent version of Spark SQL. There have been a lot of performance improvements between 1.2 and 1.4 :) On Thu, Jun 18, 2015 at 3:18 PM, Steve Nunez snu...@hortonworks.com wrote: Interesting. What were the Hive settings? Specifically it would be useful to know

Re: confusing ScalaReflectionException with DataFrames in 1.4

2015-06-18 Thread Michael Armbrust
How are you adding com.rr.data.Visit to spark? With --jars? It is possible we are using the wrong classloader. Could you open a JIRA? On Thu, Jun 18, 2015 at 2:56 PM, Chad Urso McDaniel cha...@gmail.com wrote: We are seeing class exceptions when converting to a DataFrame. Anyone out there

Re: Spark SQL and Skewed Joins

2015-06-16 Thread Michael Armbrust
this would be a great addition to spark, and ideally it belongs in spark core not sql. I agree that this would be a great addition, but we would likely want a specialized SQL implementation for performance reasons.

Re: cassandra with jdbcRDD

2015-06-16 Thread Michael Armbrust
I would suggest looking at https://github.com/datastax/spark-cassandra-connector On Tue, Jun 16, 2015 at 4:01 AM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: hi all! is there a way to connect cassandra with jdbcRDD ? -- View this message in context:

Re: Spark SQL JDBC Source Join Error

2015-06-14 Thread Michael Armbrust
Sounds like SPARK-5456 https://issues.apache.org/jira/browse/SPARK-5456. Which is fixed in Spark 1.4. On Sun, Jun 14, 2015 at 11:57 AM, Sathish Kumaran Vairavelu vsathishkuma...@gmail.com wrote: Hello Everyone, I pulled 2 different tables from the JDBC source and then joined them using the

Re: DataFrame and JDBC regression?

2015-06-14 Thread Michael Armbrust
Can you please file a JIRA? On Sun, Jun 14, 2015 at 2:20 PM, Peter Haumer phau...@us.ibm.com wrote: Hello. I have an ETL app that appends to a JDBC table new results found at each run. In 1.3.1 I did this: testResultsDF.insertIntoJDBC(CONNECTION_URL + ";user=" + USER + ";password=" +

Re: [Spark] What is the most efficient way to do such a join and column manipulation?

2015-06-13 Thread Michael Armbrust
Yes, it's all just RDDs under the covers. DataFrames/SQL is just a more concise way to express your parallel programs. On Sat, Jun 13, 2015 at 5:25 PM, Rex X dnsr...@gmail.com wrote: Thanks, Don! Does the SQL implementation of Spark do parallel processing on records by default? -Rex On Sat,

Re: Spark SQL and Skewed Joins

2015-06-12 Thread Michael Armbrust
2. Does 1.3.2 or 1.4 have any enhancements that can help? I tried to use 1.3.1 but SPARK-6967 prohibits me from doing so. Now that 1.4 is available, would any of the JOIN enhancements help this situation? I would try Spark 1.4 after running SET spark.sql.planner.sortMergeJoin=true.
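A sketch of the suggested setting (the DataFrame and key names are hypothetical; the flag name is the one from the thread):

```scala
// Enable sort-merge join (Spark 1.4) before running the skewed join.
sqlContext.sql("SET spark.sql.planner.sortMergeJoin=true")
val joined = bigTable.join(skewedTable, bigTable("key") === skewedTable("key"))
```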

Re: DataFrames for non-SQL computation?

2015-06-11 Thread Michael Armbrust
Yes, DataFrames are for much more than SQL and I would recommend using them wherever possible. It is much easier for us to do optimizations when we have more information about the schema of your data, and as such, most of our ongoing optimization effort will focus on making DataFrames faster.

Re: SparkSQL can't read S3 path for hive external table

2015-06-01 Thread Michael Armbrust
This sounds like a problem that was fixed in Spark 1.3.1. https://issues.apache.org/jira/browse/SPARK-6351 On Mon, Jun 1, 2015 at 5:44 PM, Akhil Das ak...@sigmoidanalytics.com wrote: This thread

Re: RDD staleness

2015-05-31 Thread Michael Armbrust
Each time you run a Spark SQL query we will create new RDDs that load the data and thus you should see the newest results. There is one caveat: formats that use the native Data Source API (parquet, ORC (in Spark 1.4), JSON (in Spark 1.5)) cache file metadata to speed up interactive querying. To
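If that cached metadata does go stale (for example, files added behind an existing table), it can be refreshed explicitly; a sketch, assuming a HiveContext and a hypothetical table name:

```scala
// Forces the file listing and schema for the table to be re-read
// on the next query instead of served from the metadata cache.
hiveContext.refreshTable("my_table")
```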

Re: DataFrame groupBy vs RDD groupBy

2015-05-22 Thread Michael Armbrust
DataFrames have a lot more information about the data, so there is a whole class of optimizations that are possible there that we cannot do in RDDs. This is why we are focusing a lot of effort on this part of the project. In Spark 1.4 you can accomplish what you want using the new window function
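A sketch of the Spark 1.4 window-function API mentioned above (the column names "user" and "score" are hypothetical):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank rows within each group -- the kind of per-group logic that
// previously required dropping down to RDD groupBy.
val w = Window.partitionBy(col("user")).orderBy(col("score").desc)
val ranked = df.withColumn("rank", rowNumber().over(w))
```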

Re: Naming an DF aggregated column

2015-05-19 Thread Michael Armbrust
customerDF.groupBy("state").agg(max($"discount").alias("newName")) (or .as(...), both functions can take a String or a Symbol) On Tue, May 19, 2015 at 2:11 PM, Cesar Flores ces...@gmail.com wrote: I would like to ask if there is a way of specifying the column name of a data frame aggregation. For

Re: Using groupByKey with Spark SQL

2015-05-15 Thread Michael Armbrust
Perhaps you are looking for GROUP BY and collect_set, which would allow you to stay in SQL. I'll add that in Spark 1.4 you can get access to items of a row by name. On Fri, May 15, 2015 at 10:48 AM, Edward Sargisson ejsa...@gmail.com wrote: Hi all, This might be a question to be answered or
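A sketch of staying in SQL with collect_set (table and column names are hypothetical; collect_set is a Hive UDAF, so this assumes a HiveContext):

```scala
// Collects the distinct items seen for each key into an array column.
val grouped = hiveContext.sql(
  "SELECT user_id, collect_set(item) AS items FROM events GROUP BY user_id")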

Re: Best practice to avoid ambiguous columns in DataFrame.join

2015-05-15 Thread Michael Armbrust
There are several ways to solve this ambiguity: 1. Use the DataFrames to get the attribute so it's already resolved and not just a string we need to map to a DataFrame: df.join(df2, df("_1") === df2("_1")) 2. Use aliases: df.as('a).join(df2.as('b), $"a._1" === $"b._1") 3. Rename the columns as you
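The same three options as a runnable sketch, assuming two DataFrames df and df2 that share a column "_1" and the usual `import sqlContext.implicits._` for the $"..." syntax:

```scala
import sqlContext.implicits._

// 1. Resolve the column through each DataFrame:
val joined1 = df.join(df2, df("_1") === df2("_1"))
// 2. Alias the DataFrames and qualify the column references:
val joined2 = df.as("a").join(df2.as("b"), $"a._1" === $"b._1")
// 3. Rename one side's column before joining:
val joined3 = df.withColumnRenamed("_1", "key").join(df2, $"key" === df2("_1"))
```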

Re: how to delete data from table in sparksql

2015-05-14 Thread Michael Armbrust
The list of unsupported hive features should mention that it implicitly includes features added after Hive 13. You cannot yet compile with Hive 13, though we are investigating this for 1.5 On Thu, May 14, 2015 at 6:40 AM, Denny Lee denny.g@gmail.com wrote: Delete from table is available

Re: store hive metastore on persistent store

2015-05-14 Thread Michael Armbrust
You can configure Spark SQL's Hive interaction by placing a hive-site.xml file in the conf/ directory. On Thu, May 14, 2015 at 10:24 AM, jamborta jambo...@gmail.com wrote: Hi all, is it possible to set hive.metastore.warehouse.dir, that is internally created by spark, to be stored externally

Re: [Spark SQL 1.3.1] data frame saveAsTable returns exception

2015-05-14 Thread Michael Armbrust
date for Spark version 1.4? Regards, Ishwardeep *From:* Michael Armbrust [mailto:mich...@databricks.com] *Sent:* Wednesday, May 13, 2015 10:54 PM *To:* ayan guha *Cc:* Ishwardeep Singh; user *Subject:* Re: [Spark SQL 1.3.1] data frame saveAsTable returns exception I think

Re: Spark SQL: preferred syntax for column reference?

2015-05-14 Thread Michael Armbrust
that the column reference is valid? Thx. Dean On Wednesday, May 13, 2015, Michael Armbrust mich...@databricks.com wrote: I would not say that either method is preferred (neither is old/deprecated). One advantage to the second is that you are referencing a column from a specific dataframe

Re: Backward compatibility with org.apache.spark.sql.api.java.Row class

2015-05-13 Thread Michael Armbrust
Sorry for missing that in the upgrade guide. As part of unifying the Java and Scala interfaces we got rid of the Java-specific row. You are correct in assuming that you want to use Row in org.apache.spark.sql from both Scala and Java now. On Wed, May 13, 2015 at 2:48 AM, Emerson Castañeda

Re: Spark SQL ArrayIndexOutOfBoundsException

2015-05-12 Thread Michael Armbrust
val trainRDD = rawTrainData.map( rawRow => Row( rawRow.split(",").map(_.toInt) ) ) The above is creating a Row with a single column that contains a sequence. You need to extract the sequence using varargs: val trainRDD = rawTrainData.map( rawRow => Row( rawRow.split(",").map(_.toInt): _* )) You

Re: Reading Nested Fields in DataFrames

2015-05-11 Thread Michael Armbrust
Since there is an array here you are probably looking for HiveQL's LATERAL VIEW explode https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView . On Mon, May 11, 2015 at 7:12 AM, ayan guha guha.a...@gmail.com wrote: Typically you would use . notation to access, same way you
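A HiveQL sketch of LATERAL VIEW explode (table and field names are hypothetical; requires a HiveContext):

```scala
// Each element of the "items" array becomes a row aliased as "item",
// whose nested fields can then be selected per element.
val flattened = hiveContext.sql(
  """SELECT id, item.price
    |FROM orders
    |LATERAL VIEW explode(items) itemTable AS item""".stripMargin)
```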

Re: Met a problem when using spark to load parquet files with different version schemas

2015-05-11 Thread Michael Armbrust
BTW, I use spark 1.3.1, and already set spark.sql.parquet.useDataSourceApi to false. Schema merging is only supported when this flag is set to true (setting it to false uses old code that will be removed once the new code is proven).

Re: Get a list of temporary RDD tables via Thrift

2015-05-11 Thread Michael Armbrust
Temporary tables are not displayed by SHOW TABLES until Spark 1.3. On Mon, May 11, 2015 at 12:54 PM, Judy Nash judyn...@exchange.microsoft.com wrote: Hi, How can I get a list of temporary tables via Thrift? Have used thrift’s startWithContext and registered a temp table, but not

Re: Spark SQL: STDDEV working in Spark Shell but not in a standalone app

2015-05-11 Thread Michael Armbrust
:* Michael Armbrust [mailto:mich...@databricks.com] *Sent:* Saturday, May 09, 2015 11:32 AM *To:* Oleg Shirokikh *Cc:* user *Subject:* Re: Spark SQL: STDDEV working in Spark Shell but not in a standalone app Are you perhaps using a HiveContext in the shell but a SQLContext in your app? I don't

Re: Met a problem when using spark to load parquet files with different version schemas

2015-05-11 Thread Michael Armbrust
it failed to merge incompatible schemas. I think here it means that, the int schema cannot be merged with the long one. Does it mean that the schema merging doesn't support the same field with different types? -Wei On Mon, May 11, 2015 at 3:10 PM, Michael Armbrust mich...@databricks.com

Re: Spark can not access jar from HDFS !!

2015-05-09 Thread Michael Armbrust
That code path is entirely delegated to hive. Does hive support this? You might try instead using sparkContext.addJar. On Sat, May 9, 2015 at 12:32 PM, Ravindra ravindra.baj...@gmail.com wrote: Hi All, I am trying to create custom udfs with hiveContext as given below - scala
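A sketch of the suggested workaround (the jar path and UDF class name are placeholders):

```scala
// Ship the jar through Spark itself rather than relying on Hive's
// ADD JAR handling of HDFS paths, then register the function.
sc.addJar("hdfs:///user/me/udfs.jar")
hiveContext.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUDF'")
```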

Re: [SQL][Dataframe] Change data source after saveAsParquetFile

2015-05-08 Thread Michael Armbrust
That's a feature flag for a new code path for reading parquet files. It's only there in case bugs are found in the old path and will be removed once we are sure the new path is solid. On Fri, May 8, 2015 at 8:04 AM, Peter Rudenko petro.rude...@gmail.com wrote: Hm, thanks. Do you know what this

Re: [SparkSQL] cannot filter by a DateType column

2015-05-08 Thread Michael Armbrust
What version of Spark are you using? It appears that at least in master we are doing the conversion correctly, but it's possible older versions of applySchema do not. If you can reproduce the same bug in master, can you open a JIRA? On Fri, May 8, 2015 at 1:36 AM, Haopu Wang hw...@qilinsoft.com

Re: Hash Partitioning and Dataframes

2015-05-08 Thread Michael Armbrust
What are you trying to accomplish? Internally Spark SQL will add Exchange operators to make sure that data is partitioned correctly for joins and aggregations. If you are going to do other RDD operations on the result of dataframe operations and you need to manually control the partitioning,

Re: CREATE TABLE ignores database when using PARQUET option

2015-05-08 Thread Michael Armbrust
to hear you guys already working to fix this on future releases. Thanks, Carlos On Fri, May 8, 2015 at 2:43 PM, Michael Armbrust mich...@databricks.com wrote: This is an unfortunate limitation of the datasource api which does not support multiple databases. For parquet in particular (if you

Re: saveAsTable fails on Python with Unresolved plan found

2015-05-07 Thread Michael Armbrust
Sorry for the confusion. SQLContext doesn't have a persistent metastore so it's not possible to save data as a table. If anyone wants to contribute, I'd welcome a new query planner strategy for SQLContext that gave a better error message. On Thu, May 7, 2015 at 8:41 AM, Judy Nash

Re: Avro to Parquet ?

2015-05-07 Thread Michael Armbrust
Spark SQL using the Data Source API can also do this with much less code https://twitter.com/michaelarmbrust/status/579346328636891136. https://github.com/databricks/spark-avro On Thu, May 7, 2015 at 8:41 AM, Jonathan Coveney jcove...@gmail.com wrote: A helpful example of how to convert:

Re: AvroFiles

2015-05-07 Thread Michael Armbrust
I would suggest also looking at: https://github.com/databricks/spark-avro On Wed, May 6, 2015 at 10:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Hello, This is how i read Avro data. import org.apache.avro.generic.GenericData import org.apache.avro.generic.GenericRecord import

Re: Possible to use hive-config.xml instead of hive-site.xml for HiveContext?

2015-05-06 Thread Michael Armbrust
I don't think that works: https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration On Tue, May 5, 2015 at 6:25 PM, nitinkak001 nitinkak...@gmail.com wrote: I am running hive queries from HiveContext, for which we need a hive-site.xml. Is it possible to replace it with

Re: Error in SparkSQL/Scala IDE

2015-05-06 Thread Michael Armbrust
Hi Iulian, The relevant code is in ScalaReflection https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala, and it would be awesome if you could suggest how to fix this more generally. Specifically, this code is also broken when

Re: Two DataFrames with different schema, unionAll issue.

2015-05-05 Thread Michael Armbrust
You need to add a select clause to at least one dataframe to give them the same schema before you can union them (much like in SQL). On Tue, May 5, 2015 at 3:24 AM, Wilhelm niznik.pa...@gmail.com wrote: Hey there, 1.) I'm loading 2 avro files that have slightly different schema df1 =
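A sketch of aligning the schemas before the union (column names are hypothetical; lit(null) stands in for the column one side is missing):

```scala
import org.apache.spark.sql.functions._

// Project both sides onto the same columns, then union.
val left  = df1.select(col("a"), col("b"), lit(null).cast("string").as("c"))
val right = df2.select(col("a"), col("b"), col("c"))
val unioned = left.unionAll(right)
```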

Re: Inserting Nulls

2015-05-05 Thread Michael Armbrust
Option only works when you are going from case classes. Just put null into the Row when you want the value to be null. On Tue, May 5, 2015 at 9:00 AM, Masf masfwo...@gmail.com wrote: Hi. I have a spark application where I store the results into table (with HiveContext). Some of these
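A sketch of building Rows with nulls directly (schema and values are hypothetical):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// No Option wrapper: put null in the Row and mark the field nullable.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("score", IntegerType, nullable = true)))
val rows = sc.parallelize(Seq(Row("alice", 10), Row("bob", null)))
val df = sqlContext.createDataFrame(rows, schema)
```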

Re: spark sql, creating literal columns in java.

2015-05-05 Thread Michael Armbrust
This should work from java too: http://spark.apache.org/docs/1.3.1/api/java/index.html#org.apache.spark.sql.functions$ On Tue, May 5, 2015 at 4:15 AM, Jan-Paul Bultmann janpaulbultm...@me.com wrote: Hey, What is the recommended way to create literal columns in java? Scala has the `lit`
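A sketch of lit (column name and literal value are hypothetical); from Java the equivalent is a static call on org.apache.spark.sql.functions, shown here in its Scala form:

```scala
import org.apache.spark.sql.functions.lit

// Adds a constant column to every row.
val tagged = df.withColumn("source", lit("batch-2015"))
```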

Re: sparksql support hive view

2015-05-04 Thread Michael Armbrust
We support both LATERAL VIEWs (a query language feature that lets you turn a single row into many rows, for example with an explode) and virtual views (a table that is really just a query that is run on demand). On Mon, May 4, 2015 at 7:12 PM, luohui20...@sina.com wrote: guys, just to

Re: Help with Spark SQL Hash Distribution

2015-05-04 Thread Michael Armbrust
If you do a join with at least one equality relationship between the two tables, Spark SQL will automatically hash partition the data and perform the join. If you are looking to prepartition the data, that information is not yet propagated from the in memory cached representation so won't help

Re: Is LIMIT n in Spark SQL useful?

2015-05-04 Thread Michael Armbrust
The JDBC interface for Spark SQL does not support pushing down limits today. On Mon, May 4, 2015 at 8:06 AM, Robin East robin.e...@xense.co.uk wrote: and a further question - have you tried running this query in psql? what’s the performance like there? On 4 May 2015, at 16:04, Robin East

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread Michael Armbrust
If your data is evenly distributed (i.e. no skewed datapoints in your join keys), it can also help to increase spark.sql.shuffle.partitions (default is 200). On Mon, May 4, 2015 at 8:03 AM, Richard Marscher rmarsc...@localytics.com wrote: In regards to the large GC pauses, assuming you allocated
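A sketch of raising the setting (the value 800 is just illustrative):

```scala
sqlContext.sql("SET spark.sql.shuffle.partitions=800")
// equivalently: sqlContext.setConf("spark.sql.shuffle.partitions", "800")
```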

Re: SparkSQL Nested structure

2015-05-04 Thread Michael Armbrust
You are looking for LATERAL VIEW explode https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode in HiveQL. On Mon, May 4, 2015 at 7:49 AM, Giovanni Paolo Gibilisco gibb...@gmail.com wrote: Hi, I'm trying to parse log files generated by Spark using

Re: Is SQLContext thread-safe?

2015-04-30 Thread Michael Armbrust
Unfortunately, I think the SQLParser is not threadsafe. I would recommend using HiveQL. On Thu, Apr 30, 2015 at 4:07 AM, Wangfei (X) wangf...@huawei.com wrote: actually this is a sql parse exception, are you sure your sql is right? Sent from my iPhone. On Apr 30, 2015, at 18:50, Haopu Wang

Re: casting timestamp into long fail in Spark 1.3.1

2015-04-30 Thread Michael Armbrust
This looks like a bug. Mind opening a JIRA? On Thu, Apr 30, 2015 at 3:49 PM, Justin Yip yipjus...@prediction.io wrote: After some trial and error, using DataType solves the problem: df.withColumn("millis", $"eventTime".cast(org.apache.spark.sql.types.LongType) * 1000) Justin On Thu, Apr

Re: [Spark SQL] Problems creating a table in specified schema/database

2015-04-29 Thread Michael Armbrust
No, sorry this is not supported. Support for more than one database is lacking in several areas (though mostly works for hive tables). I'd like to fix this in Spark 1.5. On Tue, Apr 28, 2015 at 1:54 AM, James Aley james.a...@swiftkey.com wrote: Hey all, I'm trying to create tables from

Re: Hive table creation - possible bug in Spark 1.3?

2015-04-22 Thread Michael Armbrust
Sorry for the confusion. We should be more clear about the semantics in the documentation. (PRs welcome :) ) .saveAsTable does not create a hive table, but instead creates a Spark Data Source table. Here the metadata is persisted into Hive, but hive cannot read the tables (as this API support

Re: SparkSQL performance

2015-04-22 Thread Michael Armbrust
using Avro please? Many thanks again! Renato M. 2015-04-21 20:45 GMT+02:00 Michael Armbrust mich...@databricks.com: Here is an example using rows directly: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema Avro or parquet input would

Re: SparkSQL performance

2015-04-21 Thread Michael Armbrust
for something that provides a better performance than what we are seeing now. Would you recommend using Avro presentation then? Thanks again! Renato M. 2015-04-21 1:18 GMT+02:00 Michael Armbrust mich...@databricks.com: There is a cost to converting from JavaBeans to Rows and this code path

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread Michael Armbrust
This is https://issues.apache.org/jira/browse/SPARK-6231 Unfortunately this is pretty hard to fix as it's hard for us to differentiate these without aliases. However you can add an alias as follows: from pyspark.sql.functions import * df.alias("a").join(df.alias("b"), col("a.col1") == col("b.col1")) On

Re: HiveContext setConf seems not stable

2015-04-21 Thread Michael Armbrust
:37:39 INFO util.SchemaRDDUtils$: BLOCK BTW It worked on 1.2.1... On Thu, Apr 2, 2015 at 11:47 AM, Hao Ren inv...@gmail.com wrote: Hi, Jira created: https://issues.apache.org/jira/browse/SPARK-6675 Thank you. On Wed, Apr 1, 2015 at 7:50 PM, Michael Armbrust mich...@databricks.com

Re: SparkSQL performance

2015-04-20 Thread Michael Armbrust
There is a cost to converting from JavaBeans to Rows and this code path has not been optimized. That is likely what you are seeing. On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote: SparkSQL optimizes better by column pruning and predicate pushdown, primarily. Here you are

Re: spark sql error with proto/parquet

2015-04-20 Thread Michael Armbrust
You are probably using an encoding that we don't support. I think this PR may be adding that support: https://github.com/apache/spark/pull/5422 On Sat, Apr 18, 2015 at 5:40 PM, Abhishek R. Singh abhis...@tetrationanalytics.com wrote: I have created a bunch of protobuf based parquet files that

Re: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-16 Thread Michael Armbrust
Filed: https://issues.apache.org/jira/browse/SPARK-6967 Shouldn't they be null? Statistics are only used to eliminate partitions that can't possibly hold matching values. So while you are right this might result in a false positive, that will not result in a wrong answer.

Re: Super slow caching in 1.3?

2015-04-16 Thread Michael Armbrust
the performance of each of the above options is -Original Message- From: Christian Perez [mailto:christ...@svds.com] Sent: Thursday, April 16, 2015 6:09 PM To: Michael Armbrust Cc: user Subject: Re: Super slow caching in 1.3? Hi Michael, Good question! We checked 1.2 and found

Re: How to get a clean DataFrame schema merge

2015-04-15 Thread Michael Armbrust
Schema merging is not the feature you are looking for. It is designed when you are adding new records (that are not associated with old records), which may or may not have new or missing columns. In your case it looks like you have two datasets that you want to load separately and join on a key.

Re: Spark Data Formats ?

2015-04-14 Thread Michael Armbrust
Spark SQL (which also can give you an RDD for use with the standard Spark RDD API) has support for json, parquet, and hive tables http://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources. There is also a library for Avro https://github.com/databricks/spark-avro. On Tue, Apr 14,

Re: Spark SQL reading parquet decimal

2015-04-14 Thread Michael Armbrust
Can you open a JIRA? On Tue, Apr 14, 2015 at 1:56 AM, Clint McNeil cl...@impactradius.com wrote: Hi guys I have parquet data written by Impala: Server version: impalad version 2.1.2-cdh5 RELEASE (build 36aad29cee85794ecc5225093c30b1e06ffb68d3) When using Spark SQL 1.3.0

Re: How to access postgresql on Spark SQL

2015-04-14 Thread Michael Armbrust
There is an example here: http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases On Mon, Apr 13, 2015 at 6:07 PM, doovs...@sina.com wrote: Hi all, Who know how to access postgresql on Spark SQL? Do I need add the postgresql dependency in build.sbt and set
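A sketch of the Spark 1.3 JDBC data source against PostgreSQL (URL, credentials, and table name are placeholders; the PostgreSQL JDBC driver jar must be on the driver and executor classpaths):

```scala
val df = sqlContext.load("jdbc", Map(
  "url"     -> "jdbc:postgresql://host:5432/mydb?user=u&password=p",
  "dbtable" -> "public.my_table"))
```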

Re: Increase partitions reading Parquet File

2015-04-14 Thread Michael Armbrust
RDDs are immutable. Running .repartition does not change the RDD, but instead returns *a new RDD* with more partitions. On Tue, Apr 14, 2015 at 3:59 AM, Masf masfwo...@gmail.com wrote: Hi. It doesn't work. val file = sqlContext.parquetFile("hdfs://node1/user/hive/warehouse/file.parquet")
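A sketch of the point about immutability (the path and partition count are illustrative):

```scala
val file = sqlContext.parquetFile("hdfs://node1/user/hive/warehouse/file.parquet")
// repartition does not mutate `file`; capture and use the returned DataFrame.
val repartitioned = file.repartition(64)
println(repartitioned.rdd.partitions.length)
```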

Re: Cannot saveAsParquetFile from a RDD of case class

2015-04-14 Thread Michael Armbrust
More info on why toDF is required: http://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13 On Tue, Apr 14, 2015 at 6:55 AM, pishen tsai pishe...@gmail.com wrote: I've changed it to import sqlContext.implicits._ but it still doesn't work. (I've

Re: Spark support for Hadoop Formats (Avro)

2015-04-13 Thread Michael Armbrust
The problem is likely that the underlying avro library is reusing objects for speed. You probably need to explicitly copy the values out of the reused record before the collect. On Sat, Apr 11, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: The read seem to be successfully as the

Re: Spark SQL Parquet as External table - 1.3.x HiveMetastoreType now hidden

2015-04-13 Thread Michael Armbrust
Here is the stack trace. The first part shows the log when the session is started in Tableau. It is using the init sql option on the data connection to create the TEMPORARY table myNodeTable. Ah, I see. Thanks for providing the error. The problem here is that temporary tables do not exist in

Re: DataFrame column name restriction

2015-04-11 Thread Michael Armbrust
That is a good question. Names with `.` in them are in particular broken by SPARK-5632 https://issues.apache.org/jira/browse/SPARK-5632, which I'd like to fix. There is a more general question of whether strings that are passed to DataFrames should be treated as quoted identifiers (i.e. `as

Re: Opening many Parquet files = slow

2015-04-08 Thread Michael Armbrust
Thanks for the report. We improved the speed here in 1.3.1 so it would be interesting to know if this helps. You should also try disabling schema merging if you do not need that feature (i.e. all of your files have the same schema). sqlContext.load(path, "parquet", Map("mergeSchema" -> "false")) On Wed,

Re: The difference between SparkSql/DataFrame join and Rdd join

2015-04-08 Thread Michael Armbrust
(ParallelGC) prio=10 tid=0x7f149402c000 nid=0xe74 runnable VM Periodic Task Thread prio=10 tid=0x7f14940c2800 nid=0xe7c waiting on condition JNI global references: 230 Tell me if anything else is needed. Thank you. Hao. On Tue, Apr 7, 2015 at 8:00 PM, Michael Armbrust mich

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Michael Armbrust
:* Michael Armbrust; user *Subject:* Re: Advice using Spark SQL and Thrift JDBC Server To use the HiveThriftServer2.startWithContext, I thought one would use the following artifact in the build: "org.apache.spark" %% "spark-hive-thriftserver" % "1.3.0" But I am unable to resolve

Re: parquet partition discovery

2015-04-08 Thread Michael Armbrust
Back to the user list so everyone can see the result of the discussion... Ah. It all makes sense now. The issue is that when I created the parquet files, I included an unnecessary directory name (data.parquet) below the partition directories. It’s just a leftover from when I started with

Re: The difference between SparkSql/DataFrame join and Rdd join

2015-04-07 Thread Michael Armbrust
The joins here are totally different implementations, but it is worrisome that you are seeing the SQL join hanging. Can you provide more information about the hang? jstack of the driver and a worker that is processing a task would be very useful. On Tue, Apr 7, 2015 at 8:33 AM, Hao Ren

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-07 Thread Michael Armbrust
1) What exactly is the relationship between the thrift server and Hive? I'm guessing Spark is just making use of the Hive metastore to access table definitions, and maybe some other things, is that the case? Underneath the covers, the Spark SQL thrift server is executing queries using a

Re: scala.MatchError: class org.apache.avro.Schema (of class java.lang.Class)

2015-04-07 Thread Michael Armbrust
Have you looked at spark-avro? https://github.com/databricks/spark-avro On Tue, Apr 7, 2015 at 3:57 AM, Yamini yamini.m...@gmail.com wrote: Using spark(1.2) streaming to read avro schema based topics flowing in kafka and then using spark sql context to register data as temp table. Avro maven

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-07 Thread Michael Armbrust
presumably could also be avoided fairly trivially by periodically restarting the server with a new context internally. That certainly beats manual curation of Hive table definitions, if it will work? Thanks again, James. On 7 April 2015 at 19:30, Michael Armbrust mich...@databricks.com

Re: DataFrame groupBy MapType

2015-04-06 Thread Michael Armbrust
In HiveQL, you should be able to express this as: SELECT ... FROM table GROUP BY m['SomeKey'] On Sat, Apr 4, 2015 at 5:25 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I have a case class like this: case class A( m: Map[Long, Long], ... ) and constructed a DataFrame from
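A sketch of that HiveQL form (the table name is hypothetical; requires a HiveContext):

```scala
// Group rows by the value stored under a specific map key.
val grouped = hiveContext.sql(
  "SELECT m['SomeKey'], count(*) FROM table_a GROUP BY m['SomeKey']")
```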

Re: DataFrame groupBy MapType

2015-04-06 Thread Michael Armbrust
I'll add that I don't think there is a convenient way to do this in the Column API ATM, but would welcome a JIRA for adding it :) On Mon, Apr 6, 2015 at 1:45 PM, Michael Armbrust mich...@databricks.com wrote: In HiveQL, you should be able to express this as: SELECT ... FROM table GROUP BY m

Re: Spark SQL Parquet as External table - 1.3.x HiveMetastoreType now hidden

2015-04-06 Thread Michael Armbrust
Hey Todd, In migrating to 1.3.x I see that the spark.sql.hive.convertMetastoreParquet is no longer public, so the above no longer works. This was probably just a typo, but to be clear, spark.sql.hive.convertMetastoreParquet is still a supported option and should work. You are correct that

Re: Spark Druid integration

2015-04-06 Thread Michael Armbrust
You could certainly build a connector, but it seems like you would want support for pushing down aggregations to get the benefits of Druid. There are only experimental interfaces for doing so today, but it sounds like a pretty cool project. On Mon, Apr 6, 2015 at 2:23 PM, Paolo Platter

Re: Spark SQL code generation

2015-04-06 Thread Michael Armbrust
the subsequent queries are different. On Mon, Apr 6, 2015 at 2:41 PM, Michael Armbrust mich...@databricks.com wrote: It is generated and cached on each of the executors. On Mon, Apr 6, 2015 at 2:32 PM, Akshat Aranya aara...@gmail.com wrote: Hi, I'm curious as to how Spark does code

Re: Spark SQL code generation

2015-04-06 Thread Michael Armbrust
It is generated and cached on each of the executors. On Mon, Apr 6, 2015 at 2:32 PM, Akshat Aranya aara...@gmail.com wrote: Hi, I'm curious as to how Spark does code generation for SQL queries. Following through the code, I saw that an expression is parsed and compiled into a class using

Re: Super slow caching in 1.3?

2015-04-06 Thread Michael Armbrust
Do you think you are seeing a regression from 1.2? Also, are you caching nested data or flat rows? The in-memory caching is not really designed for nested data and so performs pretty slowly here (its just falling back to kryo and even then there are some locking issues). If so, would it be

Re: Generating a schema in Spark 1.3 failed while using DataTypes.

2015-04-02 Thread Michael Armbrust
Do you have a full stack trace? On Thu, Apr 2, 2015 at 11:45 AM, ogoh oke...@gmail.com wrote: Hello, My ETL uses sparksql to generate parquet files which are served through Thriftserver using hive ql. It especially defines a schema programmatically since the schema can be only known at

Re: Generating a schema in Spark 1.3 failed while using DataTypes.

2015-04-02 Thread Michael Armbrust
(DefaultExecutorFactory.java:64) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) On Thu, Apr 2, 2015 at 2:51 PM, Michael Armbrust

Re: Spark SQL does not read from cached table if table is renamed

2015-04-02 Thread Michael Armbrust
I'll add we just backported this so it'll be included in 1.2.2 also. On Wed, Apr 1, 2015 at 4:14 PM, Michael Armbrust mich...@databricks.com wrote: This is fixed in Spark 1.3. https://issues.apache.org/jira/browse/SPARK-5195 On Wed, Apr 1, 2015 at 4:05 PM, Judy Nash judyn

RE: Cannot run the example in the Spark 1.3.0 following the document

2015-04-02 Thread Michael Armbrust
Looks like a typo, try: df.select(df("name"), df("age") + 1) Or df.select("name", "age") PRs to fix docs are always appreciated :) On Apr 2, 2015 7:44 PM, java8964 java8...@hotmail.com wrote: The import command already ran. Forgot to mention, the rest of the examples related to df all

Re: Error in SparkSQL/Scala IDE

2015-04-02 Thread Michael Armbrust
This is actually a problem with our use of Scala's reflection library. Unfortunately you need to load Spark SQL using the primordial classloader, otherwise you run into this problem. If anyone from the scala side can hint how we can tell scala.reflect which classloader to use when creating the

Re: Spark SQL does not read from cached table if table is renamed

2015-04-01 Thread Michael Armbrust
This is fixed in Spark 1.3. https://issues.apache.org/jira/browse/SPARK-5195 On Wed, Apr 1, 2015 at 4:05 PM, Judy Nash judyn...@exchange.microsoft.com wrote: Hi all, Noticed a bug in my current version of Spark 1.2.1. After a table is cached with “cache table table” command, query will

Re: HiveContext setConf seems not stable

2015-04-01 Thread Michael Armbrust
Can you open a JIRA please? On Wed, Apr 1, 2015 at 9:38 AM, Hao Ren inv...@gmail.com wrote: Hi, I find HiveContext.setConf does not work correctly. Here are some code snippets showing the problem: snippet 1:

Re: SparkSQL - Caching RDDs

2015-04-01 Thread Michael Armbrust
What do you mean by "permanently"? If you start up the JDBC server and say CACHE TABLE it will stay cached as long as the server is running. CACHE TABLE is idempotent, so you could even just have that command in your BI tool's setup queries. On Wed, Apr 1, 2015 at 11:02 AM, Venkat, Ankam

Re: Error reading smallin in hive table with parquet format

2015-04-01 Thread Michael Armbrust
Can you try with Spark 1.3? Much of this code path has been rewritten / improved in this version. On Wed, Apr 1, 2015 at 7:53 AM, Masf masfwo...@gmail.com wrote: Hi. In Spark SQL 1.2.0, with HiveContext, I'm executing the following statement: CREATE TABLE testTable STORED AS PARQUET AS

Re: Spark 1.3.0 DataFrame and Postgres

2015-04-01 Thread Michael Armbrust
Can you open a JIRA for this please? On Wed, Apr 1, 2015 at 6:14 AM, Ted Yu yuzhih...@gmail.com wrote: +1 on escaping column names. On Apr 1, 2015, at 5:50 AM, fergjo00 johngfergu...@gmail.com wrote: Question: --- Is there a way to have JDBC DataFrames use quoted/escaped

Re: Broadcasting a parquet file using spark and python

2015-04-01 Thread Michael Armbrust
. Is there any workaround to achieve the same with 1.2.1? Thanks, Jitesh On Wed, Apr 1, 2015 at 12:25 AM, Michael Armbrust mich...@databricks.com wrote: In Spark 1.3 I would expect this to happen automatically when the parquet table is small ( 10mb, configurable

Re: Spark SQL saveAsParquet failed after a few waves

2015-04-01 Thread Michael Armbrust
When few waves (1 or 2) are used in a job, LoadApp could finish after a few failures and retries. But when more waves (3) are involved in a job, the job would terminate abnormally. Can you clarify what you mean by waves? Are you inserting from multiple programs concurrently?

Re: spark.sql.Row manipulation

2015-03-31 Thread Michael Armbrust
You can do something like: df.collect().map { case Row(name: String, age1: Int, age2: Int) => ... } On Tue, Mar 31, 2015 at 4:05 PM, roni roni.epi...@gmail.com wrote: I have 2 parquet files with format e.g. name, age, town I read them and then join them to get all the names which are in

Re: Broadcasting a parquet file using spark and python

2015-03-31 Thread Michael Armbrust
In Spark 1.3 I would expect this to happen automatically when the parquet table is small (< 10 MB, configurable with spark.sql.autoBroadcastJoinThreshold). If you are running 1.3 and not seeing this, can you show the code you are using to create the table? On Tue, Mar 31, 2015 at 3:25 AM, jitesh129

Re: When will 1.3.1 release?

2015-03-30 Thread Michael Armbrust
I'm hoping to cut an RC this week. We are just waiting for a few other critical fixes. On Mon, Mar 30, 2015 at 12:54 PM, Kelly, Jonathan jonat...@amazon.com wrote: Are you referring to SPARK-6330 https://issues.apache.org/jira/browse/SPARK-6330? If you are able to build Spark from source

Re: data frame API, change groupBy result column name

2015-03-30 Thread Michael Armbrust
You'll need to use the longer form for aggregation: tb2.groupBy("city", "state").agg(avg("price").as("newName")).show depending on the language you'll need to import: scala: import org.apache.spark.sql.functions._ python: from pyspark.sql.functions import * On Mon, Mar 30, 2015 at 5:49 PM, Neal Yin

Re: [spark-sql] What is the right way to represent an “Any” type in Spark SQL?

2015-03-28 Thread Michael Armbrust
In this case I'd probably just store it as a String. Our casting rules (which come from Hive) are such that when you use a string as a number or boolean it will be cast to the desired type. Thanks for the PR btw :) On Fri, Mar 27, 2015 at 2:31 PM, Eran Medan ehrann.meh...@gmail.com wrote:
