Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-11 Thread Michael Armbrust
> > Also, while extracting a value into Dataset using as[U] method, how could > I specify a custom encoder/translation to case class (where I don't have > the same column-name mapping or same data-type mapping)? > There is no public API yet for defining your own encoders. You change the column

Re: [Spark-SQL] Custom aggregate function for GrouppedData

2016-01-06 Thread Michael Armbrust
In Spark 1.6 GroupedDataset has mapGroups, which sounds like what you are looking for. You can also write a custom Aggregator
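A rough sketch of the mapGroups approach on the 1.6 Dataset API (the Event class and the aggregation logic are illustrative):

    import sqlContext.implicits._

    case class Event(user: String, amount: Double)

    val ds = Seq(Event("a", 1.0), Event("a", 2.0), Event("b", 3.0)).toDS()

    // groupBy with a key function returns a GroupedDataset; mapGroups sees the key
    // and an iterator over all values for that key
    val totals = ds.groupBy(_.user).mapGroups { (user, events) =>
      (user, events.map(_.amount).sum)
    }
    totals.show()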

Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-06 Thread Michael Armbrust
> > I really appreciate your help. The following code works. > Glad you got it to work! Is there a way this example can be added to the distribution to make it > easier for future Java programmers? It took me a long time to get to this > simple solution. > I'd welcome a pull request that added

Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-06 Thread Michael Armbrust
oh, and I think I installed jekyll using "gem install jekyll" On Wed, Jan 6, 2016 at 4:17 PM, Michael Armbrust <mich...@databricks.com> wrote: > from docs/ run: > > SKIP_API=1 jekyll serve --watch > > On Wed, Jan 6, 2016 at 4:12 PM, Andy Davidson < > a...@san

Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-06 Thread Michael Armbrust
examples are not rendering correctly. I am on a Mac and using > https://itunes.apple.com/us/app/marked-2/id890031187?mt=12 > > I use emacs or some other text editor to change the md. > > What tools do you use for editing/viewing Spark markdown files? > > Andy

Re: Timeout connecting between workers after upgrade to 1.6

2016-01-06 Thread Michael Armbrust
Logs from the workers? On Wed, Jan 6, 2016 at 1:57 PM, Jeff Jones wrote: > I upgraded our Spark standalone cluster from 1.4.1 to 1.6.0 yesterday. We > are now seeing regular timeouts between two of the workers when making > connections. These workers and the same

Re: Spark 1.6 - Datasets and Avro Encoders

2016-01-05 Thread Michael Armbrust
You could try with the `Encoders.bean` method. It detects classes that have getters and setters. Please report back! On Tue, Jan 5, 2016 at 9:45 AM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > Hi everyone, > considering the new Datasets API, will there be Encoders defined for
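A hedged sketch of the bean-encoder route (the Person class is illustrative; a generated Avro class with getters/setters would play the same role):

    import org.apache.spark.sql.Encoders
    import scala.beans.BeanProperty

    class Person(@BeanProperty var name: String, @BeanProperty var age: Int) {
      def this() = this(null, 0)  // bean-style encoders generally expect a no-arg constructor
    }

    val df = sqlContext.read.json("people.json")
    val people = df.as(Encoders.bean(classOf[Person]))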

Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-05 Thread Michael Armbrust
> > I am trying to implement the org.apache.spark.ml.Transformer interface in > Java 8. > My understanding is the pseudo code for transformers is something like > > @Override > > public DataFrame transform(DataFrame df) { > > 1. Select the input column > > 2. Create a new column > > 3. Append the

Re: Spark 1.6 - Datasets and Avro Encoders

2016-01-05 Thread Michael Armbrust
On Tue, Jan 5, 2016 at 1:31 PM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > I'll do, but if you want my two cents, creating a dedicated "optimised" > encoder for Avro would be great (especially if it's possible to do better > than plain AvroKeyValueOutputFormat with

Re: How to concat few rows into a new column in dataframe

2016-01-05 Thread Michael Armbrust
This would also be possible with an Aggregator in Spark 1.6: https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html On Tue, Jan 5, 2016 at 2:59 PM, Ted Yu wrote: > Something like the following: > > val zeroValue =

Re: Is Spark 1.6 released?

2016-01-04 Thread Michael Armbrust
I also wrote about it here: https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html And put together a bunch of examples here: https://docs.cloud.databricks.com/docs/spark/1.6/index.html On Mon, Jan 4, 2016 at 12:02 PM, Annabel Melongo < melongo_anna...@yahoo.com.invalid> wrote:

Re: Is Spark 1.6 released?

2016-01-04 Thread Michael Armbrust
> > bq. In many cases, the current implementation of the Dataset API does not > yet leverage the additional information it has and can be slower than RDDs. > > Are the characteristics of cases above known so that users can decide which > API to use ? > Lots of back to back operations aren't great

Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-04 Thread Michael Armbrust
It's not really possible to convert an RDD to a Column. You can think of a Column as an expression that produces a single output given some set of input columns. If I understand your code correctly, I think this might be easier to express as a UDF: sqlContext.udf().register("stem", new
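A minimal sketch of the UDF route in Scala (the stemming logic and the column names are placeholders):

    import org.apache.spark.sql.functions.callUDF

    // register once, then apply per row; no RDD-to-Column conversion needed
    sqlContext.udf.register("stem", (s: String) => s.toLowerCase.stripSuffix("s"))

    val stemmed = df.withColumn("stemmedText", callUDF("stem", df("text")))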

[ANNOUNCE] Announcing Spark 1.6.0

2016-01-04 Thread Michael Armbrust
Hi All, Spark 1.6.0 is the seventh release on the 1.x line. This release includes patches from 248+ contributors! To download Spark 1.6.0 visit the downloads page. (It may take a while for all mirrors to update.) A huge thanks go to all of the individuals and organizations involved in

Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-28 Thread Michael Armbrust
Unfortunately in 1.5 we didn't force operators to spill when they ran out of memory, so there is not a lot you can do. It would be awesome if you could test with 1.6 and see if things are any better. On Mon, Dec 28, 2015 at 2:25 PM, Andy Davidson < a...@santacruzintegration.com> wrote: > I am using

Re: partitioning json data in spark

2015-12-28 Thread Michael Armbrust
I don't think that's true (though if the docs are wrong we should fix that). In Spark 1.5 we converted JSON to go through the same code path as parquet. On Mon, Dec 28, 2015 at 12:20 AM, Նարեկ Գալստեան wrote: > Well, I could try to do that, > but *partitionBy* method is

Re: Passing parameters to spark SQL

2015-12-27 Thread Michael Armbrust
The only way to do this for SQL is though the JDBC driver. However, you can use literal values without lossy/unsafe string conversions by using the DataFrame API. For example, to filter: import org.apache.spark.sql.functions._ df.filter($"columnName" === lit(value)) On Sun, Dec 27, 2015 at

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Michael Armbrust
You need to say .mode("append") if you want to append to existing data. On Tue, Dec 22, 2015 at 6:48 AM, Yash Sharma wrote: > Well you are right. Having a quick glance at the source[1] I see that the > path creation does not consider the partitions. > > It tries to create
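A small sketch of what the append write might look like with the spark-avro package (paths and partition columns are illustrative):

    import org.apache.spark.sql.SaveMode

    df.write
      .mode(SaveMode.Append)                    // equivalent to .mode("append")
      .partitionBy("year", "month")
      .format("com.databricks.spark.avro")
      .save("hdfs:///data/events")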

Re: Joining DataFrames - Causing Cartesian Product

2015-12-18 Thread Michael Armbrust
This is fixed in Spark 1.6. On Fri, Dec 18, 2015 at 3:06 PM, Prasad Ravilla wrote: > Changing equality check from “<=>”to “===“ solved the problem. > Performance skyrocketed. > > I am wondering why “<=>” cause a performance degrade? > > val dates = new RetailDates() > val

Re: Is DataFrame.groupBy supposed to preserve order within groups?

2015-12-18 Thread Michael Armbrust
You need to use window functions to get this kind of behavior. Or use max and a struct ( http://stackoverflow.com/questions/13523049/hive-sql-find-the-latest-record) On Thu, Dec 17, 2015 at 11:55 PM, Timothée Carayol < timothee.cara...@gmail.com> wrote: > Hi all, > > I tried to do something
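A sketch of the max-of-struct trick: order the struct fields so the ordering column comes first, take the max per group, then pull the payload back out (column names are illustrative):

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._

    val latest = df
      .groupBy($"userId")
      .agg(max(struct($"eventTime", $"status")).as("latest"))
      .select($"userId", $"latest.status".as("latestStatus"))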

Re: how to make a dataframe of Array[Doubles] ?

2015-12-15 Thread Michael Armbrust
You don't have to turn your array into a tuple, but you do need to have a product that wraps it (this is how we get names for the columns). case class MyData(data: Array[Double]) val df = Seq(MyData(Array(1.0, 2.0, 3.0, 4.0)), ...).toDF() On Mon, Dec 14, 2015 at 9:35 PM, Jeff Zhang

Re: Concatenate a string to a Column of type string in DataFrame

2015-12-14 Thread Michael Armbrust
In earlier versions you should be able to use callUdf or callUDF (depending on which version) and call the hive function "concat". On Sun, Dec 13, 2015 at 3:05 AM, Yanbo Liang wrote: > Sorry, it was added since 1.5.0. > > 2015-12-13 2:07 GMT+08:00 Satish

Re: RuntimeException: Failed to check null bit for primitive int type

2015-12-14 Thread Michael Armbrust
Your code (at com.ctrip.ml.toolimpl.MetadataImpl$$anonfun$1.apply(MetadataImpl.scala:22)) needs to check isNullAt before calling getInt. This is because you cannot return null for a primitive value (Int). On Mon, Dec 14, 2015 at 3:40 AM, zml张明磊 wrote: > Hi, > > > >
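A minimal sketch of the guard (the column index is illustrative):

    // getInt on a null cell throws, because an Int cannot hold null;
    // check isNullAt first and map to an Option instead
    val ages = df.map { row =>
      if (row.isNullAt(0)) None else Some(row.getInt(0))
    }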

Re: Kryo serialization fails when using SparkSQL and HiveContext

2015-12-14 Thread Michael Armbrust
You'll need to either turn off registration (spark.kryo.registrationRequired) or create a custom register spark.kryo.registrator http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization On Mon, Dec 14, 2015 at 2:17 AM, Linh M. Tran wrote: >

Re: spark data frame write.mode("append") bug

2015-12-12 Thread Michael Armbrust
If you want to contribute to the project open a JIRA/PR: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark On Sat, Dec 12, 2015 at 3:13 AM, kali.tumm...@gmail.com < kali.tumm...@gmail.com> wrote: > Hi All, > > >

Re: Window function in Spark SQL

2015-12-11 Thread Michael Armbrust
Can you change permissions on that directory so that hive can write to it? We start up a mini version of hive so that we can use some of its functionality. On Fri, Dec 11, 2015 at 12:47 PM, Sourav Mazumder < sourav.mazumde...@gmail.com> wrote: > In 1.5.x whenever I try to create a HiveContext

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-11 Thread Michael Armbrust
and explained that multiple contexts per JVM is not really > supported. So, via job server, how does one support multiple contexts in > DIFFERENT JVM's? I specify multiple contexts in the conf file and the > initialization of the subsequent contexts fail. > > > > On Fr

Re: Using TestHiveContext/HiveContext in unit tests

2015-12-11 Thread Michael Armbrust
Just use TestHive. It's a global singleton that you can share across test cases. It has a reset function if you want to clear out the state at the beginning of a test. On Fri, Dec 11, 2015 at 2:06 AM, Sahil Sareen wrote: > I'm trying to do this in unit tests: > > val
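A rough sketch of a test built on the TestHive singleton (ScalaTest is assumed here; the table and query are placeholders):

    import org.apache.spark.sql.hive.test.TestHive
    import org.scalatest.FunSuite

    class MyQuerySuite extends FunSuite {
      test("runs a query against TestHive") {
        TestHive.reset()   // clear any state left behind by other suites
        import TestHive.implicits._
        val df = Seq((1, "a"), (2, "b")).toDF("key", "value")
        df.registerTempTable("kv")
        assert(TestHive.sql("SELECT count(*) FROM kv").collect().head.getLong(0) == 2)
      }
    }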

Re: [Spark-1.5.2][Hadoop-2.6][Spark SQL] Cannot run queries in SQLContext, getting java.lang.NoSuchMethodError

2015-12-09 Thread Michael Armbrust
java.lang.NoSuchMethodError almost always means you have the wrong version of some library (different from what Spark was compiled with) on your classpath; in this case, the Jackson parser. On Wed, Dec 9, 2015 at 10:38 AM, Matheus Ramos wrote: > I have a Java

Re: Release data for spark 1.6?

2015-12-09 Thread Michael Armbrust
The release date is "as soon as possible". In order to make an Apache release we must present a release candidate and have 72 hours of voting by the PMC. As soon as there are no known bugs, the vote will pass and 1.6 will be released. In the meantime, I'd love support from the community

Re: Dataset and lambas

2015-12-07 Thread Michael Armbrust
ternal types. could you point me to the jiras, > if they exist already? i just tried to find them but had little luck. > best, koert > > On Sat, Dec 5, 2015 at 4:09 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipe

Re: Dataset and lambas

2015-12-07 Thread Michael Armbrust
On Sat, Dec 5, 2015 at 3:27 PM, Deenar Toraskar wrote: > > On a similar note, what is involved in getting native support for some > user defined functions, so that they are as efficient as native Spark SQL > expressions? I had one particular one - an arraySum (element

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-05 Thread Michael Armbrust
> > Follow up question in this case: what is the cost of registering a temp > table? Is there a limit to the number of temp tables that can be registered > by Spark context? > It is pretty cheap. Just an entry in an in-memory hashtable to a query plan (similar to a view).

Re: Exception in thread "main" java.lang.IncompatibleClassChangeError:

2015-12-05 Thread Michael Armbrust
It seems likely you have conflicting versions of hadoop on your classpath. On Fri, Dec 4, 2015 at 2:52 PM, Prem Sure wrote: > Getting below exception while executing below program in eclipse. > any clue on whats wrong here would be helpful > > *public* *class* WordCount {

Re: Broadcasting a parquet file using spark and python

2015-12-05 Thread Michael Armbrust
e > it by myself (create a broadcast val and implement lookup by myself), but > it will make code super ugly. > > > > I hope we can have either API or hint to enforce the hashjoin (instead of > this suspicious autoBroadcastJoinThreshold parameter). Do we have any > ticket

Re: Dataset and lambas

2015-12-05 Thread Michael Armbrust
On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers wrote: > hello all, > DataFrame internally uses a different encoding for values than what the > user sees. i assume the same is true for Dataset? > This is true. We encode objects in the tungsten binary format using code

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread Michael Armbrust
On Fri, Dec 4, 2015 at 11:24 AM, Anfernee Xu wrote: > If multiple users are looking at the same data set, then it's good choice > to share the SparkContext. > > But my usercases are different, users are looking at different data(I use > custom Hadoop InputFormat to load

Re: is Multiple Spark Contexts is supported in spark 1.5.0 ?

2015-12-04 Thread Michael Armbrust
To be clear, I don't think there is ever a compelling reason to create more than one SparkContext in a single application. The context is threadsafe and can launch many jobs in parallel from multiple threads. Even if there wasn't global state that made it unsafe to do so, creating more than one

Re: Spark SQL IN Clause

2015-12-04 Thread Michael Armbrust
The best way to run this today is probably to manually convert the query into a join. I.e. create a dataframe that has all the numbers in it, and join/outer join it with the other table. This way you avoid parsing a gigantic string. On Fri, Dec 4, 2015 at 10:36 AM, Ted Yu
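A minimal sketch of the rewrite (table and column names are illustrative):

    import sqlContext.implicits._

    // materialize the IN-list as its own DataFrame instead of a giant SQL string
    val ids = sc.parallelize(1L to 1000000L).map(Tuple1(_)).toDF("id")

    val events = sqlContext.table("events")
    val matched = events.join(ids, "id")   // inner join keeps only rows whose id is in the list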

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-02 Thread Michael Armbrust
You might also coalesce to 1 (or some small number) before writing, to avoid creating a lot of files in that partition if you know that there is not a ton of data. On Wed, Dec 2, 2015 at 12:59 AM, Rishi Mishra wrote: > As long as all your data is being inserted by Spark,
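A small sketch of that idea (the target partition path is illustrative):

    // shrink to a handful of output files before writing into the static partition
    df.coalesce(1)
      .write
      .mode("append")
      .parquet("hdfs:///warehouse/events/dt=2015-12-02")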

Re: Getting all files of a table

2015-12-01 Thread Michael Armbrust
sqlContext.table("...").inputFiles (this is best effort, but should work for hive tables). Michael On Tue, Dec 1, 2015 at 10:55 AM, Krzysztof Zarzycki wrote: > Hi there, > Do you know how easily I can get a list of all files of a Hive table? > > What I want to achieve is

Re: Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Michael Armbrust
Here is how I view the relationship between the various components of Spark: - *RDDs* - a low level API for expressing DAGs that will be executed in parallel by Spark workers - *Catalyst* - an internal library for expressing trees that we use to build relational algebra and expression

Re: Error not found value sqlContext

2015-11-19 Thread Michael Armbrust
http://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13 On Thu, Nov 19, 2015 at 4:19 AM, satish chandra j wrote: > HI All, > we have recently migrated from Spark 1.2.1 to Spark 1.4.0, I am fetching > data from an RDBMS using

Re: Do windowing functions require hive support?

2015-11-18 Thread Michael Armbrust
Yes they do. On Wed, Nov 18, 2015 at 7:49 PM, Stephen Boesch wrote: > But to focus the attention properly: I had already tried out 1.5.2. > > 2015-11-18 19:46 GMT-08:00 Stephen Boesch : > >> Checked out 1.6.0-SNAPSHOT 60 minutes ago >> >> 2015-11-18 19:19

Re: Partitioned Parquet based external table

2015-11-12 Thread Michael Armbrust
Note that if you read in the table using sqlContext.read.parquet(...) or if you use saveAsTable(...) the partitions will be auto-discovered. However, this is not compatible with Hive if you also want to be able to read the data there. On Thu, Nov 12, 2015 at 6:23 AM, Chandra Mohan, Ananda Vel

Re: Querying nested struct fields

2015-11-10 Thread Michael Armbrust
Use a `.`: hc.sql("select _1.item_id from agg_imps_df limit 10").collect() On Tue, Nov 10, 2015 at 11:24 AM, pratik khadloya wrote: > Hello, > > I just saved a PairRDD as a table, but i am not able to query it > correctly. The below and other variations does not seem to

Re: Querying nested struct fields

2015-11-10 Thread Michael Armbrust
un$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > > On Tue, Nov 10, 2015 at 11:25 AM Michael Armbrust <mi

Re: Anybody hit this issue in spark shell?

2015-11-09 Thread Michael Armbrust
Yeah, we should probably remove that. On Mon, Nov 9, 2015 at 5:54 PM, Ted Yu wrote: > If there is no option to let shell skip processing @VisibleForTesting , > should the annotation be dropped ? > > Cheers > > On Mon, Nov 9, 2015 at 5:50 PM, Marcelo Vanzin

Re: Overriding Derby in hive-site.xml giving strange results...

2015-11-09 Thread Michael Armbrust
We have two copies of hive running in order to support multiple versions of hive with a single version of Spark. You are seeing log messages for the version that we use for execution (it just creates a temporary Derby metastore). On Mon, Nov 9, 2015 at 3:32 PM, mayurladwa

Re: Fwd: Re: DataFrame equality does not working in 1.5.1

2015-11-06 Thread Michael Armbrust
In particular this is sounding like: https://issues.apache.org/jira/browse/SPARK-10859 On Fri, Nov 6, 2015 at 1:05 PM, Michael Armbrust <mich...@databricks.com> wrote: > I would be great if you could try sql("SET > spark.sql.inMemoryColumnarStorage.partitionPruning=false&

Re: Unable to register UDF with StructType

2015-11-06 Thread Michael Armbrust
e pointers for creating Dynamic > Case Classes. > > TIA. > > > On Fri, Nov 6, 2015 at 12:20 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> You are returning the type StructType not an instance of a struct (i.e. >> StringType instead of &qu

Re: Fwd: Re: DataFrame equality does not working in 1.5.1

2015-11-06 Thread Michael Armbrust
I would be great if you could try sql("SET spark.sql.inMemoryColumnarStorage.partitionPruning=false") also, try Spark 1.5.2-RC2 On Fri, Nov 6, 2015 at 4:49 AM, Seongduk Cheon wrote: > Hi Yanal! > > Yes,

Re: Spark SQL supports operating on a thrift data sources

2015-11-05 Thread Michael Armbrust
This would make an awesome spark-package. I'd suggest looking at spark-avro as an example: https://github.com/databricks/spark-avro On Thu, Nov 5, 2015 at 11:21 AM, Jaydeep Vishwakarma < jaydeep.vishwaka...@inmobi.com> wrote: > Hi, > > I want to load thrift serialised data through sqlcontext and

Re: Unable to register UDF with StructType

2015-11-05 Thread Michael Armbrust
You are returning the type StructType, not an instance of a struct (i.e. StringType instead of "string"). If you'd like to return a struct you should return a case class. case class StringInfo(numChars: Int, firstLetter: String) udf((s: String) => StringInfo(s.size, s.head.toString)) If you'd like to

Re: Guava ClassLoading Issue When Using Different Hive Metastore Version

2015-11-05 Thread Michael Armbrust
I would be in favor of limiting the scope here. The problem you might run into is that FinalizableReferenceQueue uses the

Re: How to handle Option[Int] in dataframe

2015-11-03 Thread Michael Armbrust
In Spark 1.6 there is an experimental new feature called Datasets. You can call df.as[Student] and it should do what you want. Would love any feedback you have if you get a chance to try it out (we'll hopefully publish a preview release next week). On Mon, Nov 2, 2015 at 9:30 PM, manas kar
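A minimal sketch of that call, with the nullable field mapped to an Option (the Student class and input path are illustrative):

    import sqlContext.implicits._

    case class Student(name: String, age: Option[Int])

    val ds = sqlContext.read.json("students.json").as[Student]
    ds.filter(_.age.exists(_ > 20)).show()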

Re: SparkSQL implicit conversion on insert

2015-11-03 Thread Michael Armbrust
Today you have to do an explicit conversion. I'd really like to open up a public UDT interface as part of Spark Datasets (SPARK-) that would allow you to register custom classes with conversions, but this won't happen till Spark 1.7 likely. On Mon, Nov 2, 2015 at 8:40 PM, Bryan Jeffrey

Re: Pulling data from a secured SQL database

2015-10-31 Thread Michael Armbrust
I would try using the JDBC Data Source and save the data to parquet. You can then put that data on your Spark cluster (probably

Re: key not found: sportingpulse.com in Spark SQL 1.5.0

2015-10-31 Thread Michael Armbrust
This is a bug in DataFrame caching. You can avoid caching or turn off compression. It is fixed in Spark 1.5.1 On Sat, Oct 31, 2015 at 2:31 AM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > I don’t believe I have it on 1.5.1. Are you able to test the data locally > to confirm, or is

Re: Issue of Hive parquet partitioned table schema mismatch

2015-10-30 Thread Michael Armbrust
> > We have tried schema merging feature, but it's too slow, there're hundreds > of partitions. > Which version of Spark?

Re: SparkSQL: What is the cost of DataFrame.registerTempTable(String)? Can I have multiple tables referencing to the same DataFrame?

2015-10-29 Thread Michael Armbrust
It's super cheap. It's just a hashtable stored on the driver. Yes, you can have more than one name for the same DF. On Wed, Oct 28, 2015 at 6:17 PM, Anfernee Xu wrote: > Hi, > > I just want to understand the cost of DataFrame.registerTempTable(String), > is it just a

Re: Collect Column as Array in Grouped DataFrame

2015-10-29 Thread Michael Armbrust
You can use a Hive UDF. import org.apache.spark.sql.functions._ callUDF("collect_set", $"columnName") or just SELECT collect_set(columnName) FROM ... Note that in 1.5 I think this actually does not use tungsten. In 1.6 it should though. I'll add that the experimental Dataset API (preview in

Re: Inconsistent Persistence of DataFrames in Spark 1.5

2015-10-29 Thread Michael Armbrust
There were several bugs in Spark 1.5 and we strongly recommend you upgrade to 1.5.1. If the issue persists it would be helpful to see the result of calling explain. On Wed, Oct 28, 2015 at 4:46 PM, wrote: > Hi, just a couple cents. > > > > are your joining columns

Re: SPARK SQL- Parquet projection pushdown for nested data

2015-10-29 Thread Michael Armbrust
Yeah, this is unfortunate. It would be good to fix this, but it's a non-trivial change. Tracked here if you'd like to vote on the issue: https://issues.apache.org/jira/browse/SPARK-4502 On Thu, Oct 29, 2015 at 6:00 PM, Sadhan Sood wrote: > I noticed when querying struct

Re: Hive Version

2015-10-28 Thread Michael Armbrust
Documented here: http://spark.apache.org/docs/1.4.1/sql-programming-guide.html#interacting-with-different-versions-of-hive-metastore In 1.4.1 we compile against 0.13.1 On Wed, Oct 28, 2015 at 2:26 PM, Bryan Jeffrey wrote: > All, > > I am using a HiveContext to create

Re: Spark SQL Persistent Table - joda DateTime Compatability

2015-10-27 Thread Michael Armbrust
You'll need to convert it to a java.sql.Timestamp. On Tue, Oct 27, 2015 at 4:33 PM, Bryan Jeffrey wrote: > Hello. > > I am working to create a persistent table using SparkSQL HiveContext. I > have a basic Windows event case class: > > case class WindowsEvent( >

Re: How to implement zipWithIndex as a UDF?

2015-10-23 Thread Michael Armbrust
The user facing type mapping is documented here: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types On Fri, Oct 23, 2015 at 12:10 PM, Benyi Wang wrote: > If I have two columns > > StructType(Seq( > StructField("id", LongType), >

Re: How to distinguish columns when joining DataFrames with shared parent?

2015-10-21 Thread Michael Armbrust
Unfortunately, the mechanisms that we use to differentiate columns automatically don't work particularly well in the presence of self joins. However, you can get it work if you use the $"column" syntax consistently: val df = Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)).toDF("key",
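A hedged sketch of one way to disambiguate a self join with explicit aliases (an alternative to spelling every column with the $"column" syntax):

    import sqlContext.implicits._

    val df = Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)).toDF("key", "value")

    // give each side of the self join its own alias, then qualify every column
    val joined = df.as("small").join(df.as("large"), $"small.key" === $"large.key")
    joined.select($"small.value".as("smallValue"), $"large.value".as("largeValue")).show()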

Re: How to distinguish columns when joining DataFrames with shared parent?

2015-10-21 Thread Michael Armbrust
--+ > | 10| > | 20| > +-+ > > > scala> j.select(largeValues("lv.value")).show > +-+ > |value| > +-+ > |1| > |5| > +-+ > > Or does this behavior have the same root cause as detailed in Michael's > email? > > > -I

Re: hive thriftserver and fair scheduling

2015-10-20 Thread Michael Armbrust
Not the most obvious place in the docs... but this is probably helpful: https://spark.apache.org/docs/latest/sql-programming-guide.html#scheduling You likely want to put each user in their own pool. On Tue, Oct 20, 2015 at 11:55 AM, Sadhan Sood wrote: > Hi All, > > Does

Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Michael Armbrust
For compatibility reasons, we always write data out as nullable in parquet. Given that that bit is only an optimization that we don't actually make much use of, I'm curious why you are worried that it's changing to true? On Tue, Oct 20, 2015 at 8:24 AM, Jerry Lam wrote: >

Re: Hive custom transform scripts in Spark?

2015-10-20 Thread Michael Armbrust
; org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) > > at scala.Option.foreach(Option.scala:236) > > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) > > at > org

Re: Hive custom transform scripts in Spark?

2015-10-20 Thread Michael Armbrust
We support TRANSFORM. Are you having a problem using it? On Tue, Oct 20, 2015 at 8:21 AM, wuyangjack wrote: > How to reuse hive custom transform scripts written in python or c++? > > These scripts process data from stdin and print to stdout in spark. > They use the

Re: [spark1.5.1] HiveQl.parse throws org.apache.spark.sql.AnalysisException: null

2015-10-20 Thread Michael Armbrust
That's not really intended to be a public API, as there is some internal setup that needs to be done for Hive to work. Have you created a HiveContext in the same thread? Is there more to that stacktrace? On Tue, Oct 20, 2015 at 2:25 AM, Ayoub wrote: > Hello, > >

Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Michael Armbrust
> > First, this is not documented in the official document. Maybe we should do > it? http://spark.apache.org/docs/latest/sql-programming-guide.html > Pull requests welcome. > Second, nullability is a significant concept in the database people. It is > part of schema. Extra codes are needed for

Re: Concurrency/Multiple Users

2015-10-19 Thread Michael Armbrust
Unfortunately the implementation of SPARK-2087 didn't have enough tests and got broken in 1.4. In Spark 1.6 we will have a much more solid fix: https://github.com/apache/spark/commit/3390b400d04e40f767d8a51f1078fcccb4e64abd On Mon, Oct 19, 2015 at 2:13 PM, GooniesNeverSayDie

Re: flattening a JSON data structure

2015-10-19 Thread Michael Armbrust
Quickfix is probably to use Seq[Row] instead of Array (the types that are returned are documented here: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types) Really though you probably want to be using explode. Perhaps something like this would help? import
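A rough sketch of the explode approach (the JSON layout and column names are illustrative):

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._

    // e.g. each record looks like {"user": "a", "items": [{"id": 1}, {"id": 2}]}
    val df = sqlContext.read.json("events.json")

    val flat = df
      .select($"user", explode($"items").as("item"))   // one row per array element
      .select($"user", $"item.id".as("itemId"))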

Re: Dynamic partition pruning

2015-10-16 Thread Michael Armbrust
We don't support dynamic partition pruning yet. On Fri, Oct 16, 2015 at 10:20 AM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > Hi all > > > > I’m running sqls on spark 1.5.1 and using tables based on parquets. > > My tables are not pruned when joined on partition columns. > > Ex: >

Re: Spark SQL running totals

2015-10-15 Thread Michael Armbrust
Check out: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html On Thu, Oct 15, 2015 at 11:35 AM, Deenar Toraskar wrote: > you can do a self join of the table with itself with the join clause being > a.col1 >= b.col1 > > select
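A sketch of a running total with the DataFrame window API (Spark 1.5+; column names are illustrative):

    import sqlContext.implicits._
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val w = Window
      .partitionBy($"accountId")
      .orderBy($"txnDate")
      .rowsBetween(Long.MinValue, 0)   // start of the partition up to the current row

    val withRunningTotal = df.withColumn("runningTotal", sum($"amount").over(w))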

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Michael Armbrust
This won't help, for two reasons: 1) It's all still just creating lineage since you aren't caching the partitioned data. It will still fetch the shuffled blocks for each query. 2) The query optimizer is not aware of RDD-level partitioning since it's mostly a black box. 1) could be fixed by

Re: thriftserver: access temp dataframe from in-memory of spark-shell

2015-10-14 Thread Michael Armbrust
Yes, call startWithContext from the spark shell: https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L56 On Wed, Oct 14, 2015 at 7:10 AM, wrote: > Hi, > > Is it possible to
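A minimal sketch of the shell session (paths and table names are illustrative; the shell's sqlContext must be a HiveContext, which it is when Spark is built with Hive support):

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val df = sqlContext.read.parquet("hdfs:///data/events")
    df.registerTempTable("events_tmp")   // now visible to JDBC clients of the server below

    HiveThriftServer2.startWithContext(sqlContext.asInstanceOf[HiveContext])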

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Michael Armbrust
Caching the partitioned_df <- this one, but you have to do the partitioning using something like sql("SELECT * FROM ... CLUSTER BY a") as there is no such operation exposed on dataframes. 2) Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-5354

Re: PySpark - Hive Context Does Not Return Results but SQL Context Does for Similar Query.

2015-10-14 Thread Michael Armbrust
No link to the original stack overflow so I can up my reputation? :) This is likely not a difference between HiveContext/SQLContext, but instead a difference between a table where the metadata is coming from the HiveMetastore vs the SparkSQL Data Source API. I would guess that if you create the

Re: Spark DataFrame GroupBy into List

2015-10-14 Thread Michael Armbrust
> Can you be more specific on `collect_set`? Is it a built-in function or, > if it is a UDF, how is it defined? > > BR, > Todd Leo > > On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust <mich...@databricks.com> > wrote: > > import org.apache.spark.sql.functions._ > >

Re: Reusing Spark Functions

2015-10-14 Thread Michael Armbrust
Unless its a broadcast variable, a new copy will be deserialized for every task. On Wed, Oct 14, 2015 at 10:18 AM, Starch, Michael D (398M) < michael.d.sta...@jpl.nasa.gov> wrote: > All, > > Is a Function object in Spark reused on a given executor, or is sent and > deserialized with each new

Re: Spark DataFrame GroupBy into List

2015-10-13 Thread Michael Armbrust
import org.apache.spark.sql.functions._ df.groupBy("category") .agg(callUDF("collect_set", df("id")).as("id_list")) On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu wrote: > Hey Spark users, > > I'm trying to group by a dataframe, by appending occurrences into a list >

Re: Using a variable (a column name) in an IF statement in Spark SQL

2015-10-09 Thread Michael Armbrust
wso2.com> wrote: > Spark version: 1.4.1 > The schema is "barcode STRING, items INT" > > On Thu, Oct 8, 2015 at 10:48 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> Hmm, that looks like it should work to me. What version of Spark? What >

Re: error in sparkSQL 1.5 using count(1) in nested queries

2015-10-09 Thread Michael Armbrust
Thanks for reporting: https://issues.apache.org/jira/browse/SPARK-11032 You can probably workaround this by aliasing the count and just doing a filter on that value afterwards. On Thu, Oct 8, 2015 at 8:47 PM, Jeff Thompson < jeffreykeatingthomp...@gmail.com> wrote: > After upgrading from 1.4.1
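A hedged sketch of the workaround (the schema is illustrative): alias the count inside the subquery and filter on the alias outside, instead of repeating count(1) in the outer predicate.

    val result = sqlContext.sql("""
      SELECT dept, cnt
      FROM (SELECT dept, count(1) AS cnt FROM employees GROUP BY dept) t
      WHERE cnt > 10
    """)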

Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Michael Armbrust
You can use callUDF("percentile_approx", col("mycol"), lit(0.25)) to call Hive UDFs from dataframes. On Fri, Oct 9, 2015 at 12:01 PM, unk1102 wrote: > Hi, how to calculate the percentile of a column in a DataFrame? I can't find any > percentile_approx function in Spark aggregation functions. For

Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Michael Armbrust
rom:* Umesh Kacha [mailto:umesh.ka...@gmail.com] > *Sent:* Friday, October 09, 2015 4:10 PM > *To:* Ellafi, Saif A. > *Cc:* Michael Armbrust; user > > *Subject:* Re: How to calculate percentile of a column of DataFrame? > > > > I found it in 1.3 documentation lit says somethi

Re: Using Sqark SQL mapping over an RDD

2015-10-08 Thread Michael Armbrust
You can't do nested operations on RDDs or DataFrames (i.e. you can't create a DataFrame from within a map function). Perhaps if you explain what you are trying to accomplish someone can suggest another way. On Thu, Oct 8, 2015 at 10:10 AM, Afshartous, Nick wrote: > >

Re: Default size of a datatype in SparkSQL

2015-10-08 Thread Michael Armbrust
It's purely for estimation, when guessing whether it's safe to do a broadcast join. We picked a random number that we thought was larger than the common case (it's better to overestimate to avoid OOM). On Wed, Oct 7, 2015 at 10:11 PM, vivek bhaskar wrote: > I want to understand

Re: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Michael Armbrust
Which version of Spark? On Thu, Oct 8, 2015 at 7:25 AM, wrote: > Hi all, would this be a bug?? > > val ws = Window. > partitionBy("clrty_id"). > orderBy("filemonth_dtt") > > val nm = "repeatMe" >

Re: Using a variable (a column name) in an IF statement in Spark SQL

2015-10-08 Thread Michael Armbrust
Hmm, that looks like it should work to me. What version of Spark? What is the schema of goods? On Thu, Oct 8, 2015 at 6:13 AM, Maheshakya Wijewardena wrote: > Hi, > > Suppose there is data frame called goods with columns "barcode" and > "items". Some of the values in the

Re: Using Sqark SQL mapping over an RDD

2015-10-08 Thread Michael Armbrust
val withoutAnalyticsId = sqlContext.sql("select * from ad_info > where deviceId = '%1s' order by messageTime desc limit 1" format (deviceId)) > >withoutAnalyticsId.take(1)(0) >} > }) > > > > > > Fro

Re: Dataframes - sole data structure for parallel computations?

2015-10-08 Thread Michael Armbrust
Don't worry, the ability to work with domain objects and lambda functions is not going to go away. However, we are looking at ways to leverage Tungsten's improved performance when processing structured data. More details can be found here: https://issues.apache.org/jira/browse/SPARK- On

Re: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Michael Armbrust
owNumber in HiveContext returns null or negative values > > > > Hi, thanks for looking into. v1.5.1. I am really worried. > > I dont have hive/hadoop for real in the environment. > > > > Saif > > > > *From:* Michael Armbrust [mailto:mich...@databricks.com &

Re: SparkSQL: First query execution is always slower than subsequent queries

2015-10-07 Thread Michael Armbrust
-dev +user 1). Is that the reason why it's always slow in the first run? Or are there > any other reasons? Apparently it loads data to memory every time so it > shouldn't be something to do with disk read should it? > You are probably seeing the effect of the JVMs JIT. The first run is

Re: Does feature parity exist between Spark and PySpark

2015-10-07 Thread Michael Armbrust
> > At my company we use Avro heavily and it's not been fun when i've tried to > work with complex avro schemas and python. This may not be relevant to you > however...otherwise I found Python to be a great fit for Spark :) > Have you tried using https://github.com/databricks/spark-avro ? It

Re: ORC files created by Spark job can't be accessed using hive table

2015-10-06 Thread Michael Armbrust
I believe this is fixed in Spark 1.5.1 as long as the table is only using types that hive understands and is not partitioned. The problem with partitioned tables is that hive does not support dynamic discovery unless you manually run the repair command. On Tue, Oct 6, 2015 at 9:33 AM, Umesh
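A small sketch of the repair step, assuming a HiveContext and a table name of your own:

    // tell the Hive metastore about partitions written outside of Hive
    hiveContext.sql("MSCK REPAIR TABLE my_orc_table")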
