Re: specifying schema on dataframe

2017-02-05 Thread Michael Armbrust
-dev You can use withColumn to change the type after the data has been loaded. On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin
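A minimal sketch of that withColumn approach (column names and types are assumptions, not from the thread):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.IntegerType

val spark = SparkSession.builder().appName("cast-after-load").getOrCreate()
import spark.implicits._

// Load first, then overwrite the column with a casted copy of itself;
// withColumn with an existing name replaces that column.
val raw   = Seq(("1", "a"), ("2", "b")).toDF("id", "label")
val typed = raw.withColumn("id", $"id".cast(IntegerType))
typed.printSchema() // id: integer, label: string
```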

Re: frustration with field names in Dataset

2017-02-02 Thread Michael Armbrust
That might be reasonable. At least I can't think of any problems with doing that. On Thu, Feb 2, 2017 at 7:39 AM, Koert Kuipers wrote: > since a dataset is a typed object you ideally don't have to think about > field names. > > however there are operations on Dataset that

Re: Parameterized types and Datasets - Spark 2.1.0

2017-02-01 Thread Michael Armbrust
wing code snippet so the import spark.implicits._ > would take effect: > > // ugly hack to get around Encoder can't be found compile time errors > > private object myImplicits extends SQLImplicits { > > protected override def _sqlContext: SQLContext = MySparkSingleton. > get

Re: Parameterized types and Datasets - Spark 2.1.0

2017-02-01 Thread Michael Armbrust
You need to enforce that an Encoder is available for the type A using a context bound. import org.apache.spark.sql.Encoder abstract class RawTable[A : Encoder](inDir: String) { ... } On Tue, Jan 31, 2017 at 8:12 PM, Don Drake
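A runnable sketch of that context-bound pattern; the load method and the parquet source are illustrative assumptions:

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

// The [A : Encoder] bound puts an implicit Encoder[A] in scope,
// which is what .as[A] needs at compile time.
abstract class RawTable[A : Encoder](inDir: String) {
  def load(spark: SparkSession): Dataset[A] =
    spark.read.parquet(inDir).as[A]
}
```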

Re: using withWatermark on Dataset

2017-02-01 Thread Michael Armbrust
Can you give the full stack trace? Also which version of Spark are you running? On Wed, Feb 1, 2017 at 10:38 AM, Jerry Lam wrote: > Hi everyone, > > Anyone knows how to use withWatermark on Dataset? > > I have tried the following but hit this exception: > > dataset

Re: kafka structured streaming source refuses to read

2017-01-30 Thread Michael Armbrust
Thanks for following up! I've linked the relevant tickets to SPARK-18057 and I targeted it for Spark 2.2. On Sat, Jan 28, 2017 at 10:15 AM, Koert Kuipers wrote: > there was also already an existing spark ticket for

Re: kafka structured streaming source refuses to read

2017-01-27 Thread Michael Armbrust
Yeah, kafka server client compatibility can be pretty confusing and does not give good errors in the case of mismatches. This should be addressed in the next release of kafka (they are adding an API to query the servers capabilities). On Fri, Jan 27, 2017 at 12:56 PM, Koert Kuipers

Re: printSchema showing incorrect datatype?

2017-01-25 Thread Michael Armbrust
Encoders are just an object-based view on a Dataset. Until you actually materialize an object, they are not used and thus will not change the schema of the dataframe. On Tue, Jan 24, 2017 at 8:28 AM, Koert Kuipers wrote: > scala> val x = Seq("a", "b").toDF("x") > x:

Re: Setting startingOffsets to earliest in structured streaming never catches up

2017-01-23 Thread Michael Armbrust
+1 to Ryan's suggestion of setting maxOffsetsPerTrigger. This way you can at least see how quickly it is making progress towards catching up. On Sun, Jan 22, 2017 at 7:02 PM, Timothy Chan wrote: > I'm using version 2.02. > > The difference I see between using latest and
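A hedged sketch of Ryan's suggestion (servers, topic, and the limit value are assumptions; `spark` is an existing SparkSession):

```scala
// Cap how many Kafka offsets each micro-batch consumes so a stream that
// starts from "earliest" makes visible, bounded progress per trigger.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", "10000")
  .load()
```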

Re: Dataset Type safety

2017-01-10 Thread Michael Armbrust
> > As I've specified *.as[Person]* which does schema inferance then > *"option("inferSchema","true")" *is redundant and not needed! The resolution of fields is done by name, not by position for case classes. This is what allows us to support more complex things like JSON or nested structures.

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread Michael Armbrust
We don't support this yet, but I've opened this JIRA as it sounds generally useful: https://issues.apache.org/jira/browse/SPARK-19031 In the mean time you could try implementing your own Source, but that is pretty low level and is not yet a stable API. On Thu, Dec 29, 2016 at 4:05 AM, "Yuanzhe

Re: How to get recent value in spark dataframe

2016-12-16 Thread Michael Armbrust
Oh and to get the null for missing years, you'd need to do an outer join with a table containing all of the years you are interested in. On Fri, Dec 16, 2016 at 3:24 PM, Michael Armbrust <mich...@databricks.com> wrote: > Are you looking for argmax? Here is an example > <https://

Re: How to get recent value in spark dataframe

2016-12-16 Thread Michael Armbrust
Are you looking for argmax? Here is an example. On Wed, Dec 14, 2016 at 8:49 PM, Milin korath wrote: > Hi

Re: Dataset encoders for further types?

2016-12-15 Thread Michael Armbrust
I would have sworn there was a ticket, but I can't find it. So here you go: https://issues.apache.org/jira/browse/SPARK-18891 A workaround until that is fixed would be for you to manually specify the kryo encoder
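A minimal sketch of that Kryo-encoder workaround; `LegacyRecord` is a made-up class standing in for whatever type lacks a built-in encoder:

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Arbitrary (non-case) classes have no built-in encoder, so fall back to Kryo
// by placing a Kryo-based Encoder in implicit scope before creating the Dataset.
class LegacyRecord(val id: Int, val payload: Array[Byte])
implicit val legacyRecordEncoder: Encoder[LegacyRecord] = Encoders.kryo[LegacyRecord]

// val ds = spark.createDataset(Seq(new LegacyRecord(1, Array[Byte](1, 2, 3))))
```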

Re: When will multiple aggregations be supported in Structured Streaming?

2016-12-15 Thread Michael Armbrust
What is your use case? On Thu, Dec 15, 2016 at 10:43 AM, ljwagerfield wrote: > The current version of Spark (2.0.2) only supports one aggregation per > structured stream (and will throw an exception if multiple aggregations are > applied). > > Roughly when will

Re: [DataFrames] map function - 2.0

2016-12-15 Thread Michael Armbrust
Experimental in Spark really just means that we are not promising binary compatibly for those functions in the 2.x release line. For Datasets in particular, we want a few releases to make sure the APIs don't have any major gaps before removing the experimental tag. On Thu, Dec 15, 2016 at 1:17

Re: Cached Tables SQL Performance Worse than Uncached

2016-12-15 Thread Michael Armbrust
Its hard to comment on performance without seeing query plans. I'd suggest posting the result of an explain. On Thu, Dec 15, 2016 at 2:14 PM, Warren Kim wrote: > Playing with TPC-H and comparing performance between cached (serialized > in-memory tables) and

Re: [Spark-SQL] collect_list() support for nested collection

2016-12-13 Thread Michael Armbrust
Yes https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/4464261896877850/2840265927289860/latest.html On Tue, Dec 13, 2016 at 10:43 AM, Ninad Shringarpure wrote: > > Hi Team, > > Does Spark 2.0 support

Re: When will Structured Streaming support stream-to-stream joins?

2016-12-08 Thread Michael Armbrust
I would guess Spark 2.3, but maybe sooner maybe later depending on demand. I created https://issues.apache.org/jira/browse/SPARK-18791 so people can describe their requirements / stay informed. On Thu, Dec 8, 2016 at 11:16 AM, ljwagerfield wrote: > Hi there, > >

Re: few basic questions on structured streaming

2016-12-08 Thread Michael Armbrust
> > 1. what happens if an event arrives few days late? Looks like we have an > unbound table with sorted time intervals as keys but I assume spark doesn't > keep several days worth of data in memory but rather it would checkpoint > parts of the unbound table to a storage at a specified interval

Re: get corrupted rows using columnNameOfCorruptRecord

2016-12-06 Thread Michael Armbrust
.where("xxx IS NOT NULL") will give you the rows that couldn't be parsed. On Tue, Dec 6, 2016 at 6:31 AM, Yehuda Finkelstein < yeh...@veracity-group.com> wrote: > Hi all > > > > I’m trying to parse json using existing schema and got rows with NULL’s > > //get schema > > val df_schema =

Re: Writing DataFrame filter results to separate files

2016-12-05 Thread Michael Armbrust
> > 1. In my case, I'd need to first explode my data by ~12x to assign each > record to multiple 12-month rolling output windows. I'm not sure Spark SQL > would be able to optimize this away, combining it with the output writing > to do it incrementally. > You are right, but I wouldn't worry

Re: Writing DataFrame filter results to separate files

2016-12-05 Thread Michael Armbrust
If you repartition($"column") and then do .write.partitionBy("column") you should end up with a single file for each value of the partition column. On Mon, Dec 5, 2016 at 10:59 AM, Everett Anderson wrote: > Hi, > > I have a DataFrame of records with dates, and I'd like
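A minimal sketch of that combination (column name and output path are assumptions; `df` and `spark.implicits._` assumed in scope):

```scala
// Repartition by the same column first so each partition directory is written
// by a single task, yielding one file per distinct value of the column.
df.repartition($"event_date")
  .write
  .partitionBy("event_date")
  .parquet("/output/records-by-date")
```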

Re: [structured streaming] How to remove outdated data when use Window Operations

2016-12-01 Thread Michael Armbrust
Yes ! On Thu, Dec 1, 2016 at 12:57 PM, ayan guha wrote: > Thanks TD. Will it be available in pyspark too? > On 1 Dec 2016 19:55, "Tathagata Das" wrote: > >> In

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-28 Thread Michael Armbrust
or us but the code below doesn't require me to pass > schema at all. > > import org.apache.spark.sql._ > val rdd = df2.rdd.map { case Row(j: String) => j } > spark.read.json(rdd).show() > > > On Tue, Nov 22, 2016 at 2:42 PM, Michael Armbrust <mich...@databricks

Re: Any equivalent method lateral and explore

2016-11-22 Thread Michael Armbrust
Both collect_list and explode are available in the function library. The following is an example of using it: df.select($"*", explode($"myArray") as 'arrayItem) On Tue, Nov 22, 2016 at 2:42 PM, Mahender Sarangam

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-22 Thread Michael Armbrust
ently. Any idea on > when 2.1 will be released? > > Thanks, > kant > > On Mon, Nov 21, 2016 at 5:12 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> In Spark 2.1 we've added a from_json >> <https://github.com/apache/spark/blob/master/sql/co

Re: How do I persist the data after I process the data with Structured streaming...

2016-11-22 Thread Michael Armbrust
We are looking to add a native JDBC sink in Spark 2.2. Until then you can write your own connector using df.writeStream.foreach. On Tue, Nov 22, 2016 at 12:55 PM, shyla deshpande wrote: > Hi, > > Structured streaming works great with Kafka source but I need to persist
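A rough sketch of such a hand-rolled connector via `df.writeStream.foreach`; the JDBC URL, table, and column layout are assumptions, not an API shipped with Spark:

```scala
import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{ForeachWriter, Row}

class JdbcSink(url: String) extends ForeachWriter[Row] {
  private var conn: Connection = _
  private var stmt: PreparedStatement = _

  // Called once per partition per batch; returning true means "process this partition".
  override def open(partitionId: Long, version: Long): Boolean = {
    conn = DriverManager.getConnection(url)
    stmt = conn.prepareStatement("INSERT INTO events (key, value) VALUES (?, ?)")
    true
  }

  override def process(row: Row): Unit = {
    stmt.setString(1, row.getAs[String]("key"))
    stmt.setString(2, row.getAs[String]("value"))
    stmt.executeUpdate()
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (stmt != null) stmt.close()
    if (conn != null) conn.close()
  }
}

// streamingDf.writeStream.foreach(new JdbcSink("jdbc:postgresql://dbhost/streams")).start()
```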

Re: How do I persist the data after I process the data with Structured streaming...

2016-11-22 Thread Michael Armbrust
Forgot the link: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach On Tue, Nov 22, 2016 at 2:40 PM, Michael Armbrust <mich...@databricks.com> wrote: > We are looking to add a native JDBC sink in Spark 2.2. Until then you can > wr

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-21 Thread Michael Armbrust
In Spark 2.1 we've added a from_json function that I think will do what you want. On Fri, Nov 18, 2016 at 2:29 AM, kant kodali wrote: > This seem to work > >
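A hedged sketch of using it (column name and schema are assumptions; `spark.implicits._` assumed imported):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Parse a string column holding JSON blobs into a struct, then flatten it.
val blobSchema = new StructType().add("user", StringType).add("score", IntegerType)

val flattened = df
  .select(from_json($"json_blob", blobSchema).as("parsed"))
  .select("parsed.*")
```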

Re: Stateful aggregations with Structured Streaming

2016-11-21 Thread Michael Armbrust
We are planning on adding mapWithState or something similar in a future release. In the mean time, standard Dataframe aggregations should work (count, sum, etc). If you are looking to do something custom, I'd suggest looking at Aggregators
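A small sketch of a custom Aggregator (the Purchase case class and the sum logic are illustrative assumptions):

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class Purchase(user: String, amount: Double)

// Typed aggregation: Purchase input, Double buffer, Double result.
object SumAmount extends Aggregator[Purchase, Double, Double] {
  def zero: Double = 0.0
  def reduce(acc: Double, p: Purchase): Double = acc + p.amount
  def merge(a: Double, b: Double): Double = a + b
  def finish(acc: Double): Double = acc
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// purchases.groupByKey(_.user).agg(SumAmount.toColumn).show()
```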

Re: Spark 2.0.2, Structured Streaming with kafka source... Unable to parse the value to Object..

2016-11-21 Thread Michael Armbrust
You could also do this with Datasets, which will probably be a little more efficient (since you are telling us you only care about one column) ds1.select($"value".as[Array[Byte]]).map(Student.parseFrom) On Thu, Nov 17, 2016 at 1:05 PM, shyla deshpande wrote: > Hello

Re: Create a Column expression from a String

2016-11-21 Thread Michael Armbrust
You are looking for org.apache.spark.sql.functions.expr() On Sat, Nov 19, 2016 at 6:12 PM, Stuart White wrote: > I'd like to allow for runtime-configured Column expressions in my > Spark SQL application. For example, if my application needs a 5-digit > zip code, but
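A minimal sketch (the expression string and column names are assumptions; `df` and `spark.implicits._` assumed in scope):

```scala
import org.apache.spark.sql.functions.expr

// Turn a runtime-configured string into a Column expression.
val configuredExpr = "format_string('%05d', zip)"
val formatted = df.select($"id", expr(configuredExpr).as("zip5"))
```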

Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-16 Thread Michael Armbrust
On Wed, Nov 16, 2016 at 2:49 AM, Hyukjin Kwon wrote: > Maybe it sounds like you are looking for from_json/to_json functions after > en/decoding properly. > Which are new built-in functions that will be released with Spark 2.1.

Re: type-safe join in the new DataSet API?

2016-11-10 Thread Michael Armbrust
You can groupByKey and then cogroup. On Thu, Nov 10, 2016 at 10:44 AM, Yang wrote: > the new DataSet API is supposed to provide type safety and type checks at > compile time https://spark.apache.org/docs/latest/structured- >
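A sketch of that groupByKey-then-cogroup pattern (case classes and field names are assumptions; `spark.implicits._` assumed imported):

```scala
case class User(id: Int, name: String)
case class Order(userId: Int, total: Double)

// A typed "join": group both Datasets by the same key, then combine the
// grouped iterators yourself. Materialize one side since iterators are single-pass.
val joined = users.groupByKey(_.id).cogroup(orders.groupByKey(_.userId)) {
  (id, userIter, orderIter) =>
    val orderList = orderIter.toSeq
    userIter.flatMap(u => orderList.map(o => (u.name, o.total)))
}
```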

Re: Aggregations on every column on dataframe causing StackOverflowError

2016-11-09 Thread Michael Armbrust
wse/SPARK-18388 > > On Wed, Nov 9, 2016 at 3:08 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> Which version of Spark? Does seem like a bug. >> >> On Wed, Nov 9, 2016 at 10:06 AM, Raviteja Lokineni < >> raviteja.lokin...@gmail.com> wrote: >

Re: Aggregations on every column on dataframe causing StackOverflowError

2016-11-09 Thread Michael Armbrust
Which version of Spark? Does seem like a bug. On Wed, Nov 9, 2016 at 10:06 AM, Raviteja Lokineni < raviteja.lokin...@gmail.com> wrote: > Does this stacktrace look like a bug guys? Definitely seems like one to me. > > Caused by: java.lang.StackOverflowError > at

Re: Upgrading to Spark 2.0.1 broke array in parquet DataFrame

2016-11-07 Thread Michael Armbrust
If you can reproduce the issue with Spark 2.0.2 I'd suggest opening a JIRA. On Fri, Nov 4, 2016 at 5:11 PM, Sam Goodwin wrote: > I have a table with a few columns, some of which are arrays. Since > upgrading from Spark 1.6 to Spark 2.0.1, the array fields are always

Re: NoSuchElementException

2016-11-07 Thread Michael Armbrust
What are you trying to do? It looks like you are mixing multiple SparkContexts together. On Fri, Nov 4, 2016 at 5:15 PM, Lev Tsentsiper wrote: > My code throws an exception when I am trying to create new DataSet from > within SteamWriter sink > > Simplified version

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-04 Thread Michael Armbrust
to some ordering is an >> operation that can be done efficiently in a single shuffle without first >> figuring out range boundaries. and it is needed for quite a few algos, >> including Window and lots of timeseries stuff. but it seems there is no way >> to express

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Michael Armbrust
s also a total sort under the hood, but its on >> (hashCode(key), secondarySortColumn) which is easier to distribute and >> therefore can be implemented more efficiently. >> >> >> >> >> >> On Thu, Nov 3, 2016 at 8:59 PM, Michael Armbrust <mich...@databricks.

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Michael Armbrust
> > It is still unclear to me why we should remember all these tricks (or add > lots of extra little functions) when this elegantly can be expressed in a > reduce operation with a simple one line lamba function. > I think you can do that too. KeyValueGroupedDataset has a reduceGroups function.
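A one-line-lambda sketch of that (the Reading case class and `readings` Dataset are assumptions):

```scala
case class Reading(sensor: String, ts: Long, value: Double)

// Keep the latest record per sensor with a plain reduce over each group;
// the result is a Dataset[(String, Reading)].
val latest = readings
  .groupByKey(_.sensor)
  .reduceGroups((a, b) => if (a.ts >= b.ts) a else b)
```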

Re: incomplete aggregation in a GROUP BY

2016-11-03 Thread Michael Armbrust
Sounds like a bug, if you can reproduce on 1.6.3 (currently being voted on), then please open a JIRA. On Thu, Nov 3, 2016 at 8:05 AM, Donald Matthews wrote: > While upgrading a program from Spark 1.5.2 to Spark 1.6.2, I've run into a > HiveContext GROUP BY that no longer

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Michael Armbrust
You are looking to perform an *argmax*, which you can do with a single aggregation. Here is an example. On Thu, Nov 3, 2016 at 4:53

Re: How to return a case class in map function?

2016-11-02 Thread Michael Armbrust
That's a bug. Which version of Spark are you running? Have you tried 2.0.2? On Wed, Nov 2, 2016 at 12:01 AM, 颜发才(Yan Facai) wrote: > Hi, all. > When I use a case class as return value in map function, spark always > raise a ClassCastException. > > I write an demo, like: > >

Re: error: Unable to find encoder for type stored in a Dataset. when trying to map through a DataFrame

2016-11-02 Thread Michael Armbrust
Spark doesn't know how to turn a Seq[Any] back into a row. You would need to create a case class or something where we can figure out the schema. What are you trying to do? If you don't care about specific fields and you just want to serialize the type you can use kryo: implicit val anyEncoder

Re: Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread Michael Armbrust
ael, > > > > Thanks for the reply. > > > > The following link says there is a open unresolved Jira for Structured > > streaming support for consuming from Kafka. > > > > https://issues.apache.org/jira/browse/SPARK-15406 > > > > Appreciate your

Re: Does Data pipeline using kafka and structured streaming work?

2016-11-01 Thread Michael Armbrust
I'm not aware of any open issues against the kafka source for structured streaming. On Tue, Nov 1, 2016 at 4:45 PM, shyla deshpande wrote: > I am building a data pipeline using Kafka, Spark streaming and Cassandra. > Wondering if the issues with Kafka source fixed in

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Michael Armbrust
registerTempTable is backed by an in-memory hash table that maps table name (a string) to a logical query plan. Fragments of that logical query plan may or may not be cached (but calling register alone will not result in any materialization of results). In Spark 2.0 we renamed this function to

Re: importing org.apache.spark.Logging class

2016-10-27 Thread Michael Armbrust
This was made internal to Spark. I'd suggest that you use slf4j directly. On Thu, Oct 27, 2016 at 2:42 PM, Reth RM wrote: > Updated spark to version 2.0.0 and have issue with importing > org.apache.spark.Logging > > Any suggested fix for this issue? >
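A minimal sketch of logging through slf4j directly (the class name is illustrative):

```scala
import org.slf4j.LoggerFactory

class IngestJob {
  // slf4j replaces the now-internal org.apache.spark.Logging trait.
  private val log = LoggerFactory.getLogger(classOf[IngestJob])

  def run(): Unit = log.info("starting ingest")
}
```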

Re: Reading AVRO from S3 - No parallelism

2016-10-27 Thread Michael Armbrust
How big are your avro files? We collapse many small files into a single partition to eliminate scheduler overhead. If you need explicit parallelism you can also repartition. On Thu, Oct 27, 2016 at 5:19 AM, Prithish wrote: > I am trying to read a bunch of AVRO files from a

Re: Dataframe schema...

2016-10-26 Thread Michael Armbrust
On Fri, Oct 21, 2016 at 8:40 PM, Koert Kuipers wrote: > This rather innocent looking optimization flag nullable has caused a lot > of bugs... Makes me wonder if we are better off without it > Yes... my most regretted design decision :( Please give thoughts here:

Re: [Spark 2.0.1] Error in generated code, possible regression?

2016-10-26 Thread Michael Armbrust
uidelines for > tracking down the context of this generated code? > > On Wed, Oct 26, 2016 at 3:42 PM Michael Armbrust <mich...@databricks.com> > wrote: > >> If you have a reproduction you can post for this, it would be great if >> you could open a JIRA. >> >&

Re: [Spark 2.0.1] Error in generated code, possible regression?

2016-10-26 Thread Michael Armbrust
If you have a reproduction you can post for this, it would be great if you could open a JIRA. On Mon, Oct 24, 2016 at 6:21 PM, Efe Selcuk wrote: > I have an application that works in 2.0.0 but has been dying at runtime on > the 2.0.1 distribution. > > at

Re: Resiliency with SparkStreaming - fileStream

2016-10-26 Thread Michael Armbrust
I'll answer in the context of structured streaming (the new streaming API build on DataFrames). When reading from files, the FileSource records which files are included in each batch inside the given checkpointLocation. If you fail in the middle of a batch, the streaming engine will retry

Re: LIMIT issue of SparkSQL

2016-10-24 Thread Michael Armbrust
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages

Re: LIMIT issue of SparkSQL

2016-10-23 Thread Michael Armbrust
- dev + user Can you give more info about the query? Maybe a full explain()? Are you using a datasource like JDBC? The API does not currently push down limits, but the documentation talks about how you can use a query instead of a table if that is what you are looking to do. On Mon, Oct 24,

Re: Dataframe schema...

2016-10-20 Thread Michael Armbrust
ains the mixed containsNull = true/false. > Let me know if this helps. > > Thanks, > Muthu > > > > On Wed, Oct 19, 2016 at 6:21 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> Nullable is just a hint to the optimizer that its impossible for there t

Re: How does Spark determine in-memory partition count when reading Parquet ~files?

2016-10-19 Thread Michael Armbrust
In spark 2.0 we bin-pack small files into a single task to avoid overloading the scheduler. If you want a specific number of partitions you should repartition. If you want to disable this optimization you can set the file open cost very high: spark.sql.files.openCostInBytes On Tue, Oct 18, 2016
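A hedged sketch of both options mentioned above (path, partition count, and the cost value are assumptions; `spark` is an existing SparkSession):

```scala
// Ask for an explicit number of partitions after the read...
val repartitioned = spark.read.parquet("/data/parquet").repartition(200)

// ...or price file opens so high that bin-packing of small files is effectively disabled.
spark.conf.set("spark.sql.files.openCostInBytes", 128L * 1024 * 1024)
```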

Re: Dataframe schema...

2016-10-19 Thread Michael Armbrust
Nullable is just a hint to the optimizer that it's impossible for there to be a null value in this column, so that it can avoid generating code for null-checks. When in doubt, we set nullable=true since it is always safer to check. Why in particular are you trying to change the nullability of the

Re: Questions about DataFrame's filter()

2016-09-29 Thread Michael Armbrust
-dev +user It surprises me as `filter()` takes a Column, not a `Row => Boolean`. There are several overloaded versions of Dataset.filter(...) def filter(func: FilterFunction[T]): Dataset[T] def filter(func: (T) ⇒ Boolean): Dataset[T] def filter(conditionExpr: String): Dataset[T] def

Re: Dataset doesn't have partitioner after a repartition on one of the columns

2016-09-28 Thread Michael Armbrust
Hi Darin, In SQL we have finer grained information about partitioning, so we don't use the RDD Partitioner. Here's a notebook that walks

Re: Spark 2.0 Structured Streaming: sc.parallelize in foreach sink cause Task not serializable error

2016-09-26 Thread Michael Armbrust
The code in ForeachWriter runs on the executors, which means that you are not allowed to use the SparkContext. This is probably why you are seeing that exception. On Sun, Sep 25, 2016 at 3:20 PM, Jianshi wrote: > Dear all: > > I am trying out the new released feature of

Re: udf forces usage of Row for complex types?

2016-09-26 Thread Michael Armbrust
I agree this should work. We just haven't finished killing the old reflection based conversion logic now that we have more powerful/efficient encoders. Please open a JIRA. On Sun, Sep 25, 2016 at 2:41 PM, Koert Kuipers wrote: > after having gotten used to have case classes

Re: Spark SQL - Applying transformation on a struct inside an array

2016-09-15 Thread Michael Armbrust
t; I'm currently trying to create a generic transformation mecanism on a >> Dataframe to modify an arbitrary column regardless of the underlying the >> schema. >> >> It's "relatively" straightforward for complex types like >> struct<struct<…>> to ap

Re: [SQL] Why does spark.read.csv.cache give me a WARN about cache but not text?!

2016-08-16 Thread Michael Armbrust
Try running explain on each of these. My guess would be that caching is broken in some cases. On Tue, Aug 16, 2016 at 6:05 PM, Jacek Laskowski wrote: > Hi, > > Can anyone explain why spark.read.csv("people.csv").cache.show ends up > with a WARN while

Re:

2016-08-14 Thread Michael Armbrust
skowski > > > On Sun, Aug 14, 2016 at 9:51 AM, Michael Armbrust > <mich...@databricks.com> wrote: > > Have you tried doing the join in two parts (id == 0 and id != 0) and then > > doing a union of the results? It is possible that with this technique, > that &g

Re:

2016-08-14 Thread Michael Armbrust
Have you tried doing the join in two parts (id == 0 and id != 0) and then doing a union of the results? It is possible that with this technique, that the join which only contains skewed data would be filtered enough to allow broadcasting of one side. On Sat, Aug 13, 2016 at 11:15 PM, Jestin Ma
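A sketch of that split-and-union technique (column name and the skewed key value are assumptions; `$` from `spark.implicits._`):

```scala
// Join the skewed key separately from everything else, then union the results;
// the non-skewed half may become small enough to broadcast one side.
val skewedPart = left.filter($"id" === 0).join(right.filter($"id" === 0), "id")
val restPart   = left.filter($"id" =!= 0).join(right.filter($"id" =!= 0), "id")
val combined   = skewedPart.union(restPart)
```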

Re: call a mysql stored procedure from spark

2016-08-14 Thread Michael Armbrust
As described here, you can use the DataSource API to connect to an external database using JDBC. While the dbtable option is usually just a table name, it can also be any valid SQL command that returns a
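A hedged sketch of passing a query through dbtable (connection details and the query are assumptions):

```scala
val result = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/sales")
  // Any SQL that returns a table works here, wrapped as a subquery alias.
  .option("dbtable", "(SELECT id, total FROM orders WHERE total > 100) AS t")
  .option("user", "reader")
  .option("password", "secret")
  .load()
```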

Re: Spark 2.0.0 JaninoRuntimeException

2016-08-14 Thread Michael Armbrust
Anytime you see JaninoRuntimeException you are seeing a bug in our code generation. If you can come up with a small example that causes the problem it would be very helpful if you could open a JIRA. On Fri, Aug 12, 2016 at 2:30 PM, dhruve ashar wrote: > I see a similar

Re: [SQL] Why does (0 to 9).toDF("num").as[String] work?

2016-08-14 Thread Michael Armbrust
There are two type systems in play here: Spark SQL's and Scala's. From the Scala side, this is type-safe. After calling as[String] the Dataset will only return Strings. It is impossible to ever get a class cast exception unless you do your own incorrect casting after the fact. Underneath the

Re: Does Spark SQL support indexes?

2016-08-14 Thread Michael Armbrust
Using df.write.partitionBy is similar to a coarse-grained, clustered index in a traditional database. You can't use it on temporary tables, but it will let you efficiently select small parts of a much larger table. On Sat, Aug 13, 2016 at 11:13 PM, Jörn Franke wrote: >

Re: How to set nullable field when create DataFrame using case class

2016-08-04 Thread Michael Armbrust
Nullable is an optimization for Spark SQL. It is telling spark to not even do an if check when accessing that field. In this case, your data *is* nullable, because timestamp is an object in java and you could put null there. On Thu, Aug 4, 2016 at 2:56 PM, luismattor

Re: error while running filter on dataframe

2016-07-31 Thread Michael Armbrust
You are hitting a bug in code generation. If you can come up with a small reproduction of the problem, it would be very helpful if you could open a JIRA. On Sun, Jul 31, 2016 at 9:14 AM, Tony Lane wrote: > Can someone help me understand this error which occurs while

Re: calling dataset.show on a custom object - displays toString() value as first column and blank for rest

2016-07-31 Thread Michael Armbrust
Can you share your code? This does not happen for me. On Sun, Jul 31, 2016 at 7:16 AM, Rohit Chaddha

Re: spark 2.0 readStream from a REST API

2016-07-31 Thread Michael Armbrust
You have to add a file in resource too (example). Either that or give a full class name. On Sun, Jul 31, 2016 at 9:45 AM, Ayoub Benali

Re: [Spark 2.0] Why MutableInt cannot be cast to MutableLong?

2016-07-31 Thread Michael Armbrust
Are you sure you are running Spark 2.0? In your stack trace I see SqlNewHadoopRDD, which was removed in #12354 . On Sun, Jul 31, 2016 at 2:12 AM, Chanh Le wrote: > Hi everyone, > Why *MutableInt* cannot be cast to *MutableLong?*

Re: libraryDependencies

2016-07-26 Thread Michael Armbrust
park > [error] import org.apache.spark.mllib.linalg.SingularValueDecomposition > [error] ^ > [error] > /Users/studio/.sbt/0.13/staging/42f93875138543b4e1d3/sparksample/src/main/scala/MyApp.scala:5: > object mllib is not a member of package org.apache.spark &

Re: libraryDependencies

2016-07-26 Thread Michael Armbrust
Also, you'll want all of the various spark versions to be the same. On Tue, Jul 26, 2016 at 12:34 PM, Michael Armbrust <mich...@databricks.com> wrote: > If you are using %% (double) then you do not need _2.11. > > On Tue, Jul 26, 2016 at 12:18 PM, Martin Somers <sono..
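A small sbt sketch of both points (the version number is illustrative): `%%` appends the Scala suffix for you, and one shared version keeps the Spark modules consistent.

```scala
val sparkVersion = "2.0.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % sparkVersion,
  "org.apache.spark" %% "spark-sql"   % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion
)
```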

Re: transtition SQLContext to SparkSession

2016-07-18 Thread Michael Armbrust
+ dev, reynold Yeah, thats a good point. I wonder if SparkSession.sqlContext should be public/deprecated? On Mon, Jul 18, 2016 at 8:37 AM, Koert Kuipers wrote: > in my codebase i would like to gradually transition to SparkSession, so > while i start using SparkSession i

Re: Saving Table with Special Characters in Columns

2016-07-11 Thread Michael Armbrust
This is protecting you from a limitation in parquet. The library will let you write out invalid files that can't be read back, so we added this check. You can call .format("csv") (in spark 2.0) to switch it to CSV. On Mon, Jul 11, 2016 at 11:16 AM, Tobi Bosede wrote: > Hi

Re: DataFrame Min By Column

2016-07-09 Thread Michael Armbrust
riented Data Scientist > UC Berkeley AMPLab Alumni > > pedrorodriguez.io | 909-353-4423 > github.com/EntilZha | LinkedIn > <https://www.linkedin.com/in/pedrorodriguezscience> > > On July 9, 2016 at 2:19:11 PM, Michael Armbrust (mich...@databricks.com) > wrote: > >

Re: DataFrame Min By Column

2016-07-09 Thread Michael Armbrust
You can do what's called an *argmax/argmin*, where you take the min/max of a couple of columns that have been grouped together as a struct. We sort in column order, so you can put the timestamp first. Here is an example
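A hedged sketch of that struct-based argmax (column names are assumptions; `$` from `spark.implicits._`):

```scala
import org.apache.spark.sql.functions.{max, struct}

// Structs compare field by field in order, so putting the timestamp first
// makes max() pick the latest row per key; the other fields ride along.
val latestPerKey = df
  .groupBy($"id")
  .agg(max(struct($"timestamp", $"value")).as("argmax"))
  .select($"id", $"argmax.timestamp", $"argmax.value")
```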

Re: Multiple aggregations over streaming dataframes

2016-07-07 Thread Michael Armbrust
We are planning to address this issue in the future. At a high level, we'll have to add a delta mode so that updates can be communicated from one operator to the next. On Thu, Jul 7, 2016 at 8:59 AM, Arnaud Bailly wrote: > Indeed. But nested aggregation does not work

Re: Logging trait in Spark 2.0

2016-06-28 Thread Michael Armbrust
I'd suggest using the slf4j APIs directly. They provide a nice stable API that works with a variety of logging backends. This is what Spark does internally. On Sun, Jun 26, 2016 at 4:02 AM, Paolo Patierno wrote: > Yes ... the same here ... I'd like to know the best way for

Re: cast only some columns

2016-06-21 Thread Michael Armbrust
Use `withColumn`. It will replace a column if you give it the same name. On Tue, Jun 21, 2016 at 4:16 AM, pseudo oduesp wrote: > Hi , > with fillna we can select some columns to perform replace some values > with chosing columns with dict > {columns :values } > but

Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-14 Thread Michael Armbrust
> > 1) What does this really mean to an Application developer? > It means there are less concepts to learn. > 2) Why this unification was needed in Spark 2.0? > To simplify the API and reduce the number of concepts that needed to be learned. We only didn't do it in 1.6 because we didn't want

Re: Is there a limit on the number of tasks in one job?

2016-06-13 Thread Michael Armbrust
You might try with the Spark 2.0 preview. We spent a bunch of time improving the handling of many small files. On Mon, Jun 13, 2016 at 11:19 AM, khaled.hammouda wrote: > I'm trying to use Spark SQL to load json data that are split across about > 70k > files across 24

Re: Spark Thrift Server in CDH 5.3

2016-06-13 Thread Michael Armbrust
I'd try asking on the cloudera forums. On Sun, Jun 12, 2016 at 9:51 PM, pooja mehta wrote: > Hi, > > How do I start Spark Thrift Server with cloudera CDH 5.3? > > Thanks. >

Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-13 Thread Michael Armbrust
Here's a talk I gave on the topic: https://www.youtube.com/watch?v=i7l3JQRx7Qw http://www.slideshare.net/SparkSummit/structuring-spark-dataframes-datasets-and-streaming-by-michael-armbrust On Mon, Jun 13, 2016 at 4:01 AM, Arun Patel <arunp.bigd...@gmail.com> wrote: > In Spark 2.0, D

Re: Spark 2.0 Streaming and Event Time

2016-06-09 Thread Michael Armbrust
There is no special setting for event time (though we will be adding one for setting a watermark in 2.1 to allow us to reduce the amount of state that needs to be kept around). Just window/groupBy on the column that is your event time. On Wed, Jun 8, 2016 at 4:12 PM, Chang Lim
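A minimal sketch of that (column names and the window size are assumptions; `events` is an existing DataFrame):

```scala
import org.apache.spark.sql.functions.window

// No special event-time switch: just group by a time window over the column
// that carries the event time.
val counts = events
  .groupBy(window($"eventTime", "10 minutes"), $"deviceId")
  .count()
```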

Re: Seq.toDF vs sc.parallelize.toDF = no Spark job vs one - why?

2016-06-09 Thread Michael Armbrust
Look at the explain(). For a Seq we know it's just local data, so we avoid Spark jobs for simple operations. In contrast, an RDD is opaque to catalyst so we can't perform that optimization. On Wed, Jun 8, 2016 at 7:49 AM, Jacek Laskowski wrote: > Hi, > > I just noticed it today

Re: Dataset Outer Join vs RDD Outer Join

2016-06-06 Thread Michael Armbrust
t, int)" > > The generated code is passing InternalRow objects into the ByteBuffer > > Starting from two Datasets of types Dataset[(Int, Int)] with expression > $"left._1" === $"right._1". I'll have to spend some time getting a better > understanding of

Re: Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Michael Armbrust
thing else, I guess Option doesn't have a first class Encoder or DataType > yet and maybe for good reasons. > > I did find the RDD join interface elegant, though. In the ideal world an > API comparable the following would be nice: > https://gist.github.com/rmarsch/3ea78b3a9a8a0e83ce162ed9

Re: Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Michael Armbrust
Thanks for the feedback. I think this will address at least some of the problems you are describing: https://github.com/apache/spark/pull/13425 On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher wrote: > Hi, > > I've been working on transitioning from RDD to Datasets in

Re: Map tuple to case class in Dataset

2016-06-01 Thread Michael Armbrust
t;> +---+ >>> >>> FYI >>> >>> On Tue, May 31, 2016 at 7:35 PM, Tim Gautier <tim.gaut...@gmail.com> >>> wrote: >>> >>>> 1.6.1 The exception is a null pointer exception. I'll paste the whole >>>> thing after I fire

Re: Map tuple to case class in Dataset

2016-05-31 Thread Michael Armbrust
Version of Spark? What is the exception? On Tue, May 31, 2016 at 4:17 PM, Tim Gautier wrote: > How should I go about mapping from say a Dataset[(Int,Int)] to a > Dataset[]? > > I tried to use a map, but it throws exceptions: > > case class Test(a: Int) >

Re: Undocumented left join constraint?

2016-05-27 Thread Michael Armbrust
Sounds like: https://issues.apache.org/jira/browse/SPARK-15441, for which a fix is in progress. Please do keep reporting issues though, these are great! Michael On Fri, May 27, 2016 at 1:01 PM, Tim Gautier wrote: > Is it truly impossible to left join a Dataset[T] on the

Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Michael Armbrust
You can also just make sure that each user is using their own directory. A rough example can be found in TestHive. Note: in Spark 2.0 there should be no need to use HiveContext unless you need to talk to a metastore. On Thu, May 26, 2016 at 1:36 PM, Mich Talebzadeh

Re: feedback on dataset api explode

2016-05-25 Thread Michael Armbrust
These APIs predate Datasets / encoders, so that is why they are Row instead of objects. We should probably rethink that. Honestly, I usually end up using the column expression version of explode now that it exists (i.e. explode($"arrayCol").as("Item")). It would be great to understand more why
