[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...

2018-12-01 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r238077135 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala --- @@ -171,15 +171,21 @@ private[csv] class

[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...

2018-12-01 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r238068538 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala --- @@ -171,15 +171,21 @@ private[csv] class

[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...

2018-11-29 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237716913 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala --- @@ -57,6 +57,9 @@ abstract class

[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...

2018-11-29 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237687091 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1987,6 +1987,18 @@ class CSVSuite extends

[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...

2018-11-29 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237663865 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1987,6 +1987,18 @@ class CSVSuite extends

[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-29 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/23052 it is pretty common for us to write empty dataframe to parquet and later read it back in same for writing to csv with header and reading it back in (with type inference disabled, we assume

[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...

2018-11-29 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237579324 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala --- @@ -57,6 +57,9 @@ abstract class

[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...

2018-11-28 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/23173 i was not aware of SPARK-15473. thanks. let me look at @HyukjinKwon pullreq and mark my jira as a duplicate

[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...

2018-11-28 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/23173 [SPARK-26208][SQL] add headers to empty csv files when header=true ## What changes were proposed in this pull request? Add headers to empty csv files when header=true, because
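The intended behavior can be sketched outside Spark (function names here are hypothetical, not Spark's API): when header=true, the writer emits the header line even for zero data rows, so a later read still recovers the column names.

```scala
// Hypothetical sketch (not Spark's CSV writer): with header=true the
// header line is written even when there are no data rows.
def writeCsv(columns: Seq[String], rows: Seq[Seq[String]], header: Boolean): String = {
  val headerLine = if (header) Seq(columns.mkString(",")) else Seq.empty[String]
  (headerLine ++ rows.map(_.mkString(","))).mkString("\n")
}

// Reading the column names back from the first line, if any.
def readCsvColumns(csv: String): Seq[String] =
  csv.linesIterator.toSeq.headOption.map(_.split(",").toSeq).getOrElse(Seq.empty)
```

under this sketch an empty dataframe written with header=true round-trips its column names: `writeCsv(Seq("a", "b"), Nil, header = true)` produces just the line `a,b`.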

[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...

2018-09-02 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21273 it would provide a workaround i think, yes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...

2018-09-01 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21273 @HyukjinKwon see https://github.com/apache/spark/pull/22312

[GitHub] spark pull request #22312: [SPARK-17916][SQL] Fix new behavior when quote is...

2018-09-01 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/22312 [SPARK-17916][SQL] Fix new behavior when quote is set and fix old behavior when quote is unset ## What changes were proposed in this pull request? 1) Set nullValue to quoted empty

[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...

2018-08-23 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21273 i would suggest at least that when the quote character is changed that the empty value should change accordingly. an empty value of ```""``` makes no sense if the quote

[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...

2018-08-20 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/22123#discussion_r211309642 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1603,6 +1603,39 @@ class CSVSuite extends

[GitHub] spark issue #21345: [SPARK-24159] [SS] Enable no-data micro batches for stre...

2018-08-20 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21345 we are testing spark 2.4 internally and had some unit tests break because of this change i believe. i am not suggesting this should be changed or undone, just wanted to point out

[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...

2018-08-19 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21273 @HyukjinKwon see the jira for the example code that reproduces the issue. let me know if you need anything else. best, koert

[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...

2018-08-19 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21273 to summarize my findings from jira: this breaks any usage without quoting. for example we remove all characters from our values that need to be quoted (delimiters, newlines) so we know we
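A minimal illustration (not Spark's actual parser) of the distinction at stake in this thread: an unquoted empty field reads as null, while a quoted empty field ("") reads as the empty string; data written entirely without quotes cannot tell the two apart.

```scala
// Hypothetical sketch of the distinction discussed in this thread:
// an unquoted empty field means null, a quoted empty field ("") means
// the empty string. Data written without quotes cannot distinguish them.
def parseField(raw: String, quote: Char = '"'): Option[String] = raw match {
  case "" => None  // unquoted empty -> null
  case s if s.length >= 2 && s.head == quote && s.last == quote =>
    Some(s.substring(1, s.length - 1))  // quoted -> literal contents, "" stays ""
  case s => Some(s)
}
```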

[GitHub] spark issue #22123: [SPARK-25134][SQL] Csv column pruning with checking of h...

2018-08-17 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/22123
```
Test Result (1 failure / +1)
org.apache.spark.sql.streaming.FlatMapGroupsWithStateSuite.flatMapGroupsWithState - streaming with processing time timeout - state format version 1
```

[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...

2018-08-16 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/22123#discussion_r210801081 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1603,6 +1603,44 @@ class CSVSuite extends

[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...

2018-08-16 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/22123 [SPARK-25134][SQL] Csv column pruning with checking of headers throws incorrect error ## What changes were proposed in this pull request? When column pruning is turned

[GitHub] spark issue #21296: [SPARK-24244][SQL] Passing only required columns to the ...

2018-08-15 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21296 if i do not select a schema (and i use inferSchema), and i do a select for only a few column, does this push down the column selection into the reading of data (for schema inference

[GitHub] spark issue #18714: [SPARK-20236][SQL] dynamic partition overwrite

2018-07-19 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/18714 @cloud-fan i created [SPARK-24860](https://issues.apache.org/jira/browse/SPARK-24860) for this

[GitHub] spark pull request #21818: [SPARK-24860][SQL] Support setting of partitionOv...

2018-07-19 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/21818 [SPARK-24860][SQL] Support setting of partitionOverWriteMode in output options for writing DataFrame ## What changes were proposed in this pull request? Besides spark setting
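The precedence this PR argues for can be sketched as plain option resolution (the function is hypothetical; only the conf key name mirrors Spark's): a per-write option wins over the session conf, which wins over the built-in default.

```scala
// Hypothetical sketch of per-write option precedence:
// write option > session conf > built-in default.
def resolveOverwriteMode(
    writeOptions: Map[String, String],
    sessionConf: Map[String, String],
    default: String = "STATIC"): String =
  writeOptions.get("partitionOverwriteMode")
    .orElse(sessionConf.get("spark.sql.sources.partitionOverwriteMode"))
    .getOrElse(default)
```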

[GitHub] spark issue #18714: [SPARK-20236][SQL] dynamic partition overwrite

2018-07-16 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/18714 @cloud-fan OK, that works just as well

[GitHub] spark issue #18714: [SPARK-20236][SQL] dynamic partition overwrite

2018-07-15 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/18714 should this be exposed per write instead of as a global variable? e.g. dataframe.write.csv.partitionMode(Dynamic).partitionBy(...).save

[GitHub] spark issue #609: SPARK-1691: Support quoted arguments inside of spark-submi...

2017-07-07 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/609 ```OPTS+=" --driver-java-options \"-Da=b -Dx=y\""``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.

[GitHub] spark issue #609: SPARK-1691: Support quoted arguments inside of spark-submi...

2017-07-06 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/609 @ganeshm25 it seems to work in newer spark versions. i havent tried in spark 1.4.2. however its still very tricky to get it right and i would prefer a simpler solution.

[GitHub] spark issue #17660: [SPARK-20359][SQL] catch NPE in EliminateOuterJoin optim...

2017-04-18 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/17660 @cloud-fan switching to lazy vals to avoid these predicates being evaluated when they are not used seems to work. so i think this is a better (more targeted) solution for now, and i removed
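The mechanics of that fix can be shown in isolation (the class and thunk here are illustrative, not Spark's code): a lazy val defers evaluation, so a predicate expression that would throw is harmless unless something actually forces it.

```scala
// Illustrative only: a strict val would evaluate `eval()` at construction
// time and could throw; `lazy` defers evaluation until first access.
class JoinCheck(eval: () => Boolean) {
  lazy val hasNonNullPredicate: Boolean = eval()
}

// Constructing with a throwing thunk is safe as long as the lazy val
// is never forced.
val unsafeIfForced = new JoinCheck(() => throw new NullPointerException("boom"))
```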

[GitHub] spark issue #17660: [SPARK-20359][SQL] catch NPE in EliminateOuterJoin optim...

2017-04-18 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/17660 I see. let me check if making leftHasNonNullPredicate and rightHasNonNullPredicate lazy solves it then

[GitHub] spark pull request #17660: [SPARK-20359][SQL] catch NPE in EliminateOuterJoi...

2017-04-17 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/17660#discussion_r111842598 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -124,7 +125,15 @@ case class EliminateOuterJoin(conf

[GitHub] spark pull request #17660: [SPARK-20359][SQL] catch NPE in EliminateOuterJoi...

2017-04-17 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/17660 [SPARK-20359][SQL] catch NPE in EliminateOuterJoin optimization catch NPE in EliminateOuterJoin and add test in DataFrameSuite to confirm NPE is no longer thrown ## What changes were

[GitHub] spark issue #17639: [SPARK-19716][SQL][follow-up] UnresolvedMapObjects shoul...

2017-04-14 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/17639 @cloud-fan thanks for doing this

[GitHub] spark pull request #16889: [SPARK-17668][SQL] Use Expressions for conversion...

2017-03-29 Thread koertkuipers
Github user koertkuipers closed the pull request at: https://github.com/apache/spark/pull/16889

[GitHub] spark issue #16889: [SPARK-17668][SQL] Use Expressions for conversions to/fr...

2017-03-29 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/16889 i am going to close this for now since i dont think this is an optimal solution

[GitHub] spark pull request #16889: [SPARK-17668][SQL] Use Expressions for conversion...

2017-02-10 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/16889 [SPARK-17668][SQL] Use Expressions for conversions to/from user types in UDFs ## What changes were proposed in this pull request? do not merge this is a first attempt at trying

[GitHub] spark issue #9565: [SPARK-11593][SQL] Replace catalyst converter with RowEnc...

2017-02-04 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/9565 i think this would be very helpful. the difference in behaviour of scala udfs and scala functions used in dataset transformations is a constant source of confusion for my users

[GitHub] spark issue #16479: [SPARK-19085][SQL] cleanup OutputWriterFactory and Outpu...

2017-01-22 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/16479 i will just copy the conversion code over for now thx

[GitHub] spark issue #16479: [SPARK-19085][SQL] cleanup OutputWriterFactory and Outpu...

2017-01-22 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/16479 how "internal" are these interfaces really? every time a change like this is made spark-avro breaks

[GitHub] spark issue #16143: [SPARK-18711][SQL] should disable subexpression eliminat...

2016-12-05 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/16143 thanks for getting this fixed so fast

[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...

2016-12-04 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15979 admittedly the result looks weird. it really should be:
```
+------+--------+
|   key|count(1)|
+------+--------+
|  null|       1|
| [1,1]|       1
```

[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...

2016-12-04 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15979 spark 2.0.x does not have mapValues. but this works:
```
scala> Seq(("a", Some((1, 1))), ("a", None)).toDS.groupByKey(_._2).count.show
+---++
```
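The same grouping can be reproduced with plain Scala collections, which shows why an Option-valued key is reasonable here:

```scala
// Plain-collections analogue of the Dataset snippet above: group by an
// Option key and count each group.
val counts: Map[Option[(Int, Int)], Int] =
  Seq(("a", Some((1, 1))), ("a", None))
    .groupBy(_._2)
    .map { case (k, vs) => k -> vs.size }
```

`counts` here ends up with one entry for the `Some((1, 1))` key and one for `None`.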

[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...

2016-12-04 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15979 Yes it worked before On Dec 4, 2016 02:33, "Wenchen Fan" <notificati...@github.com> wrote: > val x: Dataset[String, Option[(String, String)]] = ...

[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...

2016-12-03 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15979 this means anything that uses an encoder can no longer use Option[_ <: Product]. encoders are not just used for the top level Dataset creation. Dataset.groupByKey[K] requi

[GitHub] spark pull request #15979: [SPARK-18251][SQL] the type of Dataset can't be O...

2016-12-03 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/15979#discussion_r90770855 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala --- @@ -47,16 +47,26 @@ object ExpressionEncoder

[GitHub] spark pull request #15979: [SPARK-18251][SQL] the type of Dataset can't be O...

2016-12-03 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/15979#discussion_r90770824 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala --- @@ -47,16 +47,26 @@ object ExpressionEncoder

[GitHub] spark issue #15918: [SPARK-18122][SQL][WIP]Fallback to Kryo for unsupported ...

2016-12-03 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15918 It can be done with shapeless (which perhaps uses macros under hood, I don't know).

[GitHub] spark issue #15918: [SPARK-18122][SQL][WIP]Fallback to Kryo for unsupported ...

2016-11-30 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15918 if we do a flag i would also prefer it if the current implicits are more narrow if the flag is not set, if possible.

[GitHub] spark issue #15918: [SPARK-18122][SQL][WIP]Fallback to Kryo for unsupported ...

2016-11-27 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15918 @srowen and @rxin what is the default behavior that is changed here? i see a current situation where an implicit encoder is provided that simply cannot handle the task at hand and this leads

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-25 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 if they chain like that then i think i know how to do the optimization. but do they? look for example at dataset.groupByKey(...).mapValues(...) Dataset[T].groupByKey[K] uses
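The optimization under discussion can be sketched with a simplified stand-in for KeyValueGroupedDataset (names and types here are illustrative, not Spark's): chained mapValues calls fuse into one composed value function rather than two materialized passes.

```scala
// Simplified stand-in for a grouped dataset that carries its value
// function; chained mapValues calls compose rather than re-materialize.
final case class Grouped[K, V, U](data: Seq[(K, V)], valueFunc: V => U) {
  def mapValues[W](f: U => W): Grouped[K, V, W] =
    Grouped(data, f.compose(valueFunc))  // fuse into a single function
  def result: Map[K, Seq[U]] =
    data.groupBy(_._1).map { case (k, vs) => k -> vs.map(p => valueFunc(p._2)) }
}

def grouped[K, V](data: Seq[(K, V)]): Grouped[K, V, V] =
  Grouped(data, (v: V) => v)
```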

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-24 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 @cloud-fan that makes sense to me, but its definitely not a quick win to create that optimization. let me think about it some more

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-21 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 @cloud-fan i can try to optimize ```grouped.mapValues(...).mapValues(...)``` but its a bit of an anti-pattern (there should be no need to do mapValues twice) so i dont think there is much

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-20 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 @rxin i can give it a try (the optimizer rule)

[GitHub] spark pull request #15382: [SPARK-17810] [SQL] Default spark.sql.warehouse.d...

2016-10-18 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/15382#discussion_r83921525 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -741,7 +741,7 @@ private[sql] class SQLConf extends Serializable

[GitHub] spark issue #15382: [SPARK-17810] [SQL] Default spark.sql.warehouse.dir is r...

2016-10-07 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15382 i don't think there is such a thing as a HDFS working directory, but that probably means it just uses the home dir on hdfs (/user/) for any relative paths

[GitHub] spark issue #15382: [SPARK-17810] [SQL] Default spark.sql.warehouse.dir is r...

2016-10-06 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15382 i think working dir makes more sense than home dir. but could this catch people by surprise because we now expect write permission in the working dir?

[GitHub] spark pull request #13868: [SPARK-15899] [SQL] Fix the construction of the f...

2016-10-06 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13868#discussion_r82216818 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -55,7 +56,7 @@ object SQLConf { val WAREHOUSE_PATH

[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-09-21 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 @cloud-fan i thought about this a little more, and my suggested changes to the Aggregator api does not allow one to use a different encoder when applying a typed operation on Dataset. so i do

[GitHub] spark pull request #14576: [SPARK-16391][SQL] Support partial aggregation fo...

2016-08-17 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/14576#discussion_r75186632 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala --- @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #14576: [SPARK-16391][SQL] Support partial aggregation fo...

2016-08-17 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/14576#discussion_r75152186 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala --- @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #14576: [SPARK-16391][SQL] ReduceAggregator and partial a...

2016-08-10 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/14576#discussion_r74361702 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #14576: [SPARK-16391][SQL] ReduceAggregator and partial a...

2016-08-10 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/14576#discussion_r74316735 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #14576: [SPARK-16391][SQL] ReduceAggregator and partial a...

2016-08-10 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/14576#discussion_r74314375 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request #14576: [SPARK-16391][SQL] ReduceAggregator and partial a...

2016-08-10 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/14576#discussion_r74313912 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache

[GitHub] spark issue #14222: [SPARK-16391][SQL] KeyValueGroupedDataset.reduceGroups s...

2016-07-17 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/14222 there is a usefulness to this `ReduceAggregator` beyond `.reduceGroups`. basically you can take any Aggregator without a zero and turn it into a valid Aggregator, with the caveat being
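That construction can be sketched without Spark (a simplified analogue of the ReduceAggregator idea, not its actual code): wrapping the buffer in Option supplies the missing zero (None), with the caveat that an empty group has nothing to return.

```scala
// Simplified analogue of the ReduceAggregator idea: Option supplies the
// missing zero (None); finish fails on an empty group, which is the caveat.
final case class ReduceAgg[T](reduce: (T, T) => T) {
  val zero: Option[T] = None
  def add(buf: Option[T], x: T): Option[T] = buf match {
    case None    => Some(x)
    case Some(a) => Some(reduce(a, x))
  }
  def merge(a: Option[T], b: Option[T]): Option[T] =
    (a.toSeq ++ b.toSeq).reduceOption(reduce)
  def finish(buf: Option[T]): T =
    buf.getOrElse(throw new NoSuchElementException("empty group"))
}
```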

[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...

2016-07-15 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13526#discussion_r71042267 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala --- @@ -312,6 +312,17 @@ class DatasetSuite extends QueryTest

[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...

2016-07-15 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13526#discussion_r71041725 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala --- @@ -65,6 +65,46 @@ class KeyValueGroupedDataset[K, V] private

[GitHub] spark pull request #13532: [SPARK-15204][SQL] improve nullability inference ...

2016-07-03 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13532#discussion_r69397207 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala --- @@ -305,4 +305,13 @@ class DatasetAggregatorSuite extends

[GitHub] spark issue #13933: [SPARK-16236] [SQL] Add Path Option back to Load API in ...

2016-06-30 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13933 For parquet, json etc. path not being put in options is not an issue since they don't retrieve it from the options

[GitHub] spark pull request #13727: [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harm...

2016-06-27 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13727#discussion_r68672691 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -135,7 +129,7 @@ class DataFrameReader private[sql](sparkSession

[GitHub] spark pull request #13727: [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harm...

2016-06-27 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13727#discussion_r68645998 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -135,7 +129,7 @@ class DataFrameReader private[sql](sparkSession

[GitHub] spark pull request #13727: [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harm...

2016-06-27 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13727#discussion_r68624316 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -135,7 +129,7 @@ class DataFrameReader private[sql](sparkSession

[GitHub] spark issue #8416: [SPARK-10185] [SQL] Feat sql comma separated paths

2016-06-11 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/8416 this patch should not have broken reading files that include comma. i also added unit test for this: https://github.com/apache/spark/pull/8416/files#diff

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-07 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 could we "rewind"/undo the append for the key and change it to a map that inserts new values and key? so remove one append and replace it with another operation?

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-07 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 the tricky part with that is that (ds: Dataset[(K, V)]).groupBy(_._1).mapValues(_._2) should return a KeyValueGroupedDataset[K, V]

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-07 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526
```
scala> val x = Seq(("a", 1), ("b", 2)).toDS
x: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]

scala> x.groupByKey(_._1).ma
```

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-07 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 ok i will study the physical plans for both and try to understand why one would be slower

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 can you explain a bit what is inefficient and would need an optimizer rule? is it mapValues being called twice? once for the key and then for the new values? thanks!

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 see this conversation: https://mail-archives.apache.org/mod_mbox/spark-user/201602.mbox/%3ccaaswr-7kqfmxd_cpr-_wdygafh+rarecm9olm5jkxfk14fc...@mail.gmail.com%3E mapGroups

[GitHub] spark pull request #13532: [SPARK-15204][SQL] improve nullability inference ...

2016-06-06 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13532#discussion_r65986613 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala --- @@ -51,7 +52,8 @@ object

[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...

2016-06-06 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13526#discussion_r65972115 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala --- @@ -65,6 +65,44 @@ class KeyValueGroupedDataset[K, V] private

[GitHub] spark pull request #13532: [SPARK-15204][SQL] improve nullability inference ...

2016-06-06 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/13532 [SPARK-15204][SQL] improve nullability inference for Aggregator ## What changes were proposed in this pull request? TypedAggregateExpression sets nullable based on the schema

[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-06 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 for example with this branch you can do:
```
val df3 = Seq(("a", "x", 1), ("a", "y", 3), ("b", "x", 3)).toDF("i"
```

[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-06 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 well that was sort of what i was trying to achieve. the unit tests i added were for using Aggregator for untyped grouping(```groupBy```). and i think for it to be useful within

[GitHub] spark pull request #13512: [SPARK-15769][SQL] Add Encoder for input type to ...

2016-06-06 Thread koertkuipers
Github user koertkuipers closed the pull request at: https://github.com/apache/spark/pull/13512 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature

[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-06 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 If Aggregator is designed for typed Dataset only then that is a bit of a shame, because it's an elegant and generic API that should be useful for DataFrame too. this causes fragmentation

[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...

2016-06-06 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/13526 [SPARK-15780][SQL] Support mapValues on KeyValueGroupedDataset ## What changes were proposed in this pull request? Add mapValues to KeyValueGroupedDataset ## How
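The `mapValues` addition proposed in this PR can be sketched roughly as follows. This is a minimal illustration against the Spark Dataset API, not the PR's actual implementation; the data and column names are made up for the example:

```scala
import org.apache.spark.sql.SparkSession

object MapValuesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("mapValues-sketch").getOrCreate()
    import spark.implicits._

    val ds = Seq(("a", 1), ("a", 3), ("b", 3)).toDS()

    // Group by key, then project the values before aggregating.
    // With mapValues, reduceGroups operates directly on the mapped
    // value type (Int) instead of the full (String, Int) tuple.
    val summed = ds.groupByKey(_._1)
      .mapValues(_._2)          // KeyValueGroupedDataset[String, Int]
      .reduceGroups(_ + _)

    summed.show()
    spark.stop()
  }
}
```

The design point is ergonomic: without `mapValues`, every downstream typed aggregation has to re-extract the value from the grouping tuple.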

[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-05 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 @cloud-fan i am running into some trouble updating my branch to the latest master. i get errors in tests due to Analyzer.validateTopLevelTupleFields the issue seems

[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-05 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 @cloud-fan from the (added) unit tests: ``` val df2 = Seq("a" -> 1, "a" -> 3, "b" -> 3).toDF("i", "j") checkAnswer(df2.grou

[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-04 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 **[Test build #5 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/5/consoleFull)** for PR 13512 at commit [`077f782`](https://github.com/apache/spark

[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-04 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/5/ Test FAILed

[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-04 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 Build finished. Test FAILed.

[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-04 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 **[Test build #5 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/5/consoleFull)** for PR 13512 at commit [`077f782`](https://github.com/apache/spark

[GitHub] spark pull request #13512: [SPARK-15769][SQL] Add Encoder for input type to ...

2016-06-04 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/13512 [SPARK-15769][SQL] Add Encoder for input type to Aggregator ## What changes were proposed in this pull request? Aggregator also has an Encoder for the input type ## How
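For context, a minimal typed `Aggregator` in Spark looks like the sketch below; the PR's proposed addition of an Encoder for the input type (to make Aggregator usable from untyped `groupBy` as well) is not shown here. The names `SumAgg` and the sample data are illustrative:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// A simple typed Aggregator summing Ints; buffer and output are Long.
object SumAgg extends Aggregator[Int, Long, Long] {
  def zero: Long = 0L
  def reduce(b: Long, a: Int): Long = b + a
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(b: Long): Long = b
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

object AggregatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Typed usage: the Aggregator becomes a Column via toColumn.
    val ds = Seq(1, 2, 3).toDS()
    ds.select(SumAgg.toColumn).show()

    spark.stop()
  }
}
```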

[GitHub] spark pull request: SPARK-14139 Dataset loses nullability in opera...

2016-05-19 Thread koertkuipers
Github user koertkuipers closed the pull request at: https://github.com/apache/spark/pull/11980

[GitHub] spark pull request: [SPARK-15204][SQL] Nullable is not correct for...

2016-05-09 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/13012#issuecomment-218053856 blackbox transformations infer nullable=false when you return a primitive. for example: ``` scala> sc.parallelize(List(1,2,3)).toDS.map(i => i * 2).
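The behavior described in this comment can be reproduced with a short snippet. This is a sketch assuming a local SparkSession; the point is that mapping to a JVM primitive lets Spark infer `nullable = false`, since a primitive can never be null:

```scala
import org.apache.spark.sql.SparkSession

object NullabilitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // The map closure is a black box to the optimizer, but its return
    // type (Int, a primitive) is known, so the resulting column is
    // inferred as non-nullable.
    val ds = spark.sparkContext.parallelize(List(1, 2, 3)).toDS().map(i => i * 2)
    ds.printSchema()

    spark.stop()
  }
}
```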

[GitHub] spark pull request: [SPARK-15097][SQL] make Dataset.sqlContext a s...

2016-05-03 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/12877#issuecomment-216678197 yup needs to be transient, will fix On Tue, May 3, 2016 at 5:58 PM, andrewor14 <notificati...@github.com> wrote: > I thin

[GitHub] spark pull request: [SPARK-15097][SQL] make Dataset.sqlContext a s...

2016-05-03 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/12877#issuecomment-216675245 if a SparkSession sits inside a Dataset does that mean _wrapped is always already initialized (because you cannot have a Dataset without a SparkContext

[GitHub] spark pull request: [SPARK-15097][SQL] make Dataset.sqlContext a s...

2016-05-03 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/12877#issuecomment-216670925 i made it lazy val since SparkSession.wrapped is effectively lazy too: protected[sql] def wrapped: SQLContext = { if (_wrapped == null

[GitHub] spark pull request: [SPARK-15097][SQL] make Dataset.sqlContext a s...

2016-05-03 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/12877#issuecomment-216670423 oh since since sparkSession is just a normal val i guess it can also be On Tue, May 3, 2016 at 5:25 PM, andrewor14 <notificati...@github.com>

[GitHub] spark pull request: [SPARK-15097][SQL] make Dataset.sqlContext a s...

2016-05-03 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/12877 [SPARK-15097][SQL] make Dataset.sqlContext a stable identifier for imports ## What changes were proposed in this pull request? Make Dataset.sqlContext a lazy val so that it's a stable
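The "stable identifier" issue this PR addresses is a Scala language rule: you can only `import` from a stable path (an object or a `val`), not from a `def`. The Spark-free sketch below illustrates the rule; the class names are made up for the example:

```scala
// A stand-in for something like SQLContext with an implicits member.
class Context { object implicits { implicit val tag: String = "x" } }

class Wrapper {
  // A (lazy) val is a stable identifier: imports from it compile.
  lazy val context: Context = new Context
  // A def is NOT stable: `import w.contextDef.implicits._` would fail
  // to compile with "stable identifier required".
  def contextDef: Context = new Context
}

object StableIdDemo extends App {
  val w = new Wrapper
  import w.context.implicits._ // fine: `context` is a val
  println(implicitly[String])
}
```

This is why making `Dataset.sqlContext` a `lazy val` (rather than a `def`) lets user code write `import someDataset.sqlContext.implicits._`.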
