[GitHub] spark pull request #22976: [SPARK-25974][SQL]Optimizes Generates bytecode fo...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22976#discussion_r232552336

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala ---
@@ -68,62 +68,55 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalRow]] {
     genComparisons(ctx, ordering)
   }
+  /**
+   * Creates the variables for ordering based on the given order.
+   */
+  private def createOrderKeys(
+      ctx: CodegenContext,
--- End diff --

4-space indentation

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22955: [SPARK-25949][SQL] Add test for PullOutPythonUDFInJoinCo...
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22955

thanks, merging to master!
[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22938#discussion_r232550860

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -1813,6 +1817,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       val path = dir.getCanonicalPath
       primitiveFieldAndType
         .toDF("value")
+        .repartition(1)
--- End diff --

why is the `repartition` required?
[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22938#discussion_r232550733

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -1115,6 +1115,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       Row(null, null, null),
       Row(null, null, null),
       Row(null, null, null),
+      Row(null, null, null),
--- End diff --

So for the JSON data source, the previous behavior was that we would skip the row even if it's in PERMISSIVE mode. Shall we clearly mention this in the migration guide?
[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22938#discussion_r232550502

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -550,15 +550,23 @@ case class JsonToStructs(
       s"Input schema ${nullableSchema.catalogString} must be a struct, an array or a map.")
   }

-  // This converts parsed rows to the desired output by the given schema.
   @transient
-  lazy val converter = nullableSchema match {
-    case _: StructType =>
-      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
-    case _: ArrayType =>
-      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getArray(0) else null
-    case _: MapType =>
-      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) else null
+  private lazy val castRow = nullableSchema match {
+    case _: StructType => (row: InternalRow) => row
+    case _: ArrayType => (row: InternalRow) => row.getArray(0)
+    case _: MapType => (row: InternalRow) => row.getMap(0)
+  }
+
+  // This converts parsed rows to the desired output by the given schema.
+  private def convertRow(rows: Iterator[InternalRow]) = {
+    if (rows.hasNext) {
+      val result = rows.next()
+      // JSON's parser produces one record only.
+      assert(!rows.hasNext)
+      castRow(result)
+    } else {
+      throw new IllegalArgumentException("Expected one row from JSON parser.")
--- End diff --

This can only happen when we have a bug, right?
[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22938#discussion_r232550186

--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -15,6 +15,8 @@ displayTitle: Spark SQL Upgrading Guide
   - Since Spark 3.0, the `from_json` functions supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions but Spark 3.0 converts it to `Row(null)`.
+  - In Spark version 2.4 and earlier, JSON data source and the `from_json` function produced `null`s if there is no valid root JSON token in its input (` ` for example). Since Spark 3.0, such input is treated as a bad record and handled according to specified mode. For example, in the `PERMISSIVE` mode the ` ` input is converted to `Row(null, null)` if specified schema is `key STRING, value INT`.
--- End diff --

> In Spark version 2.4 and earlier, JSON data source and the `from_json` function produced `null`s

Shall we update this? According to what you said, the JSON data source can't produce null.
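The mode semantics described in this migration note can be sketched without Spark. The following is an illustrative stand-in only: `fromJson`, `Permissive`, and `FailFast` are names invented for this sketch, real JSON tokenization is elided, and the row for schema `key STRING, value INT` is modeled as a plain `Seq`:

```scala
// Illustrative sketch (not Spark's parser) of the Spark 3.0 behavior described
// above: input with no valid root JSON token is a bad record, handled per mode.
sealed trait Mode
case object Permissive extends Mode
case object FailFast extends Mode

// Schema `key STRING, value INT` modeled as a row of two nullable fields.
def fromJson(input: String, mode: Mode): Seq[Any] = {
  val hasRootToken = input.trim.nonEmpty   // crude stand-in for real tokenization
  if (hasRootToken) Seq("parsed")          // real parsing elided in this sketch
  else mode match {
    case Permissive => Seq(null, null)     // bad record -> Row(null, null)
    case FailFast   => throw new RuntimeException("Malformed record in FAILFAST mode")
  }
}
```

A blank input (` `) yields `Seq(null, null)` in PERMISSIVE mode and throws in FAILFAST mode, matching the migration-guide wording.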
[GitHub] spark issue #22998: [SPARK-26001][SQL]Reduce memory copy when writing decima...
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22998

I think this is wrong. We have to zero out the bytes even when writing a null decimal, so that two unsafe rows with the same values (including null values) are exactly the same in binary format.
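The point about binary equality can be illustrated without Spark. Below is a toy stand-in (not Spark's `UnsafeRowWriter`; `writeRow` and its layout are invented for this sketch) showing how skipping the bytes of a null field in a reused buffer leaves stale data behind, so two logically equal rows stop being byte-identical:

```scala
// Toy "row buffer": 8 bytes per Long field, written into a caller-supplied
// (possibly dirty, reused) array. Real UnsafeRow also keeps a null bitset.
import java.util.Arrays

def writeRow(buf: Array[Byte], values: Seq[Option[Long]], zeroOutNulls: Boolean): Array[Byte] = {
  val bb = java.nio.ByteBuffer.wrap(buf)
  values.foreach {
    case Some(v) => bb.putLong(v)
    case None =>
      if (zeroOutNulls) bb.putLong(0L)           // write zeros for the null field
      else bb.position(bb.position() + 8)        // skip: stale bytes stay behind
  }
  buf.clone()
}

val dirty = Array.fill[Byte](16)(0x7f.toByte)    // buffer left over from a previous row
val clean = Array.fill[Byte](16)(0x00.toByte)

// The same logical row (42, null) written into differently-dirty buffers:
val a = writeRow(dirty, Seq(Some(42L), None), zeroOutNulls = false)
val b = writeRow(clean, Seq(Some(42L), None), zeroOutNulls = false)
// a and b differ in binary form despite holding equal values.

val c = writeRow(dirty.clone(), Seq(Some(42L), None), zeroOutNulls = true)
val d = writeRow(clean.clone(), Seq(Some(42L), None), zeroOutNulls = true)
// With zeroing, c and d are byte-identical again.
```

This is why skipping the memory copy for a null decimal would break the invariant the comment describes.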
[GitHub] spark issue #22990: [SPARK-25988] [SQL] Keep names unchanged when deduplicat...
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22990

retest this please
[GitHub] spark issue #22990: [SPARK-25988] [SQL] Keep names unchanged when deduplicat...
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22990

good catch! LGTM
[GitHub] spark pull request #22990: [SPARK-25988] [SQL] Keep names unchanged when ded...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22990#discussion_r232148751

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -2856,6 +2856,59 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
     checkAnswer(sql("select 26393499451 / (1e6 * 1000)"), Row(BigDecimal("26.393499451")))
   }
 }
+
+  test("self join with aliases on partitioned tables #1") {
--- End diff --

let's put the JIRA ticket number in the test name
[GitHub] spark pull request #22990: [SPARK-25988] [SQL] Keep names unchanged when ded...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22990#discussion_r232148583

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -2856,6 +2856,59 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
     checkAnswer(sql("select 26393499451 / (1e6 * 1000)"), Row(BigDecimal("26.393499451")))
   }
 }
+
+  test("self join with aliases on partitioned tables #1") {
+    withTempView("tmpView1", "tmpView2") {
+      withTable("tab1", "tab2") {
+        sql(
+          """
+            |CREATE TABLE `tab1` (`col1` INT, `TDATE` DATE)
+            |USING CSV
+            |PARTITIONED BY (TDATE)
+          """.stripMargin)
+        spark.table("tab1").where("TDATE >= '2017-08-15'").createOrReplaceTempView("tmpView1")
+        sql("CREATE TABLE `tab2` (`TDATE` DATE) USING parquet")
+        sql(
+          """
+            |CREATE OR REPLACE TEMPORARY VIEW tmpView2 AS
+            |SELECT N.tdate, col1 AS aliasCol1
+            |FROM tmpView1 N
+            |JOIN tab2 Z
+            |ON N.tdate = Z.tdate
+          """.stripMargin)
+        withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") {
+          sql("SELECT * FROM tmpView2 x JOIN tmpView2 y ON x.tdate = y.tdate").collect()
+        }
+      }
+    }
+  }
+
+  test("self join with aliases on partitioned tables #2") {
+    withTempView("tmp") {
+      withTable("tab1", "tab2") {
+        sql(
+          """
+            |CREATE TABLE `tab1` (`EX` STRING, `TDATE` DATE)
+            |USING parquet
+            |PARTITIONED BY (tdate)
+          """.stripMargin)
+        sql("CREATE TABLE `tab2` (`TDATE` DATE) USING parquet")
+        sql(
+          """
+            |CREATE OR REPLACE TEMPORARY VIEW TMP as
+            |SELECT N.tdate, EX AS new_ex
+            |FROM tab1 N
+            |JOIN tab2 Z
+            |ON N.tdate = Z.tdate
--- End diff --

nit: `ON N.tdate = Z.tdate`
[GitHub] spark pull request #22987: [SPARK-25979][SQL] Window function: allow parenth...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22987#discussion_r232135028

--- Diff: sql/core/src/test/resources/sql-tests/inputs/window.sql ---
@@ -109,3 +109,9 @@ last_value(false, false) OVER w AS last_value_contain_null
 FROM testData
 WINDOW w AS ()
 ORDER BY cate, val;
+
+-- parentheses around window reference
+SELECT cate, sum(val) OVER (w)
+FROM testData
+WHERE val is not null
+WINDOW w AS (PARTITION BY cate ORDER BY val);
--- End diff --

need a new line at the end.
[GitHub] spark pull request #22987: [SPARK-25979][SQL] Window function: allow parenth...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22987#discussion_r231992909

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/SQLWindowFunctionSuite.scala ---
@@ -31,6 +32,19 @@ class SQLWindowFunctionSuite extends QueryTest with SharedSQLContext {

   import testImplicits._

+  val empSalaryData = Seq(
--- End diff --

We really just need a simple test that proves `over (w)` works.
[GitHub] spark pull request #22987: [SPARK-25979][SQL] Window function: allow parenth...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22987#discussion_r231992777

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/SQLWindowFunctionSuite.scala ---
@@ -31,6 +32,19 @@ class SQLWindowFunctionSuite extends QueryTest with SharedSQLContext {

   import testImplicits._

+  val empSalaryData = Seq(
--- End diff --

can we just add a new test in `window.sql`?
[GitHub] spark issue #22978: [SPARK-25676][SQL][FOLLOWUP] Use 'foreach(_ => ())'
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22978

thanks, merging to master!
[GitHub] spark pull request #22984: [minor] update HiveExternalCatalogVersionsSuite t...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22984#discussion_r231922622

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala ---
@@ -206,7 +206,7 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils {

 object PROCESS_TABLES extends QueryTest with SQLTestUtils {
   // Tests the latest version of every release line.
-  val testingVersions = Seq("2.1.3", "2.2.2", "2.3.2")
+  val testingVersions = Seq("2.1.3", "2.2.2", "2.3.2", "2.4.0")
--- End diff --

when will we drop 2.1 officially?
[GitHub] spark issue #22984: [minor] update HiveExternalCatalogVersionsSuite to test ...
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22984

cc @gatorsmile @srowen
[GitHub] spark pull request #22984: [minor] update HiveExternalCatalogVersionsSuite t...
GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/22984

[minor] update HiveExternalCatalogVersionsSuite to test 2.4.0

## What changes were proposed in this pull request?

Since Spark 2.4.0 is released, we should test it in HiveExternalCatalogVersionsSuite.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark minor

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22984.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22984

commit b9817b4e882a8fafc4a8eba498b969e513e64fa3
Author: Wenchen Fan
Date: 2018-11-08T14:58:54Z

    update HiveExternalCatalogVersionsSuite to test 2.4.0
[GitHub] spark pull request #22977: [BUILD] Bump previousSparkVersion in MimaBuild.sc...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22977#discussion_r231872904

--- Diff: project/MimaExcludes.scala ---
@@ -84,7 +84,17 @@ object MimaExcludes {
       ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.streaming.api.java.JavaPairDStream.flatMapValues"),
       // [SPARK-25680] SQL execution listener shouldn't happen on execution thread
       ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.sql.util.ExecutionListenerManager.clone"),
-      ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.util.ExecutionListenerManager.this")
+      ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.util.ExecutionListenerManager.this"),
+      // [SPARK-25862][SQL] Remove rangeBetween APIs introduced in SPARK-21608
+      ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.functions.unboundedFollowing"),
+      ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.functions.unboundedPreceding"),
+      ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.functions.currentRow"),
+      ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.sql.expressions.Window.rangeBetween"),
+      ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.sql.expressions.WindowSpec.rangeBetween"),
+      // [SPARK-23781][CORE] Merge token renewer functionality into HadoopDelegationTokenManager
+      ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.deploy.SparkHadoopUtil.nextCredentialRenewalTime"),
--- End diff --

This is actually a private method; I'm not sure why mima tracks it.
[GitHub] spark pull request #22977: [BUILD] Bump previousSparkVersion in MimaBuild.sc...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22977#discussion_r231866958

--- Diff: project/MimaBuild.scala ---
@@ -88,7 +88,7 @@ object MimaBuild {
   def mimaSettings(sparkHome: File, projectRef: ProjectRef) = {
     val organization = "org.apache.spark"
-    val previousSparkVersion = "2.2.0"
+    val previousSparkVersion = "2.4.0"
--- End diff --

To my understanding, we should have changed it to 2.3.0 when 2.3.0 was released. Maybe we should send another PR to branch-2.4 and change it to 2.3.0.
[GitHub] spark issue #22977: [BUILD] Bump previousSparkVersion in MimaBuild.scala to ...
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22977

I think one major issue is that there is no document about how to update mima with new releases. Does anyone know the detailed process? It seems we need to update `MimaExcludes.scala` with something like `lazy val v30excludes = v24excludes ++ Seq` when cutting the new branch, and update `MimaBuild.scala` when the new release is published.

And there are 2 remaining issues:
1. The data source v2 changes broke a lot of mima rules, while I expect interfaces marked as "Evolving" should not be tracked by mima.
2. mllib broke a lot of mima rules, seemingly caused by https://github.com/apache/spark/pull/22921 . @srowen can you take a look?

Also cc @JoshRosen @gatorsmile @shaneknapp @vanzin @holdenk @felixcheung
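The two-step process described in this comment can be sketched as a pair of config fragments. This is an illustration assembled from the patterns quoted elsewhere in this thread, not the exact diff (the example `ProblemFilters` entry is hypothetical):

```scala
// project/MimaExcludes.scala -- when master becomes the next major version,
// start a new exclude list that inherits the previous release's excludes:
lazy val v30excludes = v24excludes ++ Seq(
  // hypothetical example entry:
  // ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.some.RemovedApi")
)

// project/MimaBuild.scala -- once the new release is published to Maven,
// compare binary compatibility against it:
val previousSparkVersion = "2.4.0"
```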
[GitHub] spark pull request #22977: [BUILD] Bump previousSparkVersion in MimaBuild.sc...
GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/22977

[BUILD] Bump previousSparkVersion in MimaBuild.scala to be 2.4.0

## What changes were proposed in this pull request?

Since Spark 2.4.0 is already in the Maven repo, we can bump previousSparkVersion in MimaBuild.scala to 2.4.0.

Note that it seems we forgot to do this for 2.4.0, so this PR also updates MimaExcludes.scala.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark mima

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22977.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22977

commit a0357782e75967d309dc5229b3c36a0c295f2956
Author: Wenchen Fan
Date: 2018-11-08T11:34:31Z

    Bump previousSparkVersion in MimaBuild.scala to be 2.2.0
[GitHub] spark pull request #22970: [SPARK-25676][FOLLOWUP][BUILD] Fix Scala 2.12 bui...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22970#discussion_r231833555

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/WideTableBenchmark.scala ---
@@ -42,7 +43,7 @@ object WideTableBenchmark extends SqlBasedBenchmark {
     Seq("10", "100", "1024", "2048", "4096", "8192", "65536").foreach { n =>
       benchmark.addCase(s"split threshold $n", numIters = 5) { iter =>
         withSQLConf(SQLConf.CODEGEN_METHOD_SPLIT_THRESHOLD.key -> n) {
-          df.selectExpr(columns: _*).foreach(identity(_))
+          df.selectExpr(columns: _*).foreach((x => x): Row => Unit)
--- End diff --

shall we use `foreach(_ => ())`?
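The overload problem behind this change can be reproduced without Spark. Below is a hedged mini-example: `MiniDataset` and its `ForeachFunction` are stand-ins invented here (not Spark's actual classes, whose `foreach` returns `Unit`; the `String` return value is only so the example can show which overload was picked). Under Scala 2.12's SAM conversion a bare lambda may match both overloads, so ascribing the function type, as the patch does, pins the Scala overload:

```scala
// Stand-in for the two Dataset.foreach overloads: a Scala function overload
// and a Java-friendly single-abstract-method interface overload.
trait ForeachFunction[T] { def call(t: T): Unit }

class MiniDataset[T](data: Seq[T]) {
  def foreach(f: T => Unit): String = { data.foreach(f); "scala-overload" }
  def foreach(f: ForeachFunction[T]): String = { data.foreach(x => f.call(x)); "java-overload" }
}

val ds = new MiniDataset(Seq(1, 2, 3))

// A bare `ds.foreach(x => ())` may be ambiguous under 2.12, since the lambda
// can also SAM-convert to ForeachFunction. Ascribing the function type
// (the same trick as `(x => x): Row => Unit` in the diff) disambiguates:
val which = ds.foreach((_ => ()): Int => Unit)
```

The ascribed argument has type `Int => Unit`, an exact match for the first overload, so resolution succeeds on both 2.11 and 2.12.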
[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22938#discussion_r231762733

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -550,15 +550,33 @@ case class JsonToStructs(
       s"Input schema ${nullableSchema.catalogString} must be a struct, an array or a map.")
   }

-  // This converts parsed rows to the desired output by the given schema.
   @transient
-  lazy val converter = nullableSchema match {
-    case _: StructType =>
-      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next() else null
-    case _: ArrayType =>
-      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getArray(0) else null
-    case _: MapType =>
-      (rows: Iterator[InternalRow]) => if (rows.hasNext) rows.next().getMap(0) else null
+  private lazy val castRow = nullableSchema match {
+    case _: StructType => (row: InternalRow) => row
+    case _: ArrayType => (row: InternalRow) =>
+      if (row.isNullAt(0)) {
+        new GenericArrayData(Array())
--- End diff --

I think this is the place where `from_json` differs from the json data source. A data source must produce data as rows, while `from_json` can return an array or a map.

I think the previous behavior also makes sense. For array/map, we don't have the corrupted column, and returning null is reasonable. Actually I prefer null over an empty array/map, but we need more discussion about this behavior.
[GitHub] spark pull request #22938: [SPARK-25935][SQL] Prevent null rows from JSON pa...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22938#discussion_r231745125

--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -15,6 +15,8 @@ displayTitle: Spark SQL Upgrading Guide
   - Since Spark 3.0, the `from_json` functions supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions but Spark 3.0 converts it to `Row(null)`.
+  - In Spark version 2.4 and earlier, JSON data source and the `from_json` function produced `null`s if there is no valid root JSON token in its input (` ` for example). Since Spark 3.0, such input is treated as a bad record and handled according to specified mode. For example, in the `PERMISSIVE` mode the ` ` input is converted to `Row(null, null)` if specified schema is `key STRING, value INT`.
--- End diff --

Just out of curiosity, how can the json data source return null rows?
[GitHub] spark issue #22958: [SPARK-25952][SQL] Passing actual schema to JacksonParse...
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22958

good catch! do we need to fix the CSV side?
[GitHub] spark issue #22818: [SPARK-25904][CORE] Allocate arrays smaller than Int.Max...
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22818

since this is a bug fix, shall we also backport it?
[GitHub] spark pull request #22943: [SPARK-25098][SQL] Trim the string when cast stri...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22943#discussion_r231382309

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala ---
@@ -140,16 +140,10 @@ class DateTimeUtilsSuite extends SparkFunSuite {
     c = Calendar.getInstance()
     c.set(2015, 2, 18, 0, 0, 0)
     c.set(Calendar.MILLISECOND, 0)
-    assert(stringToDate(UTF8String.fromString("2015-03-18")).get ===
-      millisToDays(c.getTimeInMillis))
-    assert(stringToDate(UTF8String.fromString("2015-03-18 ")).get ===
-      millisToDays(c.getTimeInMillis))
-    assert(stringToDate(UTF8String.fromString("2015-03-18 123142")).get ===
-      millisToDays(c.getTimeInMillis))
-    assert(stringToDate(UTF8String.fromString("2015-03-18T123123")).get ===
-      millisToDays(c.getTimeInMillis))
-    assert(stringToDate(UTF8String.fromString("2015-03-18T")).get ===
-      millisToDays(c.getTimeInMillis))
+    Seq("2015-03-18", "2015-03-18 ", " 2015-03-18", " 2015-03-18 ", "2015-03-18 123142",
--- End diff --

ah i see
[GitHub] spark pull request #22943: [SPARK-25098][SQL] Trim the string when cast stri...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22943#discussion_r231380552

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala ---
@@ -140,16 +140,10 @@ class DateTimeUtilsSuite extends SparkFunSuite {
     c = Calendar.getInstance()
     c.set(2015, 2, 18, 0, 0, 0)
     c.set(Calendar.MILLISECOND, 0)
-    assert(stringToDate(UTF8String.fromString("2015-03-18")).get ===
-      millisToDays(c.getTimeInMillis))
-    assert(stringToDate(UTF8String.fromString("2015-03-18 ")).get ===
-      millisToDays(c.getTimeInMillis))
-    assert(stringToDate(UTF8String.fromString("2015-03-18 123142")).get ===
-      millisToDays(c.getTimeInMillis))
-    assert(stringToDate(UTF8String.fromString("2015-03-18T123123")).get ===
-      millisToDays(c.getTimeInMillis))
-    assert(stringToDate(UTF8String.fromString("2015-03-18T")).get ===
-      millisToDays(c.getTimeInMillis))
+    Seq("2015-03-18", "2015-03-18 ", " 2015-03-18", " 2015-03-18 ", "2015-03-18 123142",
--- End diff --

the test result doesn't change?
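The lenient parsing being tested here, accepting leading/trailing whitespace and a trailing time section, can be sketched with `java.time`. This is a simplified stand-in, not Spark's actual `DateTimeUtils.stringToDate` (the helper name and logic below are invented for illustration):

```scala
// Hedged sketch: trim the input, keep only the date prefix (stop at a 'T' or
// a space that introduces a time part), then parse it as an ISO local date.
import java.time.LocalDate

def stringToDate(s: String): Option[LocalDate] = {
  val trimmed = s.trim
  val datePart = trimmed.takeWhile(c => c != 'T' && c != ' ')
  try Some(LocalDate.parse(datePart)) catch { case _: Exception => None }
}

// All the variants from the test above should parse to the same date.
val inputs = Seq("2015-03-18", "2015-03-18 ", " 2015-03-18", " 2015-03-18 ",
  "2015-03-18 123142", "2015-03-18T123123", "2015-03-18T")
val parsed = inputs.map(stringToDate)
```

This mirrors why the rewritten test can loop over all the variants with one expected result.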
[GitHub] spark issue #22956: [SPARK-25950][SQL] from_csv should respect to spark.sql....
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22956

LGTM
[GitHub] spark pull request #22956: [SPARK-25950][SQL] from_csv should respect to spa...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22956#discussion_r231359024

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala ---
@@ -92,8 +93,14 @@ case class CsvToStructs(
     }
   }

+  val nameOfCorruptRecord = SQLConf.get.getConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD)
--- End diff --

should this be private?
[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22944#discussion_r231358749

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala ---
@@ -262,25 +262,39 @@ object AppendColumns {
   def apply[T : Encoder, U : Encoder](
       func: T => U,
       child: LogicalPlan): AppendColumns = {
+    val outputEncoder = encoderFor[U]
+    val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+      assert(outputEncoder.namedExpressions.length == 1)
+      outputEncoder.namedExpressions.map(Alias(_, "key")())
+    } else {
+      outputEncoder.namedExpressions
+    }
     new AppendColumns(
       func.asInstanceOf[Any => Any],
       implicitly[Encoder[T]].clsTag.runtimeClass,
       implicitly[Encoder[T]].schema,
       UnresolvedDeserializer(encoderFor[T].deserializer),
-      encoderFor[U].namedExpressions,
+      namedExpressions,
       child)
   }

   def apply[T : Encoder, U : Encoder](
       func: T => U,
       inputAttributes: Seq[Attribute],
       child: LogicalPlan): AppendColumns = {
+    val outputEncoder = encoderFor[U]
+    val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+      assert(outputEncoder.namedExpressions.length == 1)
+      outputEncoder.namedExpressions.map(Alias(_, "key")())
+    } else {
+      outputEncoder.namedExpressions
--- End diff --

I wouldn't special-case primitive types, since this is a general problem.
[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22944#discussion_r231201502

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala ---
@@ -262,25 +262,39 @@ object AppendColumns {
   def apply[T : Encoder, U : Encoder](
       func: T => U,
       child: LogicalPlan): AppendColumns = {
+    val outputEncoder = encoderFor[U]
+    val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+      assert(outputEncoder.namedExpressions.length == 1)
+      outputEncoder.namedExpressions.map(Alias(_, "key")())
+    } else {
+      outputEncoder.namedExpressions
+    }
     new AppendColumns(
       func.asInstanceOf[Any => Any],
       implicitly[Encoder[T]].clsTag.runtimeClass,
       implicitly[Encoder[T]].schema,
       UnresolvedDeserializer(encoderFor[T].deserializer),
-      encoderFor[U].namedExpressions,
+      namedExpressions,
       child)
   }

   def apply[T : Encoder, U : Encoder](
       func: T => U,
       inputAttributes: Seq[Attribute],
       child: LogicalPlan): AppendColumns = {
+    val outputEncoder = encoderFor[U]
+    val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+      assert(outputEncoder.namedExpressions.length == 1)
+      outputEncoder.namedExpressions.map(Alias(_, "key")())
+    } else {
+      outputEncoder.namedExpressions
--- End diff --

I mean, maybe we should just leave this problem. I'm not sure how hacky it would be to detect the `AppendColumns` in this case. Maybe we can have more confidence if you have a PR ready.
[GitHub] spark issue #22873: [SPARK-25866][ML] Update KMeans formatVersion
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22873

thanks, merging to master/2.4!
[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22944#discussion_r231159305 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala --- @@ -262,25 +262,39 @@ object AppendColumns { def apply[T : Encoder, U : Encoder]( func: T => U, child: LogicalPlan): AppendColumns = { +val outputEncoder = encoderFor[U] +val namedExpressions = if (!outputEncoder.isSerializedAsStruct) { + assert(outputEncoder.namedExpressions.length == 1) + outputEncoder.namedExpressions.map(Alias(_, "key")()) +} else { + outputEncoder.namedExpressions +} new AppendColumns( func.asInstanceOf[Any => Any], implicitly[Encoder[T]].clsTag.runtimeClass, implicitly[Encoder[T]].schema, UnresolvedDeserializer(encoderFor[T].deserializer), - encoderFor[U].namedExpressions, + namedExpressions, child) } def apply[T : Encoder, U : Encoder]( func: T => U, inputAttributes: Seq[Attribute], child: LogicalPlan): AppendColumns = { +val outputEncoder = encoderFor[U] +val namedExpressions = if (!outputEncoder.isSerializedAsStruct) { + assert(outputEncoder.namedExpressions.length == 1) + outputEncoder.namedExpressions.map(Alias(_, "key")()) +} else { + outputEncoder.namedExpressions --- End diff -- ah i see. Then maybe we should just leave it instead of hacking the `AttributeSeq.resolve`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22944#discussion_r231129654 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala --- @@ -262,25 +262,39 @@ object AppendColumns { def apply[T : Encoder, U : Encoder]( func: T => U, child: LogicalPlan): AppendColumns = { +val outputEncoder = encoderFor[U] +val namedExpressions = if (!outputEncoder.isSerializedAsStruct) { + assert(outputEncoder.namedExpressions.length == 1) + outputEncoder.namedExpressions.map(Alias(_, "key")()) +} else { + outputEncoder.namedExpressions +} new AppendColumns( func.asInstanceOf[Any => Any], implicitly[Encoder[T]].clsTag.runtimeClass, implicitly[Encoder[T]].schema, UnresolvedDeserializer(encoderFor[T].deserializer), - encoderFor[U].namedExpressions, + namedExpressions, child) } def apply[T : Encoder, U : Encoder]( func: T => U, inputAttributes: Seq[Attribute], child: LogicalPlan): AppendColumns = { +val outputEncoder = encoderFor[U] +val namedExpressions = if (!outputEncoder.isSerializedAsStruct) { + assert(outputEncoder.namedExpressions.length == 1) + outputEncoder.namedExpressions.map(Alias(_, "key")()) +} else { + outputEncoder.namedExpressions --- End diff -- > Should we only fail the groupByKey query accessing ambiguous field names? Yes. When we have unresolved attributes, check if the child plan is `AppendColumns`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22944#discussion_r230997935 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala --- @@ -262,25 +262,39 @@ object AppendColumns { def apply[T : Encoder, U : Encoder]( func: T => U, child: LogicalPlan): AppendColumns = { +val outputEncoder = encoderFor[U] +val namedExpressions = if (!outputEncoder.isSerializedAsStruct) { + assert(outputEncoder.namedExpressions.length == 1) + outputEncoder.namedExpressions.map(Alias(_, "key")()) +} else { + outputEncoder.namedExpressions +} new AppendColumns( func.asInstanceOf[Any => Any], implicitly[Encoder[T]].clsTag.runtimeClass, implicitly[Encoder[T]].schema, UnresolvedDeserializer(encoderFor[T].deserializer), - encoderFor[U].namedExpressions, + namedExpressions, child) } def apply[T : Encoder, U : Encoder]( func: T => U, inputAttributes: Seq[Attribute], child: LogicalPlan): AppendColumns = { +val outputEncoder = encoderFor[U] +val namedExpressions = if (!outputEncoder.isSerializedAsStruct) { + assert(outputEncoder.namedExpressions.length == 1) + outputEncoder.namedExpressions.map(Alias(_, "key")()) +} else { + outputEncoder.namedExpressions --- End diff -- is this a special case of Option of Product? can you try primitive type and product type? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22547: [SPARK-25528][SQL] data source V2 read side API r...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22547#discussion_r230989559 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousInputStream.scala --- @@ -46,17 +45,22 @@ import org.apache.spark.sql.types.StructType * scenarios, where some offsets after the specified initial ones can't be * properly read. */ -class KafkaContinuousReadSupport( +class KafkaContinuousInputStream( --- End diff -- Yea I'll separate this PR into 3 smaller ones, after we have agreed on the high-level design at https://docs.google.com/document/d/1uUmKCpWLdh9vHxP7AWJ9EgbwB_U6T3EJYNjhISGmiQg/edit?usp=sharing --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22944#discussion_r230989073 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala --- @@ -262,25 +262,39 @@ object AppendColumns { def apply[T : Encoder, U : Encoder]( func: T => U, child: LogicalPlan): AppendColumns = { +val outputEncoder = encoderFor[U] +val namedExpressions = if (!outputEncoder.isSerializedAsStruct) { + assert(outputEncoder.namedExpressions.length == 1) + outputEncoder.namedExpressions.map(Alias(_, "key")()) +} else { + outputEncoder.namedExpressions +} new AppendColumns( func.asInstanceOf[Any => Any], implicitly[Encoder[T]].clsTag.runtimeClass, implicitly[Encoder[T]].schema, UnresolvedDeserializer(encoderFor[T].deserializer), - encoderFor[U].namedExpressions, + namedExpressions, child) } def apply[T : Encoder, U : Encoder]( func: T => U, inputAttributes: Seq[Attribute], child: LogicalPlan): AppendColumns = { +val outputEncoder = encoderFor[U] +val namedExpressions = if (!outputEncoder.isSerializedAsStruct) { + assert(outputEncoder.namedExpressions.length == 1) + outputEncoder.namedExpressions.map(Alias(_, "key")()) +} else { + outputEncoder.namedExpressions --- End diff -- We can improve `CheckAnalysis` to detect this case, and improve the error message to ask users to add an alias. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22944#discussion_r230986226 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala --- @@ -262,25 +262,39 @@ object AppendColumns { def apply[T : Encoder, U : Encoder]( func: T => U, child: LogicalPlan): AppendColumns = { +val outputEncoder = encoderFor[U] +val namedExpressions = if (!outputEncoder.isSerializedAsStruct) { + assert(outputEncoder.namedExpressions.length == 1) + outputEncoder.namedExpressions.map(Alias(_, "key")()) +} else { + outputEncoder.namedExpressions +} new AppendColumns( func.asInstanceOf[Any => Any], implicitly[Encoder[T]].clsTag.runtimeClass, implicitly[Encoder[T]].schema, UnresolvedDeserializer(encoderFor[T].deserializer), - encoderFor[U].namedExpressions, + namedExpressions, child) } def apply[T : Encoder, U : Encoder]( func: T => U, inputAttributes: Seq[Attribute], child: LogicalPlan): AppendColumns = { +val outputEncoder = encoderFor[U] +val namedExpressions = if (!outputEncoder.isSerializedAsStruct) { + assert(outputEncoder.namedExpressions.length == 1) + outputEncoder.namedExpressions.map(Alias(_, "key")()) +} else { + outputEncoder.namedExpressions --- End diff -- if that is the case, I feel it's better to ask users to resolve the conflict manually, by adding an alias. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22944#discussion_r230977212 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala --- @@ -262,25 +262,39 @@ object AppendColumns { def apply[T : Encoder, U : Encoder]( func: T => U, child: LogicalPlan): AppendColumns = { +val outputEncoder = encoderFor[U] +val namedExpressions = if (!outputEncoder.isSerializedAsStruct) { + assert(outputEncoder.namedExpressions.length == 1) + outputEncoder.namedExpressions.map(Alias(_, "key")()) +} else { + outputEncoder.namedExpressions +} new AppendColumns( func.asInstanceOf[Any => Any], implicitly[Encoder[T]].clsTag.runtimeClass, implicitly[Encoder[T]].schema, UnresolvedDeserializer(encoderFor[T].deserializer), - encoderFor[U].namedExpressions, + namedExpressions, child) } def apply[T : Encoder, U : Encoder]( func: T => U, inputAttributes: Seq[Attribute], child: LogicalPlan): AppendColumns = { +val outputEncoder = encoderFor[U] +val namedExpressions = if (!outputEncoder.isSerializedAsStruct) { + assert(outputEncoder.namedExpressions.length == 1) + outputEncoder.namedExpressions.map(Alias(_, "key")()) +} else { + outputEncoder.namedExpressions --- End diff -- so we may still fail if `T` and `U` are case classes and have conflicting field names? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22928: [SPARK-25926][CORE] Move config entries in core module t...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22928 thanks, merging to master! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22946: [SPARK-25943][SQL] Fail if mismatching nested struct fie...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22946 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22946: [SPARK-25943][SQL] Fail if mismatching nested struct fie...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22946 ah good catch! Can you also add a test? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22502: [SPARK-25474][SQL]When the "fallBackToHdfsForStats= true...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22502 @shahidki31 thanks for fixing it! Do you know where we currently read `fallBackToHdfsForStats`, and whether we can have a unified place to do it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22889: [SPARK-25882][SQL] Added a function to join two datasets...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22889 I think the problem is real, maybe we should not use `Seq` in the end-user API, but always use Array to be more Java-friendly. This can also avoid bugs like https://github.com/apache/spark/pull/22789 cc @rxin @hvanhovell what do you think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22949: [minor] update known_translations
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22949 Note that, these updates are generated by the script not me. If someone is not in the list, it means the script can figure out the full name without translation. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22949: [minor] update known_translations
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22949#discussion_r230970453 --- Diff: dev/create-release/known_translations --- @@ -203,3 +203,61 @@ shenh062326 - Shen Hong aokolnychyi - Anton Okolnychyi linbojin - Linbo Jin lw-lin - Liwei Lin +10110346 - Xian Liu +Achuth17 - Achuth Narayan Rajagopal +Adamyuanyuan - Adam Wang +DylanGuedes - Dylan Guedes +JiahuiJiang - Jiahui Jiang +KevinZwx - Kevin Zhang +LantaoJin - Lantao Jin +Lemonjing - Rann Tao +LucaCanali - Luca Canali +XD-DENG - Xiaodong Deng +aai95 - Aleksei Izmalkin +akonopko - Alexander Konopko +ankuriitg - Ankur Gupta +arucard21 - Riaas Mokiem +attilapiros - Attila Zsolt Piros +bravo-zhang - Bravo Zhang +caneGuy - Kang Zhou +chaoslawful - Xiaozhe Wang +cluo512 - Chuan Luo +codeatri - Neha Patil +crafty-coder - Carlos Pena +debugger87 - Chaozhong Yang +e-dorigatti - Emilio Dorigatti +eric-maynard - Eric Maynard +felixalbani - Felix Albani +fjh100456 - fjh100456 --- End diff -- ah I missed this one. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22949: [minor] update known_translations
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22949 cc @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22949: [minor] update known_translations
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/22949 [minor] update known_translations ## What changes were proposed in this pull request? update known_translations after running `translate-contributors.py` during 2.4.0 release ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark contributors Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22949.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22949 commit 6cd37374fc1917fe4c590304183521e9de3c4d23 Author: Wenchen Fan Date: 2018-11-05T14:50:52Z update known_translations --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22944#discussion_r230772018 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala --- @@ -1556,6 +1556,14 @@ class DatasetSuite extends QueryTest with SharedSQLContext { df.where($"city".contains(new java.lang.Character('A'))), Seq(Row("Amsterdam"))) } + + test("SPARK-25942: typed aggregation on primitive data") { --- End diff -- how was this bug introduced? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22771: [SPARK-25773][Core]Cancel zombie tasks in a resul...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22771#discussion_r230730694 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1364,6 +1385,21 @@ private[spark] class DAGScheduler( if (job.numFinished == job.numPartitions) { markStageAsFinished(resultStage) cleanupStateForJobAndIndependentStages(job) +try { + // killAllTaskAttempts will fail if a SchedulerBackend does not implement + // killTask. + logInfo(s"Job ${job.jobId} is finished. Cancelling potential speculative " + +"or zombie tasks for this job") + // ResultStage is only used by this job. It's safe to kill speculative or + // zombie tasks in this stage. + taskScheduler.killAllTaskAttempts( --- End diff -- cc @jiangxb1987 IIRC we have some similar code in barrier execution. Shall we create a util method to safely kill tasks? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22847: [SPARK-25850][SQL] Make the split threshold for the code...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22847 thanks, merging to master! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21732: [SPARK-24762][SQL] Enable Option of Product encoders
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21732 I think this is close, can you answer https://github.com/apache/spark/pull/21732/files#r228782670 ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21732: [SPARK-24762][SQL] Enable Option of Product encod...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21732#discussion_r230726457 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala --- @@ -1556,6 +1547,54 @@ class DatasetSuite extends QueryTest with SharedSQLContext { df.where($"city".contains(new java.lang.Character('A'))), Seq(Row("Amsterdam"))) } + + test("SPARK-24762: Enable top-level Option of Product encoders") { +val data = Seq(Some((1, "a")), Some((2, "b")), None) +val ds = data.toDS() + +checkDataset( + ds, + data: _*) + +val schema = StructType(Seq( --- End diff -- can we use the `add` API? e.g. ``` new StructType().add( "value", new StructType() .add("_1", ...) .add("_2", ...)) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
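The fluent `add` style suggested above can be illustrated with a small, self-contained sketch — `Struct` and `Field` here are hypothetical toy types, not Spark's actual `StructType`/`StructField` API:

```scala
// Toy model of the fluent builder pattern: each add returns a new value,
// so nested schemas compose without hand-building Seq(...) literals.
final case class Field(name: String, dataType: Any)
final case class Struct(fields: Vector[Field] = Vector.empty) {
  def add(name: String, dataType: Any): Struct =
    copy(fields = fields :+ Field(name, dataType))
}

val schema = Struct()
  .add("value", Struct()
    .add("_1", "int")
    .add("_2", "string"))
```

The design point is readability: the nested call chain mirrors the shape of the schema, which is why the review asks for it over an explicit `StructType(Seq(...))` construction.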
[GitHub] spark pull request #21732: [SPARK-24762][SQL] Enable Option of Product encod...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21732#discussion_r230726136 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala --- @@ -393,4 +431,18 @@ class DatasetAggregatorSuite extends QueryTest with SharedSQLContext { assert(grouped.schema == df.schema) checkDataset(grouped.as[OptionBooleanData], OptionBooleanData("bob", Some(true))) } + + test("SPARK-24762: Aggregator should be able to use Option of Product encoder") { +val df = Seq( + OptionBooleanIntData("bob", Some((true, 1))), + OptionBooleanIntData("bob", Some((false, 2))), + OptionBooleanIntData("bob", None)).toDF() + +val group = df + .groupBy("name") + .agg(OptionBooleanIntAggregator("isGood").toColumn.alias("isGood")) +assert(df.schema == group.schema) --- End diff -- let's write down the expected schema ``` val expectedSchema = ... assert(df.schema == expectedSchema) assert(grouped.schema == ...) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22889: [SPARK-25882][SQL] Added a function to join two datasets...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22889 So we introduce a new API just to save typing `Seq(...)`? Maintaining an API has cost. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22928: [SPARK-25926][CORE] Move config entries in core module t...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22928 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22928: [SPARK-25926][CORE] Move config entries in core module t...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22928 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22923: [SPARK-25910][CORE] accumulator updates from previous st...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22923 > We need to always update user accumulators Ah that's a good point. I'm going to close it. I missed one thing: the `AppStatusListener` will keep the `StageInfo` instance until all tasks of that stage attempt are finished. See https://github.com/apache/spark/pull/22209 So this bug should already have been fixed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22923: [SPARK-25910][CORE] accumulator updates from prev...
Github user cloud-fan closed the pull request at: https://github.com/apache/spark/pull/22923 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22764: [SPARK-25765][ML] Add training cost to BisectingKMeans s...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22764 cc @dbtsai --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22786: [SPARK-25764][ML][EXAMPLES] Update BisectingKMeans examp...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22786 cc @dbtsai --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22869: [SPARK-25758][ML] Deprecate computeCost in BisectingKMea...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22869 cc @dbtsai --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Documents '-I' option (from ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22919#discussion_r230686892 --- Diff: bin/spark-shell --- @@ -32,7 +32,10 @@ if [ -z "${SPARK_HOME}" ]; then source "$(dirname "$0")"/find-spark-home fi -export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]" +export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options] + +Scala REPL options: + -I <file> preload <file>, enforcing line-by-line interpretation" --- End diff -- I mean, I didn't find ``` Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, k8s://https://host:port, or local (Default: local[*]). --deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). ``` in the shell script. Where do we define them? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22942: [SPARK-25884][SQL][FOLLOW-UP] Add sample.json back.
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22942 thanks, merging to master! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22919: [SPARK-25906][SHELL] Documents '-I' option (from ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22919#discussion_r230655513 --- Diff: bin/spark-shell --- @@ -32,7 +32,10 @@ if [ -z "${SPARK_HOME}" ]; then source "$(dirname "$0")"/find-spark-home fi -export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]" +export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options] + +Scala REPL options: + -I <file> preload <file>, enforcing line-by-line interpretation" --- End diff -- where do we define other options? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22927: [SPARK-25918][SQL] LOAD DATA LOCAL INPATH should handle ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22927 I'll list it as a known issue in 2.4.0, thanks for fixing it! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22847: [SPARK-25850][SQL] Make the split threshold for the code...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22847 did you address https://github.com/apache/spark/pull/22847#issuecomment-434836278 ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22029: [SPARK-24395][SQL] IN operator should return NULL when c...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22029 If we decide to follow PostgreSQL about the EQUAL behavior eventually, then it will be much easier to fix the IN behavior, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
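For context on the semantics being debated in SPARK-24395, here is a self-contained sketch of SQL's three-valued IN logic — an illustration in plain Scala (with `Option` modeling nullable values), not Spark's implementation: `v IN (list)` is TRUE if some element compares equal, FALSE if every comparison is definitively false, and NULL (unknown) otherwise.

```scala
// SQL equality: NULL compared with anything is unknown (None).
def sqlEq(a: Option[Int], b: Option[Int]): Option[Boolean] =
  for (x <- a; y <- b) yield x == y

// SQL IN over a literal list, with three-valued logic:
// Some(true) if any comparison is true, None if none is true but some
// comparison is unknown, Some(false) only when all comparisons are false.
def sqlIn(v: Option[Int], list: Seq[Option[Int]]): Option[Boolean] = {
  val cmps = list.map(sqlEq(v, _))
  if (cmps.contains(Some(true))) Some(true)
  else if (cmps.contains(None)) None
  else Some(false)
}
```

Under these rules, `1 IN (2, NULL)` is NULL rather than FALSE — the behavior change the PR is proposing.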
[GitHub] spark issue #22923: [SPARK-25910][CORE] accumulator updates from previous st...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22923 cc @vanzin @zsxwing @jiangxb1987 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22923: [SPARK-25910][CORE] accumulator updates from prev...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/22923 [SPARK-25910][CORE] accumulator updates from previous stage attempt should not log error ## What changes were proposed in this pull request? For shuffle map stages, we may have multiple attempts, while only the latest attempt is active. However, the scheduler still accepts successful tasks from previous attempts, to speed up the execution. Each stage attempt has a `StageInfo` instance, which contains `TaskMetrics`. `TaskMetrics` has a bunch of accumulators to track metrics like CPU time, etc. However, a stage only keeps the `StageInfo` of the latest attempt, which means the `StageInfo` of previous attempts will be GCed, and the accumulators of their `TaskMetrics` will be cleaned. This causes a problem: When the scheduler accepts a successful task from a previous attempt, and tries to update accumulators, we may fail to get the accumulators from `AccumulatorContext`, as they are already cleaned. And we may hit an error log like ``` 18/10/21 15:30:24 INFO ContextCleaner: Cleaned accumulator 2868 (name: internal.metrics.executorDeserializeTime) 18/10/21 15:30:24 ERROR DAGScheduler: Failed to update accumulators for task 7927 org.apache.spark.SparkException: attempted to access non-existent accumulator 2868 at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1267) ... ``` This PR proposes a simple fix: When the scheduler receives successful tasks from previous attempts, don't update accumulators. Accumulators of previous stage attempts are not tracked anymore, so we don't need to update them. ## How was this patch tested? a new test. 
You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark late-task Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22923.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22923 commit 07f900cf845662186f8d1daea3be9abe2633d5c0 Author: Wenchen Fan Date: 2018-11-01T15:40:14Z accumulator updates from previous stage attempt commit 4d9cbe043604e76b6367e4ecb42d0d36437d1792 Author: Wenchen Fan Date: 2018-11-01T16:04:41Z different fix --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
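The fix described in the PR above can be condensed into a hypothetical, simplified model — `AccumTracker` and `TaskEnd` are toy types standing in for the scheduler machinery, not Spark's actual `DAGScheduler` code:

```scala
import scala.collection.mutable

// A completed task, identified by its stage and stage attempt.
final case class TaskEnd(stageId: Int, stageAttemptId: Int, updates: Map[Long, Long])

// latestAttempt maps stageId -> the currently active attempt id.
class AccumTracker(latestAttempt: Map[Int, Int]) {
  val accums: mutable.Map[Long, Long] =
    mutable.Map.empty[Long, Long].withDefaultValue(0L)

  def onTaskEnd(event: TaskEnd): Unit = {
    // Successful tasks from previous attempts still count toward stage
    // completion, but their accumulator updates are silently dropped,
    // since accumulators of earlier attempts are no longer tracked.
    if (latestAttempt.get(event.stageId).contains(event.stageAttemptId)) {
      event.updates.foreach { case (id, v) => accums(id) += v }
    }
  }
}
```

A task from a stale attempt thus never reaches `AccumulatorContext`-style lookup, which is what avoided the "non-existent accumulator" error in the proposed fix.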
[GitHub] spark issue #22029: [SPARK-24395][SQL] IN operator should return NULL when c...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22029 If we want to follow PostgreSQL/Oracle for the IN behavior, why don't we follow the EQUAL behavior as well? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22029: [SPARK-24395][SQL] IN operator should return NULL when c...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22029 Another point: I think it's also important to make the behavior of IN be consistent with EQUAL. I tried PostgreSQL and `(1, 2) = (3, null)` returns null. Shall we update EQUAL first? The behavior of IN will be updated accordingly after we update `EQUAL`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22919 personally I think Spark Shell should be consistent with the upstream Scala Shell, otherwise we may get another ticket asking why we didn't follow it... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22919 so we would support both `-i` and `-I` in 2.4? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22029: [SPARK-24395][SQL] IN operator should return NULL when c...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22029 Do you know how Presto supports a multi-value IN subquery? By reading the PR description, it seems impossible if Presto treats `(a, b)` as a struct value. How does Presto distinguish `(a, b) IN (select x,y ...)` and `struct_col IN (select x,y ...)`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22029: [SPARK-24395][SQL] IN operator should return NULL when c...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22029 Do you know how Presto supports multi-value in subquery? By reading the PR description, it seems impossible if Preso treats `(a, b)` as a struct value. How Preso distinguish `(a, b) IN (select x,y ...)` and `struct_col IN (select x,y ...)`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22898: [SPARK-25746][SQL][followup] do not add unnecessary If e...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22898 thanks, merging to master!
[GitHub] spark issue #22626: [SPARK-25638][SQL] Adding new function - to_csv()
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22626 This needs to be rebased.
[GitHub] spark issue #22892: [SPARK-25884][SQL] Add TBLPROPERTIES and COMMENT, and us...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22892 thanks, merging to master!
[GitHub] spark issue #22847: [SPARK-25850][SQL] Make the split threshold for the code...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22847 @rednaxelafx ah good point! It's hardcoded as 1024 too, and it's also doing method splitting. Let's apply the config there too.
[GitHub] spark pull request #22029: [SPARK-24395][SQL] IN operator should return NULL...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22029#discussion_r229788060 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -212,27 +212,27 @@ object ReorderAssociativeOperator extends Rule[LogicalPlan] { * 1. Converts the predicate to false when the list is empty and *the value is not nullable. * 2. Removes literal repetitions. - * 3. Replaces [[In (value, seq[Literal])]] with optimized version + * 3. Replaces [[In (values, seq[Literal])]] with optimized version *[[InSet (value, HashSet[Literal])]] which is much faster. */ object OptimizeIn extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { case q: LogicalPlan => q transformExpressionsDown { - case In(v, list) if list.isEmpty => + case i @ In(_, list) if list.isEmpty => // When v is not nullable, the following expression will be optimized // to FalseLiteral which is tested in OptimizeInSuite.scala -If(IsNotNull(v), FalseLiteral, Literal(null, BooleanType)) - case expr @ In(v, list) if expr.inSetConvertible => +If(IsNotNull(i.value), FalseLiteral, Literal(null, BooleanType)) + case expr @ In(_, list) if expr.inSetConvertible => val newList = ExpressionSet(list).toSeq if (newList.length == 1 // TODO: `EqualTo` for structural types are not working. Until SPARK-24443 is addressed, // TODO: we exclude them in this rule. - && !v.isInstanceOf[CreateNamedStructLike] + && !expr.value.isInstanceOf[CreateNamedStructLike] --- End diff -- well, I think for this case we should optimize it. Anyway it follows the previous behavior, we can change it later. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22029: [SPARK-24395][SQL] IN operator should return NULL...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22029#discussion_r22978 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -212,27 +212,27 @@ object ReorderAssociativeOperator extends Rule[LogicalPlan] { * 1. Converts the predicate to false when the list is empty and *the value is not nullable. * 2. Removes literal repetitions. - * 3. Replaces [[In (value, seq[Literal])]] with optimized version + * 3. Replaces [[In (values, seq[Literal])]] with optimized version *[[InSet (value, HashSet[Literal])]] which is much faster. */ object OptimizeIn extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { case q: LogicalPlan => q transformExpressionsDown { - case In(v, list) if list.isEmpty => + case i @ In(_, list) if list.isEmpty => // When v is not nullable, the following expression will be optimized // to FalseLiteral which is tested in OptimizeInSuite.scala -If(IsNotNull(v), FalseLiteral, Literal(null, BooleanType)) - case expr @ In(v, list) if expr.inSetConvertible => +If(IsNotNull(i.value), FalseLiteral, Literal(null, BooleanType)) + case expr @ In(_, list) if expr.inSetConvertible => val newList = ExpressionSet(list).toSeq if (newList.length == 1 // TODO: `EqualTo` for structural types are not working. Until SPARK-24443 is addressed, // TODO: we exclude them in this rule. - && !v.isInstanceOf[CreateNamedStructLike] + && !expr.value.isInstanceOf[CreateNamedStructLike] --- End diff -- for your case, it's not `CreateNamedStructLike`, but just a struct type column? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22905#discussion_r229754853 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -306,7 +306,15 @@ case class FileSourceScanExec( withOptPartitionCount } -withSelectedBucketsCount +val withOptColumnCount = relation.fileFormat match { + case columnar: ColumnarFileFormat => +val sqlConf = relation.sparkSession.sessionState.conf +val columnCount = columnar.columnCountForSchema(sqlConf, requiredSchema) +withSelectedBucketsCount + ("ColumnCount" -> columnCount.toString) --- End diff -- shall we only include this info when the columnar reader is on? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22907: [SPARK-25896][CORE][WIP] Accumulator should only ...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/22907 [SPARK-25896][CORE][WIP] Accumulator should only be updated once for each successful task in shuffle map stage ## What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/19877 For the same reason, we should also update accumulator once for each successful task in shuffle map stage. TODO: 1. `ShuffleMapStage` has `pendingPartitions` and `findMissingPartitions`, I'm not sure which one is the single source of truth 2. When we receive repeated successful shuffle map tasks, seems we will override the previous one. Need a double check. 3. add tests. ## How was this patch tested? TODO You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark accum Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22907.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22907 commit 7876d9c44c15fe1fb64f1d7587c97e23ff2be5a2 Author: Wenchen Fan Date: 2018-10-31T15:23:48Z Accumulator should only be updated once for each successful task in shuffle map stage --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
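The dedup idea in this WIP — applying a task's accumulator updates only once even when speculative or re-run attempts of the same partition also succeed — can be sketched in a few lines. This is a toy illustration of the proposal, not Spark's `DAGScheduler`; the class and method names are made up:

```python
class ShuffleMapStage:
    """Toy model: merge a task's accumulator updates only for the first
    successful attempt of each partition (an illustration of the proposal,
    not Spark code)."""

    def __init__(self):
        self.finished_partitions = set()
        self.accum = 0  # stands in for a long-valued task-metric accumulator

    def on_task_success(self, partition_id, accum_update):
        if partition_id in self.finished_partitions:
            # A speculative or re-run attempt completed after the first
            # success: drop its update so each partition is counted once.
            return
        self.finished_partitions.add(partition_id)
        self.accum += accum_update

stage = ShuffleMapStage()
stage.on_task_success(0, 10)
stage.on_task_success(0, 10)  # duplicate success for partition 0, ignored
stage.on_task_success(1, 5)
print(stage.accum)  # 15
```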
[GitHub] spark pull request #22029: [SPARK-24395][SQL] IN operator should return NULL...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22029#discussion_r229708259 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -212,27 +212,27 @@ object ReorderAssociativeOperator extends Rule[LogicalPlan] { * 1. Converts the predicate to false when the list is empty and *the value is not nullable. * 2. Removes literal repetitions. - * 3. Replaces [[In (value, seq[Literal])]] with optimized version + * 3. Replaces [[In (values, seq[Literal])]] with optimized version *[[InSet (value, HashSet[Literal])]] which is much faster. */ object OptimizeIn extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { case q: LogicalPlan => q transformExpressionsDown { - case In(v, list) if list.isEmpty => + case i @ In(_, list) if list.isEmpty => // When v is not nullable, the following expression will be optimized // to FalseLiteral which is tested in OptimizeInSuite.scala -If(IsNotNull(v), FalseLiteral, Literal(null, BooleanType)) - case expr @ In(v, list) if expr.inSetConvertible => +If(IsNotNull(i.value), FalseLiteral, Literal(null, BooleanType)) + case expr @ In(_, list) if expr.inSetConvertible => val newList = ExpressionSet(list).toSeq if (newList.length == 1 // TODO: `EqualTo` for structural types are not working. Until SPARK-24443 is addressed, // TODO: we exclude them in this rule. - && !v.isInstanceOf[CreateNamedStructLike] + && !expr.value.isInstanceOf[CreateNamedStructLike] --- End diff -- IIUC, you mean `expr.values.length > 1` => `expr.value.isInstanceOf[CreateNamedStructLike]` but `expr.value.isInstanceOf[CreateNamedStructLike]` can't => `expr.values.length > 1` Can you give an example? Based on my understanding, the code here is trying to optimize a case when it's not a multi-value in and the list has only one element. 
[GitHub] spark issue #22905: [SPARK-25894][SQL] Add a ColumnarFileFormat type which r...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22905 is there anything blocked by this? I agree this is a good feature, but it asks the data source to provide a new ability, which may become a problem when migrating file sources to data source v2.
[GitHub] spark pull request #22029: [SPARK-24395][SQL] IN operator should return NULL...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22029#discussion_r229701584 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -212,27 +212,27 @@ object ReorderAssociativeOperator extends Rule[LogicalPlan] { * 1. Converts the predicate to false when the list is empty and *the value is not nullable. * 2. Removes literal repetitions. - * 3. Replaces [[In (value, seq[Literal])]] with optimized version + * 3. Replaces [[In (values, seq[Literal])]] with optimized version *[[InSet (value, HashSet[Literal])]] which is much faster. */ object OptimizeIn extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { case q: LogicalPlan => q transformExpressionsDown { - case In(v, list) if list.isEmpty => + case i @ In(_, list) if list.isEmpty => // When v is not nullable, the following expression will be optimized // to FalseLiteral which is tested in OptimizeInSuite.scala -If(IsNotNull(v), FalseLiteral, Literal(null, BooleanType)) - case expr @ In(v, list) if expr.inSetConvertible => +If(IsNotNull(i.value), FalseLiteral, Literal(null, BooleanType)) + case expr @ In(_, list) if expr.inSetConvertible => val newList = ExpressionSet(list).toSeq if (newList.length == 1 // TODO: `EqualTo` for structural types are not working. Until SPARK-24443 is addressed, // TODO: we exclude them in this rule. - && !v.isInstanceOf[CreateNamedStructLike] + && !expr.value.isInstanceOf[CreateNamedStructLike] --- End diff -- shall we use `expr.values.length == 1` here to make it more clear? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
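For context, the empty-list branch discussed in this diff rewrites `v IN ()` to `If(IsNotNull(v), FalseLiteral, Literal(null, BooleanType))`, i.e. false for a non-null value and null otherwise. A one-function Python sketch of that semantics, with `None` standing in for SQL NULL:

```python
def in_empty_list(v):
    """Semantics of Spark's OptimizeIn rewrite for `v IN ()`:
    If(IsNotNull(v), FalseLiteral, Literal(null, BooleanType))."""
    return False if v is not None else None

print(in_empty_list(1))     # False
print(in_empty_list(None))  # None
```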
[GitHub] spark pull request #22029: [SPARK-24395][SQL] IN operator should return NULL...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22029#discussion_r229700708 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -212,27 +212,34 @@ object ReorderAssociativeOperator extends Rule[LogicalPlan] { * 1. Converts the predicate to false when the list is empty and *the value is not nullable. * 2. Removes literal repetitions. - * 3. Replaces [[In (value, seq[Literal])]] with optimized version - *[[InSet (value, HashSet[Literal])]] which is much faster. + * 3. Replaces [[In (values, seq[Literal])]] with optimized version + *[[InSet (values, HashSet[Literal])]] which is much faster. */ object OptimizeIn extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { case q: LogicalPlan => q transformExpressionsDown { - case In(v, list) if list.isEmpty => -// When v is not nullable, the following expression will be optimized + case i @ In(values, list) if list.isEmpty => +// When values are not nullable, the following expression will be optimized // to FalseLiteral which is tested in OptimizeInSuite.scala -If(IsNotNull(v), FalseLiteral, Literal(null, BooleanType)) - case expr @ In(v, list) if expr.inSetConvertible => +val isNotNull = if (SQLConf.get.inFalseForNullField) { + IsNotNull(i.value) +} else { + val valuesNotNull: Seq[Expression] = values.map(IsNotNull) + valuesNotNull.tail.foldLeft(valuesNotNull.head)(And) --- End diff -- nit: `values.map(IsNotNull).reduce(And)` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
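The suggested `values.map(IsNotNull).reduce(And)` relies on `xs.tail.foldLeft(xs.head)(op)` and `xs.reduce(op)` building the same left-nested conjunction for a non-empty sequence. A quick sanity check in Python, whose `functools.reduce` folds the same way; the `And` tuples merely model expression tree nodes:

```python
from functools import reduce

def And(a, b):
    # Model an And(left, right) expression node as a tuple.
    return ("And", a, b)

xs = ["IsNotNull(a)", "IsNotNull(b)", "IsNotNull(c)"]

# xs.tail.foldLeft(xs.head)(And), spelled out:
fold_left = xs[0]
for x in xs[1:]:
    fold_left = And(fold_left, x)

# xs.reduce(And):
reduced = reduce(And, xs)

print(fold_left == reduced)  # True: both yield And(And(a, b), c)
```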
[GitHub] spark pull request #22029: [SPARK-24395][SQL] IN operator should return NULL...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22029#discussion_r229699828 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala --- @@ -339,37 +371,57 @@ case class In(value: Expression, list: Seq[Expression]) extends Predicate { * Optimized version of In clause, when all filter values of In clause are * static. */ -case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with Predicate { +case class InSet(values: Seq[Expression], hset: Set[Any]) extends InBase { require(hset != null, "hset could not be null") - override def toString: String = s"$child INSET ${hset.mkString("(", ",", ")")}" + override def toString: String = s"$value INSET ${hset.mkString("(", ",", ")")}" - @transient private[this] lazy val hasNull: Boolean = hset.contains(null) + override def children: Seq[Expression] = values - override def nullable: Boolean = child.nullable || hasNull + @transient private[this] lazy val hasNull: Boolean = { +if (isMultiValued && !SQLConf.get.inFalseForNullField) { + hset.exists(checkNullEval) +} else { + hset.contains(null) +} + } - protected override def nullSafeEval(value: Any): Any = { -if (set.contains(value)) { - true -} else if (hasNull) { + override def nullable: Boolean = { +val isValueNullable = if (isMultiValued && !SQLConf.get.inFalseForNullField) { + values.exists(_.nullable) +} else { + value.nullable +} +isValueNullable || hasNull + } + + override def eval(input: InternalRow): Any = { +val inputValue = value.eval(input) +if (checkNullEval(inputValue)) { --- End diff -- do we change behavior here? seems `null inset (null, xxx)` returns true previously. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
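For reference, under standard SQL three-valued semantics `NULL IN (NULL, x)` evaluates to NULL rather than true: `NULL = NULL` is unknown, and no element compares true. A small Python illustration of that rule (not Spark's `InSet` implementation; `None` models SQL NULL):

```python
def eq3(a, b):
    """Three-valued equality: None models SQL NULL (unknown)."""
    if a is None or b is None:
        return None
    return a == b

def in3(value, candidates):
    """`value IN (candidates)` with SQL semantics: TRUE on any definite
    match, FALSE only if every comparison is FALSE, otherwise NULL."""
    results = [eq3(value, c) for c in candidates]
    if any(r is True for r in results):
        return True
    if any(r is None for r in results):
        return None
    return False

print(in3(1, [1, None]))     # True: a definite match wins
print(in3(1, [2, None]))     # None: no match, one comparison is unknown
print(in3(None, [None, 3]))  # None, not True
```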
[GitHub] spark pull request #22029: [SPARK-24395][SQL] IN operator should return NULL...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22029#discussion_r229697077 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -1561,6 +1561,16 @@ object SQLConf { .booleanConf .createWithDefault(false) + val LEGACY_IN_FALSE_FOR_NULL_FIELD = +buildConf("spark.sql.legacy.inOperator.falseForNullField") + .internal() + .doc("When set to true (default), the IN operator returns false when comparing multiple " + +"values containing a null. When set to false, it returns null, instead. This is " + +"important especially when using NOT IN as in the second case, it filters out the rows " + +"when a null is present in a field; while in the first one, those rows are returned.") + .booleanConf + .createWithDefault(true) --- End diff -- shall we set `false` as default to follow SQL standard? and be consistent with in-subquery --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22029: [SPARK-24395][SQL] IN operator should return NULL...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22029#discussion_r229692081 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala --- @@ -202,7 +225,11 @@ case class InSubquery(values: Seq[Expression], query: ListQuery) */ // scalastyle:off line.size.limit @ExpressionDescription( - usage = "expr1 _FUNC_(expr2, expr3, ...) - Returns true if `expr` equals to any valN.", + usage = """ +expr1 _FUNC_(expr2, expr3, ...) - Returns true if `expr` equals to any valN. Otherwise, if + spark.sql.legacy.inOperator.falseForNullField is false and any of the elements or fields of --- End diff -- `any of the elements or fields ...` We should explicitly mention multi-column IN, which is different from `a in (b, c, ...)` while `a` is struct type. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22892: [SPARK-25884][SQL] Add TBLPROPERTIES and COMMENT, and us...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22892 LGTM except some minor comments
[GitHub] spark pull request #22892: [SPARK-25884][SQL] Add TBLPROPERTIES and COMMENT,...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22892#discussion_r229672667 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveShowCreateTableSuite.scala --- @@ -0,0 +1,198 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive + +import org.apache.spark.sql.{AnalysisException, ShowCreateTableSuite} +import org.apache.spark.sql.hive.test.TestHiveSingleton + +class HiveShowCreateTableSuite extends ShowCreateTableSuite with TestHiveSingleton { + + test("simple hive table") { +withTable("t1") { + sql( +s"""CREATE TABLE t1 ( + | c1 INT COMMENT 'bla', + | c2 STRING + |) + |TBLPROPERTIES ( + | 'prop1' = 'value1', + | 'prop2' = 'value2' + |) + """.stripMargin + ) + + checkCreateTable("t1") +} + } + + test("simple external hive table") { +withTempDir { dir => + withTable("t1") { +sql( + s"""CREATE TABLE t1 ( + | c1 INT COMMENT 'bla', + | c2 STRING + |) + |LOCATION '${dir.toURI}' + |TBLPROPERTIES ( + | 'prop1' = 'value1', + | 'prop2' = 'value2' + |) + """.stripMargin +) + +checkCreateTable("t1") + } +} + } + + test("partitioned hive table") { --- End diff -- do we have tests for partitioned/bucketed data source table? 
[GitHub] spark pull request #22892: [SPARK-25884][SQL] Add TBLPROPERTIES and COMMENT,...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22892#discussion_r229671459 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -1063,21 +1067,19 @@ case class ShowCreateTableCommand(table: TableIdentifier) extends RunnableComman val dataSourceOptions = metadata.storage.properties.map { case (key, value) => s"${quoteIdentifier(key)} '${escapeSingleQuotedString(value)}'" -} ++ metadata.storage.locationUri.flatMap { location => - if (metadata.tableType == MANAGED) { -// If it's a managed table, omit PATH option. Spark SQL always creates external table -// when the table creation DDL contains the PATH option. -None - } else { -Some(s"path '${escapeSingleQuotedString(CatalogUtils.URIToString(location))}'") - } } if (dataSourceOptions.nonEmpty) { builder ++= "OPTIONS (\n" builder ++= dataSourceOptions.mkString(" ", ",\n ", "\n") builder ++= ")\n" } + +if (metadata.tableType == EXTERNAL) { --- End diff -- shall we also make it a method like `showTableComment`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22898: [SPARK-25746][SQL][followup] do not add unnecessa...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22898#discussion_r229667653 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala --- @@ -124,14 +124,9 @@ object ExpressionEncoder { s"`GetColumnByOrdinal`, but there are ${getColExprs.size}") val input = GetStructField(GetColumnByOrdinal(0, schema), index) - val newDeserializer = enc.objDeserializer.transformUp { + enc.objDeserializer.transformUp { case GetColumnByOrdinal(0, _) => input } - if (schema(index).nullable) { -If(IsNull(input), Literal.create(null, newDeserializer.dataType), newDeserializer) --- End diff -- good catch! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21860 thanks, merging to master!
[GitHub] spark pull request #22788: [SPARK-25769][SQL]escape nested columns by backti...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22788#discussion_r229585323 --- Diff: sql/core/src/test/resources/sql-tests/results/columnresolution-negative.sql.out --- @@ -161,7 +161,7 @@ SELECT db1.t1.i1 FROM t1, mydb2.t1 struct<> -- !query 18 output org.apache.spark.sql.AnalysisException -cannot resolve '`db1.t1.i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7 +cannot resolve '`db1`.`t1`.`i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7 --- End diff -- > Is it okay to drop the backtick from the first identifier AFAIK both `name` and `sql` are for display/message. I think dropping backtick is fine if no ambiguous --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org