[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19086 Sure, current behavior is hive behavior. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19086 Please hold it. It means it is a behavior change. Let me consider it more. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19086 yes, correct --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19086 > use db2; > alter table db1.t2 rename to t1; After this PR, it is renamed to `db2.t1`, right? Before this PR, it is renamed to `db1.t1`, right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19115: [SPARK-21882][CORE] OutputMetrics doesn't count written ...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19115 Please create a new pr against master branch and close this one. If the issue doesn't exist in master branch, then consider backporting that fix to 2.2 branch. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19086: [SPARK-21874][SQL] Support changing database when...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/19086#discussion_r136747432 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -569,46 +569,51 @@ class SessionCatalog( /** * Rename a table. * - * If a database is specified in `oldName`, this will rename the table in that database. + * If the database specified in `newName` is different from the one specified in `oldName`, + * It will result in moving table across databases. + * * If no database is specified, this will first attempt to rename a temporary table with - * the same name, then, if that does not exist, rename the table in the current database. + * the same name, then, if that does not exist, current database will be used for locating + * `oldName` or `newName`. --- End diff -- So should I make the comment here unchanged? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19086: [SPARK-21874][SQL] Support changing database when...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19086#discussion_r136747159 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala --- @@ -418,6 +439,42 @@ abstract class SessionCatalogSuite extends AnalysisTest { } } + test("rename global temp view") { +withBasicCatalog { catalog => + val globalTempTable = Range(1, 10, 2, 10) + catalog.createGlobalTempView("tbl1", globalTempTable, overrideIfExists = false) + assert(catalog.getGlobalTempView("tbl1") == Option(globalTempTable)) + assert(catalog.externalCatalog.listTables("db2").toSet == Set("tbl1", "tbl2")) + // If database is not specified, global temp view will not be renamed + catalog.setCurrentDatabase("db1") + val thrown1 = intercept[AnalysisException] { +catalog.renameTable(TableIdentifier("tbl1"), TableIdentifier("tblone")) + } + assert(thrown1.getMessage.contains("Table or view 'tbl1' not found in database 'db1'")) + catalog.setCurrentDatabase("db2") + catalog.renameTable(TableIdentifier("tbl1"), TableIdentifier("tblone")) + assert(catalog.getGlobalTempView("tbl1") == Option(globalTempTable)) + assert(catalog.externalCatalog.listTables("db2").toSet == Set("tblone", "tbl2")) + // Moving global temp view to another database is forbidden + val thrown2 = intercept[AnalysisException] { +catalog.renameTable( + TableIdentifier("tbl1", Some("global_temp")), TableIdentifier("tbl3", Some("db2"))) + } + assert(thrown2.getMessage.contains("Cannot change database of table 'tbl1'")) + // Moving table from regular database to be a global temp view is forbidden + val thrown3 = intercept[AnalysisException] { +catalog.renameTable( + TableIdentifier("tbl2", Some("db2")), TableIdentifier("tbltwo", Some("global_temp"))) + } --- End diff -- Normally, we put `.getMessage` here. Thus, we can make the next line shorter. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17014 **[Test build #81372 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81372/testReport)** for PR 17014 at commit [`df4d263`](https://github.com/apache/spark/commit/df4d263d21c3e4413b4d74aaabcd3a10348a0001). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17014 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81372/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17014 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r136746833 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ExpandExec.scala --- @@ -89,6 +89,8 @@ case class ExpandExec( child.asInstanceOf[CodegenSupport].inputRDDs() } + override protected def doConsumeInChainOfFunc: Boolean = false --- End diff -- yea, probably we might need to describe more about exceptional cases we can't use this optimization like `HashAggregateExec` in https://github.com/apache/spark/pull/18931/files#diff-28cb12941b992ff680c277c651b59aa0R204 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17014 **[Test build #81372 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81372/testReport)** for PR 17014 at commit [`df4d263`](https://github.com/apache/spark/commit/df4d263d21c3e4413b4d74aaabcd3a10348a0001). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r136746468 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SortExec.scala --- @@ -177,6 +177,8 @@ case class SortExec( """.stripMargin.trim } + override protected def doConsumeInChainOfFunc: Boolean = false --- End diff -- yea, other guys might give good suggestions on the naming... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19086 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19086 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81370/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19086 **[Test build #81370 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81370/testReport)** for PR 19086 at commit [`d4a0f97`](https://github.com/apache/spark/commit/d4a0f97b9600dd661d81a79e71019a781eb812ff). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19086: [SPARK-21874][SQL] Support changing database when...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19086#discussion_r136745765 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -569,46 +569,51 @@ class SessionCatalog( /** * Rename a table. * - * If a database is specified in `oldName`, this will rename the table in that database. + * If the database specified in `newName` is different from the one specified in `oldName`, + * It will result in moving table across databases. + * * If no database is specified, this will first attempt to rename a temporary table with - * the same name, then, if that does not exist, rename the table in the current database. + * the same name, then, if that does not exist, current database will be used for locating + * `oldName` or `newName`. --- End diff -- This is wrong, right? This is for `temp view` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19116 cc @srowen and @vanzin, could you take a look please when you have some time? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19116#discussion_r136745412 --- Diff: scalastyle-config.xml --- @@ -268,10 +268,7 @@ This file is divided into 3 sections: - -^Override$ -override modifier should be used instead of @java.lang.Override. - + --- End diff -- This removes the workaround we used and use `org.scalastyle.scalariform.OverrideJavaChecker` - `https://github.com/scalastyle/scalastyle/issues/214`. I manually tested with `@Override`: Before: ``` [error] ...:435:3: override modifier should be used instead of @java.lang.Override. ``` After: ``` [error] ...:435:2: override keyword should be used instead of @java.lang.Override ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19116#discussion_r136745459 --- Diff: project/plugins.sbt --- @@ -7,8 +7,7 @@ addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.1.0") // sbt 1.0.0 support: https://github.com/jrudolph/sbt-dependency-graph/issues/134 addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.8.2") -// need to make changes to uptake sbt 1.0 support in "org.scalastyle" %% "scalastyle-sbt-plugin" % "0.9.0" --- End diff -- Looks we are okay to remove this - https://github.com/scalastyle/scalastyle-sbt-plugin/blob/v1.0.0/README.md --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19116#discussion_r136745559 --- Diff: project/SparkBuild.scala --- @@ -163,14 +163,15 @@ object SparkBuild extends PomBuild { val configUrlV = scalastyleConfigUrl.in(config).value val streamsV = streams.in(config).value val failOnErrorV = true +val failOnWarningV = true val scalastyleTargetV = scalastyleTarget.in(config).value val configRefreshHoursV = scalastyleConfigRefreshHours.in(config).value val targetV = target.in(config).value val configCacheFileV = scalastyleConfigUrlCacheFile.in(config).value logger.info(s"Running scalastyle on ${name.value} in ${config.name}") -Tasks.doScalastyle(args, configV, configUrlV, failOnErrorV, scalaSourceV, scalastyleTargetV, - streamsV, configRefreshHoursV, targetV, configCacheFileV) +Tasks.doScalastyle(args, configV, configUrlV, failOnErrorV, failOnWarningV, scalaSourceV, --- End diff -- This change looks required - https://github.com/scalastyle/scalastyle-sbt-plugin/commit/cbbb9c9d1c33b9e90bf187edb0dd2d6b4eceaa21 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19086: [SPARK-21874][SQL] Support changing database when...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19086#discussion_r136745597 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -569,46 +569,51 @@ class SessionCatalog( /** * Rename a table. * - * If a database is specified in `oldName`, this will rename the table in that database. + * If the database specified in `newName` is different from the one specified in `oldName`, + * It will result in moving table across databases. --- End diff -- `It will result` -> `it results` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19116 **[Test build #81371 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81371/testReport)** for PR 19116 at commit [`2447fd0`](https://github.com/apache/spark/commit/2447fd0e152ace4dd074a92bd0d3cdc638b09b1a). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/19116 [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0. ## What changes were proposed in this pull request? 1.0.0 fixes an issue with import order, explicit type for public methods, line length limitation and comment validation: ``` [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/Main.scala:50:16: Are you sure you want to println? If yes, wrap the code block with [error] // scalastyle:off println [error] println(...) [error] // scalastyle:on println [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:49: File line length exceeds 100 characters [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:22:21: Are you sure you want to println? If yes, wrap the code block with [error] // scalastyle:off println [error] println(...) [error] // scalastyle:on println [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:35:6: Public method must have explicit type [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:51:6: Public method must have explicit type [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:93:15: Public method must have explicit type [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:98:15: Public method must have explicit type [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:47:2: Insert a space after the start of the comment [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:26:43: JavaDStream should come before JavaDStreamLike. ``` This PR also fixes the workaround added in SPARK-16877 for `org.scalastyle.scalariform.OverrideJavaChecker` feature, added from 0.9.0. ## How was this patch tested? Manually tested. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark scalastyle-1.0.0 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19116.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19116 commit 2447fd0e152ace4dd074a92bd0d3cdc638b09b1a Author: hyukjinkwonDate: 2017-09-04T04:26:19Z Upgrade scalastyle to 1.0.0. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18902 hmm... that's interesting. So I found performance gap between dataframe codegen aggregation and the simple RDD aggregation. I will discuss with SQL team for this later. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 @WeichenXu123 No, I only cache the DataFrame. And the RDD-Version is [here](https://github.com/apache/spark/pull/18902/commits/8daffc9007c65f04e005ffe5dcfbeca634480465). I use the same testsuit above to test those impls. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19113: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19113 How about the other popular open source projects? Do you know whether which projects are using Univocity 2.5? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17014 @zhengruifeng `KMeans` regarded as a bugfix(SPARK-21799) because the double-cache issue is introduced in 2.2 and cause perf regression. Other algos also have the same issue, but the issue exists in those algos for a long time and they related to `Predictor` so it is not so easy to fix, we can leave them in here and do more discussion. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19113: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19113 Any performance measure from 2.2 to 2.5? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18865 Yea, it is and that's what we discussed - https://github.com/apache/spark/pull/18865#discussion_r131887972. In that way, I was thinking https://github.com/apache/spark/pull/18865#issuecomment-326858521 case could make sense as no columns are selected. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18869: [SPARK-21654][SQL] Complement SQL predicates expr...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18869 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18869: [SPARK-21654][SQL] Complement SQL predicates expression ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18869 Thanks @HyukjinKwon @gatorsmile --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18869: [SPARK-21654][SQL] Complement SQL predicates expression ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18869 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18869: [SPARK-21654][SQL] Complement SQL predicates expression ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18869 Thanks! Merging to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18865 But doesn't this behavior that `_corrupt_record` content depends on the selected json fields is designed at the first? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18865 > the usage can cause weird results too > `_corrupt_record` should return all the records that Spark SQL fail to parse I think another point should be, this issue still exists even if we disallow selecting `_corrupt_record` alone here. If we select few columns together with `_corrupt_record`, this case will still exist and results won't be consistent. For example, ``` echo '{"fieldA": 1, "fieldB": 2} {"fieldA": 3, "fieldB": 4} {"fieldA": "5", "fieldB": 6}' >/tmp/sample.json ``` ```scala val file = "/tmp/sample.json" val dfFromFile = spark.read.schema("fieldA BYTE, fieldB BYTE, _corrupt_record STRING").json(file) dfFromFile.select($"fieldA", $"_corrupt_record").show() dfFromFile.select($"fieldB", $"_corrupt_record").show() ``` ``` scala> dfFromFile.select($"fieldA", $"_corrupt_record").show() +--++ |fieldA| _corrupt_record| +--++ | 1|null| | 3|null| | null| {"fieldA": "5", ...| +--++ scala> dfFromFile.select($"fieldB", $"_corrupt_record").show() +--+---+ |fieldB|_corrupt_record| +--+---+ | 2| null| | 4| null| | 6| null| +--+---+ ``` If we should disallow, I think we should rather deprecate this option first with some warnings, or explain this behaviour in the documentation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r136743435 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SortExec.scala --- @@ -177,6 +177,8 @@ case class SortExec( """.stripMargin.trim } + override protected def doConsumeInChainOfFunc: Boolean = false --- End diff -- I revised this variable name in times, but didn't find a good name to convey its meaning. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19115: [SPARK-21882][CORE] OutputMetrics doesn't count written ...
Github user awarrior commented on the issue: https://github.com/apache/spark/pull/19115 @jerryshao hi~ I have modified this PR. But this patch just work in 2.2.0 (some changes apply now). I want to confirm whether I need to create a new PR. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19113: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19113 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19113: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19113 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81368/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19113: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19113 **[Test build #81368 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81368/testReport)** for PR 19113 at commit [`fa7eb51`](https://github.com/apache/spark/commit/fa7eb514cfd91bb405fd74680b08d5865911e3f0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18865 BTW, this change should be put into the migration guide of Spark SQL. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18865 From the viewpoints of the end users of Spark, `dfFromFile.select($"_corrupt_record").show()` might not return all the expected records. ``_corrupt_record`` should return all the records that Spark SQL fail to parse. If we are unable to output the expected results, we need to disallow users to do it. cc @cloud-fan --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r136742515 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ExpandExec.scala --- @@ -89,6 +89,8 @@ case class ExpandExec( child.asInstanceOf[CodegenSupport].inputRDDs() } + override protected def doConsumeInChainOfFunc: Boolean = false --- End diff -- The `doConsume` produces something like: |for (int $i = 0; $i < ${projections.length}; $i ++) { | switch ($i) { |${cases.mkString("\n").trim} | } | $numOutput.add(1); | ${consume(ctx, outputColumns)} |} So the consume logic of its parent node is actually wrapped in a local for-loop. It has the same effect as not chain the next consume. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r136742331 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -149,14 +149,146 @@ trait CodegenSupport extends SparkPlan { ctx.freshNamePrefix = parent.variablePrefix val evaluated = evaluateRequiredVariables(output, inputVars, parent.usedInputs) + +// Under certain conditions, we can put the logic to consume the rows of this operator into +// another function. So we can prevent a generated function too long to be optimized by JIT. +// The conditions: +// 1. The parent uses all variables in output. we can't defer variable evaluation when consume +//in another function. +// 2. The output variables are not empty. If it's empty, we don't bother to do that. +// 3. We don't use row variable. The construction of row uses deferred variable evaluation. We +//can't do it. +// 4. The number of output variables must less than maximum number of parameters in Java method +//declaration. +val requireAllOutput = output.forall(parent.usedInputs.contains(_)) +val consumeFunc = + if (row == null && outputVars.nonEmpty && requireAllOutput && outputVars.length < 255) { +constructDoConsumeFunction(ctx, inputVars) + } else { +parent.doConsume(ctx, inputVars, rowVar) + } s""" |${ctx.registerComment(s"CONSUME: ${parent.simpleString}")} |$evaluated - |${parent.doConsume(ctx, inputVars, rowVar)} + |$consumeFunc + """.stripMargin + } + + /** + * To prevent concatenated function growing too long to be optimized by JIT. Instead of inlining, + * we may put the consume logic of parent operator into a function and set this flag to `true`. + * The parent operator can know if its consume logic is inlined or in separated function. + */ + private var doConsumeInFunc: Boolean = false + + /** + * Returning true means we have at least one consume logic from child operator or this operator is + * separated in a function. If this is `true`, this operator shouldn't use `continue` statement to + * continue on next row, because its generated codes aren't enclosed in main while-loop. + * + * For example, we have generated codes for a query plan like: + * Op1Exec + * Op2Exec + * Op3Exec + * + * If we put the consume code of Op2Exec into a separated function, the generated codes are like: + * while (...) { + * ... // logic of Op3Exec. + * Op2Exec_doConsume(...); + * } + * private boolean Op2Exec_doConsume(...) { + * ... // logic of Op2Exec to consume rows. + * } + * For now, `doConsumeInChainOfFunc` of Op2Exec will be `true`. + * + * Notice for some operators like `HashAggregateExec`, it doesn't chain previous consume functions + * but begins with its produce framework. We should override `doConsumeInChainOfFunc` to return + * `false`. + */ + protected def doConsumeInChainOfFunc: Boolean = { +val codegenChildren = children.map(_.asInstanceOf[CodegenSupport]) +doConsumeInFunc || codegenChildren.exists(_.doConsumeInChainOfFunc) + } + + /** + * The actual java statement this operator should use if there is a need to continue on next row + * in its `doConsume` codes. + * + * while (...) { + * ... // logic of Op3Exec. + * Op2Exec_doConsume(...); + * } + * private boolean Op2Exec_doConsume(...) { + * ... // logic of Op2Exec to consume rows. + * continue; // Wrong. We can't use continue with the while-loop. + * } + * In above code, we can't use `continue` in `Op2Exec_doConsume`. + * + * Instead, we do something like: + * while (...) { + * ... // logic of Op3Exec. + * boolean continueForLoop = Op2Exec_doConsume(...); + * if (continueForLoop) continue; + * } + * private boolean Op2Exec_doConsume(...) { + * ... // logic of Op2Exec to consume rows. + * return true; // When we need to do continue, we return true. + * } + */ + protected def continueStatementInDoConsume: String = if (doConsumeInChainOfFunc) { +"return true;"; --- End diff -- Thanks. I'll fix it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r136742282 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -149,14 +149,146 @@ trait CodegenSupport extends SparkPlan { ctx.freshNamePrefix = parent.variablePrefix val evaluated = evaluateRequiredVariables(output, inputVars, parent.usedInputs) + +// Under certain conditions, we can put the logic to consume the rows of this operator into +// another function. So we can prevent a generated function too long to be optimized by JIT. +// The conditions: +// 1. The parent uses all variables in output. we can't defer variable evaluation when consume +//in another function. +// 2. The output variables are not empty. If it's empty, we don't bother to do that. +// 3. We don't use row variable. The construction of row uses deferred variable evaluation. We +//can't do it. +// 4. The number of output variables must less than maximum number of parameters in Java method +//declaration. +val requireAllOutput = output.forall(parent.usedInputs.contains(_)) +val consumeFunc = + if (row == null && outputVars.nonEmpty && requireAllOutput && outputVars.length < 255) { +constructDoConsumeFunction(ctx, inputVars) --- End diff -- I was thinking to check it. But the whole-stage codegen is a non-breaking processing which produce/consume calls are embeded together. You don't have a break to check the function length here. Actually I think it should have no negative effect to split consume functions always. From the benchmarking numbers, looks it shows no harm to normal queries. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r136742019 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -149,14 +149,146 @@ trait CodegenSupport extends SparkPlan { ctx.freshNamePrefix = parent.variablePrefix val evaluated = evaluateRequiredVariables(output, inputVars, parent.usedInputs) + +// Under certain conditions, we can put the logic to consume the rows of this operator into +// another function. So we can prevent a generated function too long to be optimized by JIT. +// The conditions: +// 1. The parent uses all variables in output. we can't defer variable evaluation when consume +//in another function. +// 2. The output variables are not empty. If it's empty, we don't bother to do that. +// 3. We don't use row variable. The construction of row uses deferred variable evaluation. We --- End diff -- The same reason as above. The variables used to evaluate the row can be out of scope because row construction is deferred until it is used actually. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r136741637 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -149,14 +149,146 @@ trait CodegenSupport extends SparkPlan { ctx.freshNamePrefix = parent.variablePrefix val evaluated = evaluateRequiredVariables(output, inputVars, parent.usedInputs) + +// Under certain conditions, we can put the logic to consume the rows of this operator into +// another function. So we can prevent a generated function too long to be optimized by JIT. +// The conditions: +// 1. The parent uses all variables in output. we can't defer variable evaluation when consume --- End diff -- E.g., a `ProjectExec` node doesn't necessarily evaluate all its output variables before continuing `doConsume` of its parent node. It can defer the evaluation until the variables are needed in the parent node's consume logic. Once a variable's evaluation is deferred, and if we create a consume function, the variable will be evaluated in the function. But now the references of this variable is out of scope. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19050: [SPARK-21835][SQL] RewritePredicateSubquery should not p...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19050 ping @cloud-fan @hvanhovell This blocks the #18956 going forward, can you help review this change? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18869: [SPARK-21654][SQL] Complement SQL predicates expression ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18869 @HyukjinKwon @gatorsmile Any more comments on this? The added tests should be enough. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r136740379 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala --- @@ -534,4 +534,132 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef } } } + + test("insert overwrite to dir from hive metastore table") { +withTempDir { dir => + val path = dir.toURI.getPath + + sql(s"INSERT OVERWRITE LOCAL DIRECTORY '${path}' SELECT * FROM src where key < 10") + + sql( +s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |STORED AS orc + |SELECT * FROM src where key < 10 + """.stripMargin) + + // use orc data source to check the data of path is right. + withTempView("orc_source") { +sql( + s""" + |CREATE TEMPORARY VIEW orc_source + |USING org.apache.spark.sql.hive.orc + |OPTIONS ( + | PATH '${dir.getCanonicalPath}' + |) + """.stripMargin) + +checkAnswer( + sql("select * from orc_source"), + sql("select * from src where key < 10").collect()) + } +} + } + + test("insert overwrite to local dir from temp table") { +withTempView("test_insert_table") { + spark.range(10).selectExpr("id", "id AS str").createOrReplaceTempView("test_insert_table") + + withTempDir { dir => +val path = dir.toURI.getPath + +sql( + s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' + |SELECT * FROM test_insert_table + """.stripMargin) + +sql( + s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |STORED AS orc + |SELECT * FROM test_insert_table + """.stripMargin) + +// use orc data source to check the data of path is right. +withTempView("orc_source") { + sql( +s""" + |CREATE TEMPORARY VIEW orc_source + |USING org.apache.spark.sql.hive.orc + |OPTIONS ( + | PATH '${dir.getCanonicalPath}' + |) + """.stripMargin) + + checkAnswer( +sql("select * from orc_source"), +sql("select * from test_insert_table").collect()) +} + } +} + } + + test("insert overwrite to dir from temp table") { +withTempView("test_insert_table") { + spark.range(10).selectExpr("id", "id AS str").createOrReplaceTempView("test_insert_table") + + withTempDir { dir => +val pathUri = dir.toURI + +sql( + s""" + |INSERT OVERWRITE DIRECTORY '${pathUri}' + |ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' + |SELECT * FROM test_insert_table + """.stripMargin) + +sql( + s""" + |INSERT OVERWRITE DIRECTORY '${pathUri}' + |STORED AS orc + |SELECT * FROM test_insert_table + """.stripMargin) + +// use orc data source to check the data of path is right. +withTempView("orc_source") { + sql( +s""" + |CREATE TEMPORARY VIEW orc_source + |USING org.apache.spark.sql.hive.orc + |OPTIONS ( + | PATH '${dir.getCanonicalPath}' + |) + """.stripMargin) + + checkAnswer( +sql("select * from orc_source"), --- End diff -- You do not need the temp view `orc_source `. Below is an example. ```Scala checkAnswer( spark.read.orc(dir.getCanonicalPath), sql("select * from test_insert_table")) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19114: Update PairRDDFunctions.scala
Github user awarrior closed the pull request at: https://github.com/apache/spark/pull/19114 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r136739979 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala --- @@ -534,4 +534,132 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef } } } + + test("insert overwrite to dir from hive metastore table") { +withTempDir { dir => + val path = dir.toURI.getPath + + sql(s"INSERT OVERWRITE LOCAL DIRECTORY '${path}' SELECT * FROM src where key < 10") + + sql( +s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |STORED AS orc + |SELECT * FROM src where key < 10 + """.stripMargin) + + // use orc data source to check the data of path is right. + withTempView("orc_source") { +sql( + s""" + |CREATE TEMPORARY VIEW orc_source + |USING org.apache.spark.sql.hive.orc + |OPTIONS ( + | PATH '${dir.getCanonicalPath}' + |) + """.stripMargin) + +checkAnswer( + sql("select * from orc_source"), + sql("select * from src where key < 10").collect()) + } +} + } + + test("insert overwrite to local dir from temp table") { +withTempView("test_insert_table") { + spark.range(10).selectExpr("id", "id AS str").createOrReplaceTempView("test_insert_table") + + withTempDir { dir => +val path = dir.toURI.getPath + +sql( + s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' + |SELECT * FROM test_insert_table + """.stripMargin) + +sql( + s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |STORED AS orc + |SELECT * FROM test_insert_table + """.stripMargin) + +// use orc data source to check the data of path is right. +withTempView("orc_source") { + sql( +s""" + |CREATE TEMPORARY VIEW orc_source + |USING org.apache.spark.sql.hive.orc + |OPTIONS ( + | PATH '${dir.getCanonicalPath}' + |) + """.stripMargin) + + checkAnswer( +sql("select * from orc_source"), +sql("select * from test_insert_table").collect()) --- End diff -- The same here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r136739964 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala --- @@ -534,4 +534,132 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef } } } + + test("insert overwrite to dir from hive metastore table") { +withTempDir { dir => + val path = dir.toURI.getPath + + sql(s"INSERT OVERWRITE LOCAL DIRECTORY '${path}' SELECT * FROM src where key < 10") + + sql( +s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |STORED AS orc + |SELECT * FROM src where key < 10 + """.stripMargin) + + // use orc data source to check the data of path is right. + withTempView("orc_source") { +sql( + s""" + |CREATE TEMPORARY VIEW orc_source + |USING org.apache.spark.sql.hive.orc + |OPTIONS ( + | PATH '${dir.getCanonicalPath}' + |) + """.stripMargin) + +checkAnswer( + sql("select * from orc_source"), + sql("select * from src where key < 10").collect()) --- End diff -- Nit: `.collect()` can be removed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r136739990 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala --- @@ -534,4 +534,132 @@ class InsertIntoHiveTableSuite extends QueryTest with TestHiveSingleton with Bef } } } + + test("insert overwrite to dir from hive metastore table") { +withTempDir { dir => + val path = dir.toURI.getPath + + sql(s"INSERT OVERWRITE LOCAL DIRECTORY '${path}' SELECT * FROM src where key < 10") + + sql( +s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |STORED AS orc + |SELECT * FROM src where key < 10 + """.stripMargin) + + // use orc data source to check the data of path is right. + withTempView("orc_source") { +sql( + s""" + |CREATE TEMPORARY VIEW orc_source + |USING org.apache.spark.sql.hive.orc + |OPTIONS ( + | PATH '${dir.getCanonicalPath}' + |) + """.stripMargin) + +checkAnswer( + sql("select * from orc_source"), + sql("select * from src where key < 10").collect()) + } +} + } + + test("insert overwrite to local dir from temp table") { +withTempView("test_insert_table") { + spark.range(10).selectExpr("id", "id AS str").createOrReplaceTempView("test_insert_table") + + withTempDir { dir => +val path = dir.toURI.getPath + +sql( + s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' + |SELECT * FROM test_insert_table + """.stripMargin) + +sql( + s""" + |INSERT OVERWRITE LOCAL DIRECTORY '${path}' + |STORED AS orc + |SELECT * FROM test_insert_table + """.stripMargin) + +// use orc data source to check the data of path is right. +withTempView("orc_source") { + sql( +s""" + |CREATE TEMPORARY VIEW orc_source + |USING org.apache.spark.sql.hive.orc + |OPTIONS ( + | PATH '${dir.getCanonicalPath}' + |) + """.stripMargin) + + checkAnswer( +sql("select * from orc_source"), +sql("select * from test_insert_table").collect()) +} + } +} + } + + test("insert overwrite to dir from temp table") { +withTempView("test_insert_table") { + spark.range(10).selectExpr("id", "id AS str").createOrReplaceTempView("test_insert_table") + + withTempDir { dir => +val pathUri = dir.toURI + +sql( + s""" + |INSERT OVERWRITE DIRECTORY '${pathUri}' + |ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' + |SELECT * FROM test_insert_table + """.stripMargin) + +sql( + s""" + |INSERT OVERWRITE DIRECTORY '${pathUri}' + |STORED AS orc + |SELECT * FROM test_insert_table + """.stripMargin) + +// use orc data source to check the data of path is right. +withTempView("orc_source") { + sql( +s""" + |CREATE TEMPORARY VIEW orc_source + |USING org.apache.spark.sql.hive.orc + |OPTIONS ( + | PATH '${dir.getCanonicalPath}' + |) + """.stripMargin) + + checkAnswer( +sql("select * from orc_source"), +sql("select * from test_insert_table").collect()) --- End diff -- The same here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18902 +1 for using Dataframe-based version code. @zhengruifeng One thing I want to confirm is that, I check your testing code, both RDD-based version and Dataframe-based version code will both cost on deserialization: ``` ... val df = spark.createDataFrame(rows, struct) df.persist() df.count() ... // do `imputer.fit` ``` when running `imputer.fit`, it will extract the required columns from the cached input dataframe, and then you compare the perf between `RDD.aggregate` and `dataframe avg`, they both need to deserialize data from input and then do computation, and `dataframe avg` will take advantage of codegen and should be faster. But here the test show that RDD version is slower than Dataframe version, it is not very reasonable, so I want to confirm: in your RDD version testing, do you cache again when get `RDD` from the input `Dataframe`? If not, your testing has no problem, I will guess there exists other performance issue in SQL layer and cc @cloud-fan to take a look. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19086 **[Test build #81370 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81370/testReport)** for PR 19086 at commit [`d4a0f97`](https://github.com/apache/spark/commit/d4a0f97b9600dd661d81a79e71019a781eb812ff). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19086 @gatorsmile I updated, let me known if there's still comments not resolved. Thanks again for review. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18865 @HyukjinKwon 's provided use case looks pretty fair. The corrupt record is the whole line which doesn't follow the json format. It is kind of different to the corrupt record case that some json fields can't be correctly converted to desired data type. This two kind of corrupt records can be mixed in one json file. E.g., echo '{"field": 1 {"field" 2} {"field": 3} {"field": "4"}' >/tmp/sample.json scala> dfFromFile.show(false) +-+---+ |field|_corrupt_record| +-+---+ |null |{"field": 1| |null | {"field" 2} | |3|null | |null |{"field": "4"} | +-+---+ scala> dfFromFile.select($"_corrupt_record").show() +---+ |_corrupt_record| +---+ |{"field": 1| |{"field" 2}| | null| | null| +---+ At least we should clearly explain the difference in the error message. Maybe something like: The query to execute now requires only `_corrupt_record` in effect after optimization. When there are corrupt records due to json field conversion error, those corrupt records might not correctly generated in the end, because now no other json fields are required along actually. In order to obtain most accurate result, we recommend users to cache or save the dataset before the queries requiring only `_corrupt_record`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r136739126 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveDirCommand.scala --- @@ -0,0 +1,145 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import scala.language.existentials + +import org.apache.hadoop.fs.{FileSystem, Path} +import org.apache.hadoop.hive.common.FileUtils +import org.apache.hadoop.hive.ql.plan.TableDesc +import org.apache.hadoop.hive.serde.serdeConstants +import org.apache.hadoop.hive.serde2.`lazy`.LazySimpleSerDe +import org.apache.hadoop.mapred._ + +import org.apache.spark.sql.{Row, SparkSession} +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable} +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.execution.SparkPlan +import org.apache.spark.sql.hive.client.HiveClientImpl + +/** + * Command for writing the results of `query` to file system. + * + * The syntax of using this command in SQL is: + * {{{ + * INSERT OVERWRITE [LOCAL] DIRECTORY + * path + * [ROW FORMAT row_format] + * [STORED AS file_format] + * SELECT ... + * }}} + * + * @param isLocal whether the path specified in `storage` is a local directory + * @param storage storage format used to describe how the query result is stored. + * @param query the logical plan representing data to write to + * @param overwrite whthere overwrites existing directory + */ +case class InsertIntoHiveDirCommand( +isLocal: Boolean, +storage: CatalogStorageFormat, +query: LogicalPlan, +overwrite: Boolean) extends SaveAsHiveFile { + + override def children: Seq[LogicalPlan] = query :: Nil + + override def run(sparkSession: SparkSession, children: Seq[SparkPlan]): Seq[Row] = { +assert(children.length == 1) +assert(storage.locationUri.nonEmpty) + +//val Array(cols, types) = children.head.output.foldLeft(Array("", "")) { case (r, a) => +// r(0) = r(0) + a.name + "," +// r(1) = r(1) + a.dataType.catalogString + ":" +// r +//} +// +//val properties = new Properties() +//properties.put("columns", cols.dropRight(1)) +//properties.put("columns.types", types.dropRight(1)) +//properties.put(serdeConstants.SERIALIZATION_LIB, +// storage.serde.getOrElse(classOf[LazySimpleSerDe].getName)) +// +//import scala.collection.JavaConverters._ +//properties.putAll(storage.properties.asJava) +// +//val tableDesc = new TableDesc( +// Utils.classForName(storage.inputFormat.get).asInstanceOf[Class[_ <: InputFormat[_, _]]], +// Utils.classForName(storage.outputFormat.get), +// properties +//) --- End diff -- Yes. The following dummy table looks fine to me. Could you remove the above lines? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19079: [SPARK-21859][CORE] Fix SparkFiles.get failed on ...
Github user lgrcyanny closed the pull request at: https://github.com/apache/spark/pull/19079 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19102: [SPARK-21859][CORE] Fix SparkFiles.get failed on ...
Github user lgrcyanny commented on a diff in the pull request: https://github.com/apache/spark/pull/19102#discussion_r136738525 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -496,7 +496,7 @@ object SparkSubmit extends CommandLineUtils with Logging { sysProp = "spark.executor.memory"), OptionAssigner(args.totalExecutorCores, STANDALONE | MESOS, ALL_DEPLOY_MODES, sysProp = "spark.cores.max"), - OptionAssigner(args.files, LOCAL | STANDALONE | MESOS, ALL_DEPLOY_MODES, + OptionAssigner(args.files, ALL_CLUSTER_MGRS, ALL_DEPLOY_MODES, --- End diff -- For yarn-client mode, --files are are already added to "spark.yarn.dist.files", I agree with you that, just addFile in SparkContext for yarn-client mode for "spark.yarn.dist.files". BTW, I will fix the doc for Spark-Submit either. Thanks @vanzin --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19115: Update PairRDDFunctions.scala
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19115 @awarrior please follow the [doc](https://spark.apache.org/contributing.html) to submit patch. You need to change the PR title like other PRs by adding JIRA id and component tag. Add the details of your problem in PR description, not just a simple JIRA link. Submit PR against master branch. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r136738263 --- Diff: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 --- @@ -241,11 +241,21 @@ query : ctes? queryNoWith ; -insertInto +insertIntoTable --- End diff -- These can be combined to ``` insertInto : INSERT OVERWRITE TABLE tableIdentifier (partitionSpec (IF NOT EXISTS)?)? #insertIntoTable | INSERT INTO TABLE? tableIdentifier partitionSpec? #insertIntoTable | INSERT OVERWRITE LOCAL? DIRECTORY path=STRING rowFormat? createFileFormat?#insertOverwriteHiveDir | INSERT OVERWRITE LOCAL? DIRECTORY (path=STRING)? tableProvider (OPTIONS options=tablePropertyList)? #insertOverwriteDir ; ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r136738182 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala --- @@ -178,11 +179,50 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging } /** - * Add an INSERT INTO [TABLE]/INSERT OVERWRITE TABLE operation to the logical plan. + * Parameters used for writing query to a directory: (isLocal, CatalogStorageFormat, provider). + */ + type InsertDirParams = (Boolean, CatalogStorageFormat, Option[String]) + + /** + * Add an + * INSERT INTO [TABLE] or + * INSERT OVERWRITE TABLE or + * INSERT OVERWRITE [LOCAL] DIRECTORY + * operation to logical plan */ private def withInsertInto( ctx: InsertIntoContext, query: LogicalPlan): LogicalPlan = withOrigin(ctx) { +assert(ctx.children.size == 1) --- End diff -- It can be rewritten to ```Scala private def withInsertInto( ctx: InsertIntoContext, query: LogicalPlan): LogicalPlan = withOrigin(ctx) { ctx match { case c: InsertIntoTableContext => withInsertIntoTable(c, query) case dir: InsertOverwriteDirContext => val (isLocal, storage, provider) = visitInsertOverwriteDir(dir) InsertIntoDir(isLocal, storage, provider, query, overwrite = true) case hiveDir: InsertOverwriteHiveDirContext => val (isLocal, storage, provider) = visitInsertOverwriteHiveDir(hiveDir) InsertIntoDir(isLocal, storage, provider, query, overwrite = true) } } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/17014 @WeichenXu123 @jkbradley I am curious about why `ml.Kmeans` is special that it needs a separate PR --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18538 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81369/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18538 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18538 **[Test build #81369 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81369/testReport)** for PR 18538 at commit [`9abe9e5`](https://github.com/apache/spark/commit/9abe9e560ae12405a480eab325f7a707e8cb1f14). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19115: Update PairRDDFunctions.scala
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19115 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19114: Update PairRDDFunctions.scala
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19114 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19115: Update PairRDDFunctions.scala
GitHub user awarrior opened a pull request: https://github.com/apache/spark/pull/19115 Update PairRDDFunctions.scala [https://issues.apache.org/jira/browse/SPARK-21882](url) You can merge this pull request into a Git repository by running: $ git pull https://github.com/awarrior/spark branch-2.2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19115.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19115 commit a096970b2f2cfa497a96870ebd26f83a106b4e07 Author: JarvisDate: 2017-09-04T02:48:35Z Update PairRDDFunctions.scala spark-21882 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17014: [SPARK-18608][ML] Fix double-caching in ML algori...
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/17014#discussion_r136737427 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -304,16 +304,14 @@ class KMeans @Since("1.5.0") ( override def fit(dataset: Dataset[_]): KMeansModel = { transformSchema(dataset.schema, logging = true) -val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE +val handlePersistence = dataset.storageLevel == StorageLevel.NONE val instances: RDD[OldVector] = dataset.select(col($(featuresCol))).rdd.map { case Row(point: Vector) => OldVectors.fromML(point) } -if (handlePersistence) { - instances.persist(StorageLevel.MEMORY_AND_DISK) -} +if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK) --- End diff -- OK, I will revert those small changes --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19114: Update PairRDDFunctions.scala
GitHub user awarrior opened a pull request: https://github.com/apache/spark/pull/19114 Update PairRDDFunctions.scala [https://issues.apache.org/jira/browse/SPARK-21882](url) You can merge this pull request into a Git repository by running: $ git pull https://github.com/awarrior/spark branch-1.6 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19114.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19114 commit e7e42802b07c5148ba02761af1edd2ee81d6ef95 Author: JarvisDate: 2017-09-04T02:52:01Z Update PairRDDFunctions.scala spark-21882 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18957: [SPARK-21744][CORE] Add retry logic for new broad...
Github user caneGuy closed the pull request at: https://github.com/apache/spark/pull/18957 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19111: [SPARK-21801][SPARKR][TEST][WIP] set random seed for pre...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19111 I found `NaiveBayes` also possible to fail. Founded here #18538 . Hope this can works! https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81316/console ``` 2. Error: spark.naiveBayes (@test_mllib_classification.R#482) -- java.lang.IllegalArgumentException: requirement failed: The input column stridx_5a965aeed344 should have at least two distinct values. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.ml.feature.OneHotEncoder$$anonfun$5.apply(OneHotEncoder.scala:113) ... 1: spark.naiveBayes(traindf, clicked ~ ., smoothing = 0) at /home/jenkins/workspace/SparkPullRequestBuilder/R/pkg/tests/fulltests/test_mllib_classification.R:482 2: spark.naiveBayes(traindf, clicked ~ ., smoothing = 0) 3: .local(data, formula, ...) 4: callJStatic("org.apache.spark.ml.r.NaiveBayesWrapper", "fit", formula, data@sdf, smoothing, handleInvalid) 5: invokeJava(isStatic = TRUE, className, methodName, ...) 6: handleErrors(returnStatus, conn) 7: stop(readString(conn)) DONE === Error: Test failures Execution halted ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18538 **[Test build #81369 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81369/testReport)** for PR 18538 at commit [`9abe9e5`](https://github.com/apache/spark/commit/9abe9e560ae12405a480eab325f7a707e8cb1f14). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18538 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19079: [SPARK-21859][CORE] Fix SparkFiles.get failed on driver ...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19079 Please close this PR @lgrcyanny thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19113: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19113 **[Test build #81368 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81368/testReport)** for PR 19113 at commit [`fa7eb51`](https://github.com/apache/spark/commit/fa7eb514cfd91bb405fd74680b08d5865911e3f0). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19113: [SPARK-20978][SQL] Bump up Univocity version to 2...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/19113 [SPARK-20978][SQL] Bump up Univocity version to 2.5.4 ## What changes were proposed in this pull request? There was a bug in Univocity Parser that causes the issue in SPARK-20978. This was fixed as below: ```scala val df = spark.read.schema("a string, b string, unparsed string").option("columnNameOfCorruptRecord", "unparsed").csv(Seq("a").toDS()) df.show() ``` **Before** ``` java.lang.NullPointerException at scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89) at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207) ... ``` **After** ``` +---+++ | a| b|unparsed| +---+++ | a|null| a| +---+++ ``` It was fixed in 2.5.0 and 2.5.4 was released. I guess it'd be safe to upgrade this. ## How was this patch tested? Unit test added in `CSVSuite.scala`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark bump-up-univocity Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19113.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19113 commit fa7eb514cfd91bb405fd74680b08d5865911e3f0 Author: hyukjinkwonDate: 2017-08-26T03:52:40Z Bump up Univocity version to 2.5.4 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17862 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17862 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81367/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17862 **[Test build #81367 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81367/testReport)** for PR 17862 at commit [`cec628b`](https://github.com/apache/spark/commit/cec628ba094fcbaad2fad4a7c5957aa9f5d3d698). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17862 **[Test build #81367 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81367/testReport)** for PR 17862 at commit [`cec628b`](https://github.com/apache/spark/commit/cec628ba094fcbaad2fad4a7c5957aa9f5d3d698). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18975 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18975 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81366/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/18975 There is a difference in Hive's semantics vs what this PR is doing. In Hive, the query execution writes to a staging location and the destination location is cleared + re-populated after the end of query execution (it happens in `MoveTask`). This PR will first wipe the destination location and then perform the query execution to populate the destination location with desired data. I like the hive model because: - If the query execution fails, you atleast have the old data. Insert overwrite to table does the same thing. This PR will leave the output location empty. - Hive achieves atomicity by using a staging dir. With this PR, I am not sure what happens to the output location if the some tasks have written out the final data while rest are still working. Would it have partial output ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18975 **[Test build #81366 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81366/testReport)** for PR 18975 at commit [`2ec9947`](https://github.com/apache/spark/commit/2ec9947af571691966e979e17aec12d3d683decf). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r136727120 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveDirCommand.scala --- @@ -0,0 +1,145 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import scala.language.existentials + +import org.apache.hadoop.fs.{FileSystem, Path} +import org.apache.hadoop.hive.common.FileUtils +import org.apache.hadoop.hive.ql.plan.TableDesc +import org.apache.hadoop.hive.serde.serdeConstants +import org.apache.hadoop.hive.serde2.`lazy`.LazySimpleSerDe +import org.apache.hadoop.mapred._ + +import org.apache.spark.sql.{Row, SparkSession} +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable} +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.execution.SparkPlan +import org.apache.spark.sql.hive.client.HiveClientImpl + +/** + * Command for writing the results of `query` to file system. + * + * The syntax of using this command in SQL is: + * {{{ + * INSERT OVERWRITE [LOCAL] DIRECTORY + * path + * [ROW FORMAT row_format] + * [STORED AS file_format] + * SELECT ... + * }}} + * + * @param isLocal whether the path specified in `storage` is a local directory + * @param storage storage format used to describe how the query result is stored. + * @param query the logical plan representing data to write to + * @param overwrite whthere overwrites existing directory + */ +case class InsertIntoHiveDirCommand( +isLocal: Boolean, +storage: CatalogStorageFormat, +query: LogicalPlan, +overwrite: Boolean) extends SaveAsHiveFile { + + override def children: Seq[LogicalPlan] = query :: Nil + + override def run(sparkSession: SparkSession, children: Seq[SparkPlan]): Seq[Row] = { +assert(children.length == 1) +assert(storage.locationUri.nonEmpty) + +//val Array(cols, types) = children.head.output.foldLeft(Array("", "")) { case (r, a) => --- End diff -- deadcode ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r136727065 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/InsertIntoDataSourceDirCommand.scala --- @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.command + +import org.apache.spark.SparkException +import org.apache.spark.sql._ +import org.apache.spark.sql.catalyst.catalog._ +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.execution.datasources._ + +/** + * A command used to write the result of a query to a directory. + * + * The syntax of using this command in SQL is: + * {{{ + * INSERT OVERWRITE DIRECTORY (path=STRING)? + * USING format OPTIONS ([option1_name "option1_value", option2_name "option2_value", ...]) + * SELECT ... + * }}} + * + * @param storage storage format used to describe how the query result is stored. + * @param provider the data source type to be used + * @param query the logical plan representing data to write to + * @param overwrite whthere overwrites existing directory + */ +case class InsertIntoDataSourceDirCommand( +storage: CatalogStorageFormat, +provider: String, +query: LogicalPlan, +overwrite: Boolean) extends RunnableCommand { + + override def innerChildren: Seq[LogicalPlan] = Seq(query) + + override def run(sparkSession: SparkSession): Seq[Row] = { +assert(innerChildren.length == 1) +assert(storage.locationUri.nonEmpty, "Directory path is required") +assert(!provider.isEmpty, "Data source is required") + +// Create the relation based on the input logical plan: `query`. +val pathOption = storage.locationUri.map("path" -> CatalogUtils.URIToString(_)) + +val dataSource = DataSource( + sparkSession, + className = provider, + options = storage.properties ++ pathOption, + catalogTable = None) + +val isFileFormat = classOf[FileFormat].isAssignableFrom(dataSource.providingClass) +if (!isFileFormat) { + throw new SparkException( +"Only Data Sources providing FileFormat are supported.") --- End diff -- nit: you could also include `dataSource.providingClass` in there to make it easy to reason / debug. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r136726991 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala --- @@ -178,11 +179,50 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging } /** - * Add an INSERT INTO [TABLE]/INSERT OVERWRITE TABLE operation to the logical plan. + * Parameters used for writing query to a directory: (isLocal, CatalogStorageFormat, provider). + */ + type InsertDirParams = (Boolean, CatalogStorageFormat, Option[String]) + + /** + * Add an + * INSERT INTO [TABLE] or + * INSERT OVERWRITE TABLE or + * INSERT OVERWRITE [LOCAL] DIRECTORY + * operation to logical plan */ private def withInsertInto( ctx: InsertIntoContext, query: LogicalPlan): LogicalPlan = withOrigin(ctx) { +assert(ctx.children.size == 1) --- End diff -- @janewangfb : thanks for the explanation. I got confused if this would have affected multi-inserts. ie. ``` FROM from_statement INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1 [INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ... ``` One can even mix in `INSERT OVERWRITE TABLE ... ` in hive. You should have a test case for that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19090: [SPARK-21877][DEPLOY, WINDOWS] Handle quotes in Windows ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19090 Thanks for thorough testing. Yea, looks fine. Will take a look few times more by myself. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r136726921 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala --- @@ -1509,4 +1509,86 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf) { query: LogicalPlan): LogicalPlan = { RepartitionByExpression(expressions, query, conf.numShufflePartitions) } + + /** + * Return the parameters for [[InsertIntoDir]] logical plan. + * + * Expected format: + * {{{ + * INSERT OVERWRITE DIRECTORY + * [path] + * [OPTIONS table_property_list] + * select_statement; + * }}} + */ + override def visitInsertOverwriteDir( + ctx: InsertOverwriteDirContext): InsertDirParams = withOrigin(ctx) { +if (ctx.LOCAL != null) { + throw new ParseException( +"LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source", ctx) +} + +val options = Option(ctx.options).map(visitPropertyKeyValues).getOrElse(Map.empty) +var storage = DataSource.buildStorageFormatFromOptions(options) + +val path = Option(ctx.path).map(string).getOrElse("") + +if (!path.isEmpty && storage.locationUri.isDefined) { + throw new ParseException( +"Directory path and 'path' in OPTIONS are both used to indicate the directory path, " + + "you can only specify one of them.", ctx) +} +if (path.isEmpty && storage.locationUri.isEmpty) { --- End diff -- you can collapse this check and one above by doing XOR --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/18975#discussion_r136726866 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala --- @@ -1509,4 +1509,86 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf) { query: LogicalPlan): LogicalPlan = { RepartitionByExpression(expressions, query, conf.numShufflePartitions) } + + /** + * Return the parameters for [[InsertIntoDir]] logical plan. + * + * Expected format: + * {{{ + * INSERT OVERWRITE DIRECTORY + * [path] + * [OPTIONS table_property_list] + * select_statement; + * }}} + */ + override def visitInsertOverwriteDir( + ctx: InsertOverwriteDirContext): InsertDirParams = withOrigin(ctx) { +if (ctx.LOCAL != null) { + throw new ParseException( +"LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source", ctx) +} + +val options = Option(ctx.options).map(visitPropertyKeyValues).getOrElse(Map.empty) +var storage = DataSource.buildStorageFormatFromOptions(options) + +val path = Option(ctx.path).map(string).getOrElse("") + +if (!path.isEmpty && storage.locationUri.isDefined) { + throw new ParseException( +"Directory path and 'path' in OPTIONS are both used to indicate the directory path, " + + "you can only specify one of them.", ctx) +} +if (path.isEmpty && storage.locationUri.isEmpty) { + throw new ParseException( +"You need to specify directory path or 'path' in OPTIONS, but not both", ctx) --- End diff -- nit: `but not both` does not seem in sync with the check because both `path` and `locationUri` are not specified for the check to be true. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19112: [SPARK-21901][SS] Define toString for StateOperat...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19112#discussion_r136726682 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/streaming/progress.scala --- @@ -177,11 +179,11 @@ class SourceProgress protected[sql]( } ("description" -> JString(description)) ~ - ("startOffset" -> tryParse(startOffset)) ~ - ("endOffset" -> tryParse(endOffset)) ~ - ("numInputRows" -> JInt(numInputRows)) ~ - ("inputRowsPerSecond" -> safeDoubleToJValue(inputRowsPerSecond)) ~ - ("processedRowsPerSecond" -> safeDoubleToJValue(processedRowsPerSecond)) --- End diff -- Yep. I guess that those are the less frequent style. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18865: [SPARK-21610][SQL] Corrupt records are not handled prope...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18865 A use case might be: ``` echo '{"field": 1 {"field" 2} {"field": 3}' >/tmp/sample.json ``` ```scala val file = "/tmp/sample.json" val dfFromFile = spark.read.schema("field BYTE, _corrupt_record STRING").json(file) dfFromFile.show(false) dfFromFile.filter($"_corrupt_record".isNotNull).count() dfFromFile.filter($"_corrupt_record".isNull).count() dfFromFile.select($"_corrupt_record").show() ``` ``` scala> dfFromFile.show(false) +-+---+ |field|_corrupt_record| +-+---+ |null |{"field": 1| |null | {"field" 2} | |3|null | +-+---+ scala> dfFromFile.filter($"_corrupt_record".isNotNull).count() res1: Long = 2 scala> dfFromFile.filter($"_corrupt_record".isNull).count() res2: Long = 1 scala> dfFromFile.select($"_corrupt_record").show() +---+ |_corrupt_record| +---+ |{"field": 1| |{"field" 2}| | null| +---+ ``` I was thinking of avoiding to blacklist a case which works before without an exception but it looks we (at least I) are less sure if it is a bug. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19060: [WIP][SQL] Add DataSourceSuite validating data sources l...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19060 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19060: [WIP][SQL] Add DataSourceSuite validating data sources l...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19060 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81365/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19060: [WIP][SQL] Add DataSourceSuite validating data sources l...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19060 **[Test build #81365 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81365/testReport)** for PR 19060 at commit [`fb72c89`](https://github.com/apache/spark/commit/fb72c89478149da99bd7ac402257ab7468156f9d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org