[GitHub] spark issue #19113: [SPARK-20978][SQL] Bump up Univocity version to 2.5.4
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19113 This release of Univocity was just out a few days ago. To me, this sounds risky. We normally do not upgrade it to the latest version. This is why we are not using Parquet 1.9.0. Instead, we are asking the Parquet community to release 1.8.2. cc @rxin @marmbrus @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19128: Merge pull request #1 from apache/master
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19128 This looks mistakenly opened. Could you close this please? @sphinx-jiang
[GitHub] spark issue #19119: [SPARK-21845] [SQL] [test-maven] Make codegen fallback o...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19119 Sure
[GitHub] spark pull request #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19116#discussion_r136905307 --- Diff: project/SparkBuild.scala --- @@ -163,14 +163,15 @@ object SparkBuild extends PomBuild { val configUrlV = scalastyleConfigUrl.in(config).value val streamsV = streams.in(config).value val failOnErrorV = true +val failOnWarningV = false --- End diff -- I've noticed this, but it looks like we only have error-level rules in `scalastyle-config.xml`. I wasn't aware of a rewriting hidden there.
[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18975 I am not opposing the solution of the staging directory, but just want to understand what kind of services we can guarantee.
[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18975 Even renaming the staging location is not atomic, right? Hive users might still see the extra data during the rename, right? I am not sure how Hive works. If our underlying system is Linux, the [Linux rename(3) man page](https://linux.die.net/man/3/rename) shows > If one or more processes have the file open when the last link is removed, the link shall be removed before rename() returns, but the removal of the file contents shall be postponed until all references to the file are closed.
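The quoted rename() semantics can be sketched with a minimal, self-contained example (plain Python on a local POSIX filesystem, not Hive or Spark code): the rename itself is atomic at the path level, but a reader that already holds the old file open keeps seeing the old contents.

```python
import os
import tempfile


def demo_atomic_rename():
    """Swap a staging file into place and show what a pre-existing reader sees.

    os.replace() is an atomic rename on POSIX: any path lookup observes either
    the old file or the new one, never a partially renamed state. A handle
    opened before the rename still points at the old inode, matching the
    man-page text quoted above.
    """
    d = tempfile.mkdtemp()
    target = os.path.join(d, "data.txt")
    staging = os.path.join(d, "data.txt.staging")

    with open(target, "w") as f:
        f.write("old contents")
    with open(staging, "w") as f:
        f.write("new contents")

    # Open the target before the rename, simulating a concurrent reader.
    reader = open(target, "r")

    os.replace(staging, target)  # atomic within one filesystem

    seen_by_old_reader = reader.read()  # still the old inode's contents
    reader.close()
    with open(target) as f:
        seen_after_rename = f.read()
    return seen_by_old_reader, seen_after_rename
```

This only demonstrates per-file rename; a staging *directory* swap on HDFS has different guarantees, which is exactly the uncertainty raised in the comment.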
[GitHub] spark pull request #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19116#discussion_r136904086 --- Diff: project/SparkBuild.scala --- @@ -163,14 +163,15 @@ object SparkBuild extends PomBuild { val configUrlV = scalastyleConfigUrl.in(config).value val streamsV = streams.in(config).value val failOnErrorV = true +val failOnWarningV = false --- End diff -- Thanks @viirya, actually I found this while double checking `println`. FWIW, I double checked `import scala.Predef.{println => _, _}` still fails.
[GitHub] spark pull request #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19116#discussion_r136903766 --- Diff: project/SparkBuild.scala --- @@ -163,14 +163,15 @@ object SparkBuild extends PomBuild { val configUrlV = scalastyleConfigUrl.in(config).value val streamsV = streams.in(config).value val failOnErrorV = true +val failOnWarningV = false --- End diff -- I disabled this since we did not do it before, and I found we actually seem to replace a case with `warn` on sbt compile and test - https://github.com/apache/spark/pull/19116/files#diff-c3580fe26fb42eb3aac6e180ae11e947R139 although every level in `scalastyle-config.xml` is `error`.
[GitHub] spark issue #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19116 LGTM
[GitHub] spark issue #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19116 **[Test build #81397 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81397/testReport)** for PR 19116 at commit [`db0dee5`](https://github.com/apache/spark/commit/db0dee5b34a865da43ac50db85fc4151af598959).
[GitHub] spark issue #19128: Merge pull request #1 from apache/master
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19128 Can one of the admins verify this patch?
[GitHub] spark pull request #19128: Merge pull request #1 from apache/master
GitHub user sphinx-jiang opened a pull request: https://github.com/apache/spark/pull/19128 Merge pull request #1 from apache/master 9.5 update ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sphinx-jiang/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19128.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19128 commit 4646ba85c8808f94c4f6b56edd222a276e82abd1 Author: sphinx-jiang Date: 2017-09-05T05:40:09Z Merge pull request #1 from apache/master 9.5 update
[GitHub] spark issue #19111: [SPARK-21801][SPARKR][TEST][WIP] set random seed for pre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19111 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81395/ Test PASSed.
[GitHub] spark issue #19111: [SPARK-21801][SPARKR][TEST][WIP] set random seed for pre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19111 Merged build finished. Test PASSed.
[GitHub] spark issue #19111: [SPARK-21801][SPARKR][TEST][WIP] set random seed for pre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19111 **[Test build #81395 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81395/testReport)** for PR 19111 at commit [`5d156be`](https://github.com/apache/spark/commit/5d156be92fd3cfe8af30094fd759909ce5455d8f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19082: [SPARK-21870][SQL] Split aggregation code into sm...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19082#discussion_r136900835 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala --- @@ -244,6 +246,89 @@ case class HashAggregateExec( protected override val shouldStopRequired = false + // Extracts all the input variable references for a given `aggExpr`. This result will be used + // to split aggregation into small functions. + private def getInputVariableReferences( + ctx: CodegenContext, + aggExpr: Expression, + subExprs: Map[Expression, SubExprEliminationState]): Set[(String, String)] = { +// `argSet` collects all the pairs of variable names and their types, the first in the pair is +// a type name and the second is a variable name. +val argSet = mutable.Set[(String, String)]() +val stack = mutable.Stack[Expression](aggExpr) +while (stack.nonEmpty) { + stack.pop() match { +case e if subExprs.contains(e) => + val exprCode = subExprs(e) + if (CodegenContext.isJavaIdentifier(exprCode.value)) { +argSet += ((ctx.javaType(e.dataType), exprCode.value)) + } + if (CodegenContext.isJavaIdentifier(exprCode.isNull)) { +argSet += (("boolean", exprCode.isNull)) + } + // Since the children possibly has common expressions, we push them here + stack.pushAll(e.children) +case ref: BoundReference +if ctx.currentVars != null && ctx.currentVars(ref.ordinal) != null => + val value = ctx.currentVars(ref.ordinal).value + val isNull = ctx.currentVars(ref.ordinal).isNull + if (CodegenContext.isJavaIdentifier(value)) { +argSet += ((ctx.javaType(ref.dataType), value)) + } + if (CodegenContext.isJavaIdentifier(isNull)) { +argSet += (("boolean", isNull)) + } +case _: BoundReference => + argSet += (("InternalRow", ctx.INPUT_ROW)) +case e => + stack.pushAll(e.children) + } +} + +argSet.toSet + } + + // Splits the aggregation into small functions because the HotSpot does not compile + // too long functions. 
+ private def splitAggregateExpressions( + ctx: CodegenContext, + aggExprs: Seq[Expression], + evalAndUpdateCodes: Seq[String], + subExprs: Map[Expression, SubExprEliminationState], + otherArgs: Seq[(String, String)] = Seq.empty): Seq[String] = { +aggExprs.zipWithIndex.map { case (aggExpr, i) => + // The maximum number of parameters in Java methods is 255, so this method gives up splitting + // the code if the number goes over the limit. + // You can find more information about the limit in the JVM specification: + // - The number of method parameters is limited to 255 by the definition of a method + // descriptor, where the limit includes one unit for this in the case of instance + // or interface method invocations. + val args = (getInputVariableReferences(ctx, aggExpr, subExprs) ++ otherArgs).toSeq + + // This is for testing/benchmarking only + val maxParamNumInJavaMethod = + sqlContext.getConf("spark.sql.codegen.aggregate.maxParamNumInJavaMethod", null) match { +case null | "" => 255 --- End diff -- If line 314 uses `<=`, this should be 254. In the previous commit, `<` is used. ---
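The splitting strategy in the diff above can be modeled with a simplified, hypothetical sketch (plain Python, not the actual Spark codegen): each aggregate expression becomes a helper function whose parameters are the variables it references, and splitting is skipped when the parameter count would exceed the JVM's 255-parameter limit on method descriptors.

```python
# Simplified model of parameter-limit-aware code splitting. All names here
# (split_aggregate_exprs, the (name, body, refs) tuples) are illustrative,
# not the real HashAggregateExec API.
MAX_PARAMS_IN_JAVA_METHOD = 255


def split_aggregate_exprs(exprs):
    """exprs: list of (name, body_code, referenced_var_names) tuples.

    Returns one generated helper-function source string per expression, or
    None for expressions that cannot be split because they would need more
    parameters than a Java method descriptor allows.
    """
    helpers = []
    for i, (name, body, refs) in enumerate(exprs):
        args = sorted(refs)  # deterministic parameter order
        if len(args) > MAX_PARAMS_IN_JAVA_METHOD:
            helpers.append(None)  # give up: keep this expression inline
            continue
        params = ", ".join(f"double {a}" for a in args)
        helpers.append(f"private void agg_{name}_{i}({params}) {{ {body} }}")
    return helpers
```

The review comment applies directly to the cap: if the generated call passes `this` implicitly, the effective user-visible limit is 254, and the comparison operator (`<` vs `<=`) must match the constant.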
[GitHub] spark pull request #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19116#discussion_r136899937 --- Diff: repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala --- @@ -19,7 +19,9 @@ package org.apache.spark.repl import java.io.BufferedReader +// scalastyle:off println import scala.Predef.{println => _, _} +// scalastyle:on println --- End diff -- I said it's weird because this is obviously not a place to print something out. Not much harm, actually.
[GitHub] spark issue #19127: [SPARK-21916][SQL] Set isolationOn=true when create hive...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19127 **[Test build #81396 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81396/testReport)** for PR 19127 at commit [`2d13ab8`](https://github.com/apache/spark/commit/2d13ab8a18955e281033c17a446022aba57865f8).
[GitHub] spark pull request #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19116#discussion_r136898192 --- Diff: repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala --- @@ -19,7 +19,9 @@ package org.apache.spark.repl import java.io.BufferedReader +// scalastyle:off println import scala.Predef.{println => _, _} +// scalastyle:on println --- End diff -- This actually looks valid though. If I manually add ` import scala.Predef.{println => _, _}` somewhere other than here, for example, `SQLConf` in the current master: ``` [error] .../spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:26:21: Are you sure you want to println? If yes, wrap the code block with [error] // scalastyle:off println [error] println(...) [error] // scalastyle:on println [error] .../spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:26:0: scala.Predef. is in wrong order relative to scala.collection.immutable. ``` It looks like it recognises this as an error. It looks like 1.0.0 fixes an issue about that style checking and detection. We might have to fix the `println` token checker rule, but I guess this should be orthogonal anyway.
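The behavior discussed above can be illustrated with a rough, hypothetical sketch of a token-based lint rule (plain Python, not the actual scalastyle implementation): it flags any `println` token not inside an off/on suppression region, which is why even the import rename `{println => _, _}` needs the `// scalastyle:off println` / `// scalastyle:on println` comments.

```python
import re


def find_println_violations(source: str):
    """Return line numbers containing a `println` token outside a
    `scalastyle:off println` ... `scalastyle:on println` region.

    This is a deliberately naive, line-based approximation of a token rule;
    it does not parse Scala, which is exactly why such rules also trip on
    import renames.
    """
    violations = []
    suppressed = False
    for lineno, line in enumerate(source.splitlines(), start=1):
        if "scalastyle:off println" in line:
            suppressed = True
        elif "scalastyle:on println" in line:
            suppressed = False
        elif not suppressed and re.search(r"\bprintln\b", line):
            violations.append(lineno)
    return violations
```

With this model, the bare import is flagged while the off/on-wrapped version from the diff passes.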
[GitHub] spark pull request #19127: [SPARK-21916][SQL] Set isolationOn=true when crea...
GitHub user jinxing64 opened a pull request: https://github.com/apache/spark/pull/19127 [SPARK-21916][SQL] Set isolationOn=true when create hive client for metadata. ## What changes were proposed in this pull request? In the current code, we set `isolationOn=!isCliSession()` when creating the hive client for metadata. However, the conf of `CliSessionState` points to a local dummy metastore (https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L416). Using `CliSessionState`, we fail to get metadata from the remote hive metastore. We can always set `isolationOn=true` when creating the hive client for metadata. ## How was this patch tested? Existing. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jinxing64/spark SPARK-21916 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19127.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19127 commit 2d13ab8a18955e281033c17a446022aba57865f8 Author: jinxing Date: 2017-09-05T05:28:06Z Set isolationOn=true when create hive client for metadata.
[GitHub] spark issue #19111: [SPARK-21801][SPARKR][TEST][WIP] set random seed for pre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19111 **[Test build #81395 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81395/testReport)** for PR 19111 at commit [`5d156be`](https://github.com/apache/spark/commit/5d156be92fd3cfe8af30094fd759909ce5455d8f).
[GitHub] spark issue #19111: [SPARK-21801][SPARKR][TEST][WIP] set random seed for pre...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19111 jenkins, retest this please
[GitHub] spark pull request #19116: [SPARK-21903][BUILD] Upgrade scalastyle to 1.0.0.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19116#discussion_r136894278 --- Diff: repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala --- @@ -19,7 +19,9 @@ package org.apache.spark.repl import java.io.BufferedReader +// scalastyle:off println import scala.Predef.{println => _, _} +// scalastyle:on println --- End diff -- Nit: This looks a bit weird.
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17014 Merged build finished. Test PASSed.
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17014 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81393/ Test PASSed.
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17014 **[Test build #81393 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81393/testReport)** for PR 17014 at commit [`f8fa957`](https://github.com/apache/spark/commit/f8fa9573a1b40ff236e9c52cf429e2742c8f2bd0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19112: [SPARK-21901][SS] Define toString for StateOperat...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/19112#discussion_r136892336 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/streaming/progress.scala --- @@ -200,7 +202,7 @@ class SourceProgress protected[sql]( */ @InterfaceStability.Evolving class SinkProgress protected[sql]( -val description: String) extends Serializable { --- End diff -- not a committer but would like to leave this suggestion: - codestyle changes are orthogonal to the motive of the PR and should be done separately. Generally, every PR should address one problem and not include changes unrelated to it. In the event of a revert or of bisecting commits to pinpoint a regression, following this practice helps a lot. - It would be beneficial to see why checkstyle does not catch such instances and fix that (along with making all such instances consistent with the rules). Otherwise this would be a one-off fix and we would continue to pile up similar inconsistencies in future development without anyone realising it.
[GitHub] spark issue #19124: [SPARK-21912][SQL] Creating ORC datasource table should ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19124 Merged build finished. Test PASSed.
[GitHub] spark issue #19126: Model 1 and Model 2 ParamMaps Missing
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19126 Can one of the admins verify this patch?
[GitHub] spark issue #19124: [SPARK-21912][SQL] Creating ORC datasource table should ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19124 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81394/ Test PASSed.
[GitHub] spark issue #19124: [SPARK-21912][SQL] Creating ORC datasource table should ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19124 **[Test build #81394 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81394/testReport)** for PR 19124 at commit [`a738943`](https://github.com/apache/spark/commit/a73894374d284484d9b28123db02dfe6f264567a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19126: Model 1 and Model 2 ParamMaps Missing
GitHub user marktab opened a pull request: https://github.com/apache/spark/pull/19126 Model 1 and Model 2 ParamMaps Missing The original Scala code says println("Model 2 was fit using parameters: " + model2.parent.extractParamMap). The parent is `lr`. There is no method for accessing `parent` as is done in Scala. This code has been tested in Python, and returns values consistent with Scala. ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/marktab/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19126.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19126 commit 76e5da7b14d71338cf82352e9cf5628640e732a2 Author: MarkTab marktab.net Date: 2017-09-05T03:26:07Z Model 1 and Model 2 ParamMaps Missing The original Scala code says println("Model 2 was fit using parameters: " + model2.parent.extractParamMap). The parent is `lr`. There is no method for accessing `parent` as is done in Scala. This code has been tested in Python, and returns values consistent with Scala
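The `model.parent.extractParamMap` pattern discussed above can be illustrated with a minimal, hypothetical sketch (plain Python, not the actual PySpark API): a fitted model keeps a reference to the estimator that produced it, so the fit-time parameters are recovered through the parent rather than from the model itself.

```python
# Illustrative estimator/model pair; class and method names are invented
# for this sketch and do not match pyspark.ml exactly.
class Estimator:
    def __init__(self, **params):
        self.params = dict(params)

    def fit(self, data):
        model = Model(coefficients=sum(data))
        model.parent = self  # record which estimator produced this model
        return model

    def extract_param_map(self):
        # The parameters this estimator would use for a fit.
        return dict(self.params)


class Model:
    def __init__(self, coefficients):
        self.coefficients = coefficients
        self.parent = None  # set by Estimator.fit
```

Usage mirrors the Scala line quoted in the PR: `model.parent.extract_param_map()` yields the parameters the fit was run with.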
[GitHub] spark issue #19121: [SPARK-21906][YARN][Spark Core]Don't runAsSparkUser to s...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19121 UGI is only used for security; normally it is used for the Spark application to communicate with Hadoop as the correct user. doAs already wraps the whole `CoarseGrainedExecutorBackend` process; all the task threads forked in this process will honor this UGI, so there is no need to wrap each task again.
[GitHub] spark issue #19121: [SPARK-21906][YARN][Spark Core]Don't runAsSparkUser to s...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/19121 @jerryshao 1. I didn't meet any problems; these codes run fine even if they are unnecessary. 2. In Standalone mode, if collaborating with a secured hdfs, we might not have support yet. Besides, this ugi `doAs` wraps the executors' initialization but not task execution; if we truly want to `doAs` a `SPARK_USER`, this ugi may need to be used in both phases.
[GitHub] spark issue #19125: [SPARK-21913][SQL][TEST] `withDatabase` should drop data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19125 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81392/ Test PASSed.
[GitHub] spark issue #19125: [SPARK-21913][SQL][TEST] `withDatabase` should drop data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19125 Merged build finished. Test PASSed.
[GitHub] spark issue #19125: [SPARK-21913][SQL][TEST] `withDatabase` should drop data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19125 **[Test build #81392 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81392/testReport)** for PR 19125 at commit [`241d565`](https://github.com/apache/spark/commit/241d56563ed278828567eb8f78029a8e70e96c5d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19124: [SPARK-21912][SQL] Creating ORC datasource table should ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19124 **[Test build #81394 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81394/testReport)** for PR 19124 at commit [`a738943`](https://github.com/apache/spark/commit/a73894374d284484d9b28123db02dfe6f264567a).
[GitHub] spark issue #19121: [SPARK-21906][YARN][Spark Core]Don't runAsSparkUser to s...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19121 Can you please elaborate on the problem you met? Did you meet any unexpected behavior? The changes here get rid of the env variable "SPARK_USER"; this might be OK for a yarn application, but what if a user runs on standalone mode and explicitly sets "SPARK_USER"? Your change seems to break the semantics.
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17014 **[Test build #81393 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81393/testReport)** for PR 17014 at commit [`f8fa957`](https://github.com/apache/spark/commit/f8fa9573a1b40ff236e9c52cf429e2742c8f2bd0).
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/17014 Jenkins, retest this please
[GitHub] spark issue #13794: [SPARK-15574][ML][PySpark] Python meta-algorithms in Sca...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/13794 +1 @jkbradley For now it is better to keep the current implementation for the 4 meta-algorithms in pyspark. @yinxusen Would you mind closing this PR? But I still appreciate your contribution to this!
[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r136878102 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala --- @@ -169,6 +171,16 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable } } } + + private def checkFieldName(name: String): Unit = { +// ,;{}()\n\t= and space are special characters in ORC schema --- End diff -- Thank you for the review, @tejasapatil ! That's a good idea. Right, it's not an exhaustive list. I'll update the PR.
[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/19124#discussion_r136877087 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala --- @@ -169,6 +171,16 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable } } } + + private def checkFieldName(name: String): Unit = { +// ,;{}()\n\t= and space are special characters in ORC schema --- End diff -- Is this an exhaustive list? E.g., it looks like `?` is not allowed either. Given that the underlying lib (ORC) can evolve to support / not support certain chars, it's safer to rely on some method rather than coming up with a blacklist. Can you simply call `TypeInfoUtils.getTypeInfoFromTypeString` or any related method which would do this check? ``` Caused by: java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct' but '?' is found. at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:483) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfoFromTypeString(TypeInfoUtils.java:770) at org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:194) at org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:231) at org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:91) ... ... ```
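The blacklist-style check from the diff can be sketched as follows (plain Python, not the actual Spark/ORC code; the character set is taken from the diff's comment). As the review points out, such a blacklist is fragile, and delegating to the underlying type parser is the safer alternative.

```python
# Characters the diff's comment lists as special in an ORC schema.
ORC_SPECIAL_CHARS = set(",;{}()\n\t= ")


def check_field_name(name: str) -> None:
    """Reject column names containing ORC-special characters.

    A minimal sketch of the validation; a real implementation would rather
    ask the schema parser whether the name round-trips, so the rule cannot
    drift out of sync with the underlying library.
    """
    bad = [c for c in name if c in ORC_SPECIAL_CHARS]
    if bad:
        raise ValueError(
            f"Column name {name!r} contains invalid character(s) for ORC: {bad}")
```

Note that `?`, which triggers the quoted `IllegalArgumentException`, is not in this set, illustrating exactly why the reviewer considers the blacklist non-exhaustive.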
[GitHub] spark issue #19125: [SPARK-21913][SQL][TEST] `withDatabase` should drop data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19125 **[Test build #81392 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81392/testReport)** for PR 19125 at commit [`241d565`](https://github.com/apache/spark/commit/241d56563ed278828567eb8f78029a8e70e96c5d).
[GitHub] spark pull request #19125: [SPARK-21913][SQL][TEST] `withDatabase` should dr...
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/19125 [SPARK-21913][SQL][TEST] `withDatabase` should drop database with CASCADE ## What changes were proposed in this pull request? Currently, `withDatabase` fails if the database is not empty. It would be great if we drop cleanly with CASCADE. ## How was this patch tested? This is a change on test util. Pass the existing Jenkins. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dongjoon-hyun/spark SPARK-21913 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19125.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19125 commit 241d56563ed278828567eb8f78029a8e70e96c5d Author: Dongjoon Hyun Date: 2017-09-04T23:23:37Z [SPARK-21913][SQL][TEST] `withDatabase` should drop database with CASCADE
[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18692 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81390/ Test PASSed.
[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18692 Merged build finished. Test PASSed.
[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18692 **[Test build #81390 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81390/testReport)** for PR 18692 at commit [`cfeae46`](https://github.com/apache/spark/commit/cfeae46766a6ccb1b1a0113fe41cdb52b16897f3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19124: [SPARK-21912][SQL] Creating ORC datasource table should ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19124 Merged build finished. Test PASSed.
[GitHub] spark issue #19124: [SPARK-21912][SQL] Creating ORC datasource table should ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19124 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81391/ Test PASSed.
[GitHub] spark issue #19124: [SPARK-21912][SQL] Creating ORC datasource table should ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19124 **[Test build #81391 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81391/testReport)** for PR 19124 at commit [`808dfe0`](https://github.com/apache/spark/commit/808dfe0fcd9de2f43b33f0d1d084172b5624f2a8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19123: [SPARK-21418][SQL] NoSuchElementException: None.g...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19123
[GitHub] spark issue #19119: [SPARK-21845] [SQL] Make codegen fallback of expressions...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19119 Hi, @gatorsmile. Could you trigger the Maven build, too?
[GitHub] spark issue #19123: [SPARK-21418][SQL] NoSuchElementException: None.get in D...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/19123 LGTM, merging to master/2.2. Thanks for picking this up!
[GitHub] spark issue #19123: [SPARK-21418][SQL] NoSuchElementException: None.get in D...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19123 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81388/ Test PASSed.
[GitHub] spark issue #19123: [SPARK-21418][SQL] NoSuchElementException: None.get in D...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19123 Merged build finished. Test PASSed.
[GitHub] spark issue #19124: [SPARK-21912][SQL] Creating ORC datasource table should ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19124 **[Test build #81391 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81391/testReport)** for PR 19124 at commit [`808dfe0`](https://github.com/apache/spark/commit/808dfe0fcd9de2f43b33f0d1d084172b5624f2a8).
[GitHub] spark issue #19123: [SPARK-21418][SQL] NoSuchElementException: None.get in D...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19123 **[Test build #81388 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81388/testReport)** for PR 19123 at commit [`735ca94`](https://github.com/apache/spark/commit/735ca949e042493632d297db23286a8f8f83a6ed). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/19124 [SPARK-21912][SQL] Creating ORC datasource table should check invalid column names ## What changes were proposed in this pull request? Currently, users hit job abortions when creating ORC data source tables with invalid column names. We had better prevent this by raising an **AnalysisException** with a guide to use aliases instead, as Parquet data source tables do. **BEFORE** ```scala scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`") 17/09/04 13:28:21 ERROR Utils: Aborting task java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct' but ' ' is found. 17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted. 17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) org.apache.spark.SparkException: Task failed while writing rows. ``` **AFTER** ```scala scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`") 17/09/04 13:27:40 ERROR CreateDataSourceTableAsSelectCommand: Failed to write to table orc1 org.apache.spark.sql.AnalysisException: Attribute name "a b" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.; ``` ## How was this patch tested? Pass the Jenkins with a new test case. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dongjoon-hyun/spark SPARK-21912 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19124.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19124 commit 808dfe0fcd9de2f43b33f0d1d084172b5624f2a8 Author: Dongjoon Hyun Date: 2017-09-04T20:46:15Z [SPARK-21912][SQL] Creating ORC datasource table should check invalid column names
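The AFTER behavior above boils down to a plain character check before writing. This is a minimal standalone sketch, not the PR's Scala code: `ValueError` stands in for Spark's `AnalysisException`, and the function name is illustrative.

```python
# Characters the ORC schema string treats as special, per the PR description.
INVALID_CHARS = set(" ,;{}()\n\t=")

def check_column_name(name: str) -> None:
    # Reject column names containing any special character; ValueError
    # stands in for Spark's AnalysisException.
    if any(c in INVALID_CHARS for c in name):
        raise ValueError(
            'Attribute name "%s" contains invalid character(s) among '
            '" ,;{}()\\n\\t=". Please use alias to rename it.' % name
        )

check_column_name("id")  # passes silently
try:
    check_column_name("a b")
except ValueError as e:
    print(e)
```

Checking names eagerly at analysis time, rather than letting the writer task fail, is what turns the mid-job abort in BEFORE into the clean error in AFTER.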
[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/18692#discussion_r136868330 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -152,3 +152,71 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper { if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType)) } } + +/** + * A rule that uses propagated constraints to infer join conditions. The optimization is applicable + * only to CROSS joins. --- End diff -- Can you also mention the reason why we are restricting this to cross joins only? ``` For other join types, adding inferred join conditions would potentially shuffle the children, as a child node's partitioning won't satisfy the JOIN node's requirements, which it otherwise could have. ```
[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r136868226 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -100,31 +113,53 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String) val eval = $(evaluator) val epm = $(estimatorParamMaps) val numModels = epm.length -val metrics = new Array[Double](epm.length) + +// Create execution context based on $(parallelism) +val executionContext = getExecutionContext --- End diff -- In the corresponding PR for the PySpark implementation, the number of threads is limited by the number of models to be trained (https://github.com/WeichenXu123/spark/blob/be2f3d0ec50db4730c9e3f9a813a4eb96889f5b6/python/pyspark/ml/tuning.py#L261). We might do the same here, for instance by overriding the `getParallelism` method. What do you think about this?
[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18692 **[Test build #81390 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81390/testReport)** for PR 18692 at commit [`cfeae46`](https://github.com/apache/spark/commit/cfeae46766a6ccb1b1a0113fe41cdb52b16897f3).
[GitHub] spark issue #19115: [SPARK-21882][CORE] OutputMetrics doesn't count written ...
Github user markhamstra commented on the issue: https://github.com/apache/spark/pull/19115 And now I see that the title was changed to something more useful. Pardon any offense; the end result of the title changes looks good.
[GitHub] spark issue #19115: [SPARK-21882][CORE] OutputMetrics doesn't count written ...
Github user markhamstra commented on the issue: https://github.com/apache/spark/pull/19115 I realize this PR is now closed, but to follow up on Saisai's request concerning PR titles, I'll also note that the title of this PR isn't very useful even after the JIRA id and component tag are added. Titles like "fixed foo" or "updated bar" don't really tell reviewers, or those looking at the commit logs in the future, what the PR is about. The JIRA should tell us _why_ a change or addition is needed, the description in the PR should tell us _what_ was changed or added, and the PR title should give us enough of an idea of what is going on that we don't necessarily have to open the PR or look at the code changes just to see whether it is something we are even at all interested in.
[GitHub] spark issue #19111: [SPARK-21801][SPARKR][TEST][WIP] set random seed for pre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19111 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81389/ Test PASSed.
[GitHub] spark issue #19111: [SPARK-21801][SPARKR][TEST][WIP] set random seed for pre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19111 Merged build finished. Test PASSed.
[GitHub] spark issue #19111: [SPARK-21801][SPARKR][TEST][WIP] set random seed for pre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19111 **[Test build #81389 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81389/testReport)** for PR 19111 at commit [`5d156be`](https://github.com/apache/spark/commit/5d156be92fd3cfe8af30094fd759909ce5455d8f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/18975 @gatorsmile : Yes. Hive is not 100% atomic, as stuff can go wrong between removing old data and renaming the staging location. But it's superior in these regards: - Hive would output "no data" OR "complete data". Here we can have "no data" OR "incomplete data" OR "complete data". The "incomplete data" part worries me. A staging dir helps achieve the "you either see nothing OR everything" behaviour. - The window of "you see nothing" is much bigger here compared to Hive, as the output location is cleaned up before execution.
[GitHub] spark issue #19111: [SPARK-21801][SPARKR][TEST][WIP] set random seed for pre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19111 **[Test build #81389 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81389/testReport)** for PR 19111 at commit [`5d156be`](https://github.com/apache/spark/commit/5d156be92fd3cfe8af30094fd759909ce5455d8f).
[GitHub] spark issue #19111: [SPARK-21801][SPARKR][TEST][WIP] set random seed for pre...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19111 jenkins, retest this please
[GitHub] spark issue #19111: [SPARK-21801][SPARKR][TEST][WIP] set random seed for pre...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19111 + @shivaram, could you do a quick review? Given this change, I'd love to get some feedback.
[GitHub] spark issue #19111: [SPARK-21801][SPARKR][TEST][WIP] set random seed for pre...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19111 Yes, the issue is with random sampling, and this PR should fix all of these. I'm not sure why I haven't seen them much before - they have been around for years. I appreciate you bringing these up; we should track them with JIRA.
[GitHub] spark issue #19111: [SPARK-21801][SPARKR][TEST][WIP] set random seed for pre...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19111 jenkins, retest this please
[GitHub] spark issue #19123: [SPARK-21418][SQL] NoSuchElementException: None.get in D...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19123 **[Test build #81388 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81388/testReport)** for PR 19123 at commit [`735ca94`](https://github.com/apache/spark/commit/735ca949e042493632d297db23286a8f8f83a6ed).
[GitHub] spark pull request #19123: [SPARK-21418][SQL] NoSuchElementException: None.g...
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/19123 [SPARK-21418][SQL] NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true ## What changes were proposed in this pull request? If no SparkConf is available to Utils.redact, simply don't redact. ## How was this patch tested? Existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/srowen/spark SPARK-21418 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19123.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19123 commit 735ca949e042493632d297db23286a8f8f83a6ed Author: Sean Owen Date: 2017-09-04T17:32:00Z Don't fail with NPE in corner case where Utils.redact happens outside active session
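The fix described above ("if no SparkConf is available, simply don't redact") can be sketched in isolation. This is an illustrative sketch only: `redact` and `REDACTION_PATTERN` are assumed names, not Spark's actual API.

```python
import re

# Assumed stand-in for the configured redaction regex.
REDACTION_PATTERN = re.compile(r"(?i)secret|password|token")

def redact(conf, kvs):
    """Redact values whose keys match the pattern; pass through when conf is None."""
    if conf is None:
        # No active session/conf available: don't redact rather than fail.
        return kvs
    return [
        (k, "*********(redacted)" if REDACTION_PATTERN.search(k) else v)
        for k, v in kvs
    ]

print(redact(None, [("spark.password", "hunter2")]))
print(redact({}, [("spark.password", "hunter2"), ("spark.app.name", "demo")]))
```

The design choice is a graceful degradation: in the corner case where redaction runs outside an active session, leaking a value into debug output is considered preferable to crashing with `None.get`.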
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19122 Merged build finished. Test PASSed.
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19122 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81387/ Test PASSed.
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19122 **[Test build #81387 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81387/testReport)** for PR 19122 at commit [`be2f3d0`](https://github.com/apache/spark/commit/be2f3d0ec50db4730c9e3f9a813a4eb96889f5b6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19108: [SPARK-21898][ML] Feature parity for KolmogorovSmirnovTe...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19108 cc @yanboliang Thanks!
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19122 **[Test build #81387 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81387/testReport)** for PR 19122 at commit [`be2f3d0`](https://github.com/apache/spark/commit/be2f3d0ec50db4730c9e3f9a813a4eb96889f5b6).
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19122 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81386/ Test PASSed.
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19122 **[Test build #81386 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81386/testReport)** for PR 19122 at commit [`57cf534`](https://github.com/apache/spark/commit/57cf53473e5bfb75095b0e519457dbdc973f3300). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class HasParallelism(Params):` * `class CrossValidator(Estimator, ValidatorParams, HasParallelism, MLReadable, MLWritable):` * `class TrainValidationSplit(Estimator, ValidatorParams, HasParallelism, MLReadable, MLWritable):`
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19122 Merged build finished. Test PASSed.
[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19122#discussion_r136850665 --- Diff: python/pyspark/ml/tuning.py --- @@ -255,18 +257,23 @@ def _fit(self, dataset): randCol = self.uid + "_rand" df = dataset.select("*", rand(seed).alias(randCol)) metrics = [0.0] * numModels + +pool = ThreadPool(processes=min(self.getParallelism(), numModels)) + for i in range(nFolds): validateLB = i * h validateUB = (i + 1) * h condition = (df[randCol] >= validateLB) & (df[randCol] < validateUB) -validation = df.filter(condition) +validation = df.filter(condition).cache() --- End diff -- This may need some discussion. Currently the PySpark implementation caches neither the `train dataset` nor the `validation dataset`, while the Scala impl caches both of them. I would prefer to cache the `validation dataset` but not the `train dataset`: the `validation dataset` is only `1/numFolds` of the input dataset, so it deserves caching; otherwise it will scan the input dataset again. The `train dataset`, by contrast, is `(numFolds - 1)/numFolds` of the input dataset; we can generate it by scanning the input dataset directly without slowing down too much. @BryanCutler @MLnick What do you think about it? Thanks!
[GitHub] spark issue #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluation for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19122 **[Test build #81386 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81386/testReport)** for PR 19122 at commit [`57cf534`](https://github.com/apache/spark/commit/57cf53473e5bfb75095b0e519457dbdc973f3300).
[GitHub] spark pull request #19122: [SPARK-21911][ML][PySpark] Parallel Model Evaluat...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19122 [SPARK-21911][ML][PySpark] Parallel Model Evaluation for ML Tuning in PySpark ## What changes were proposed in this pull request? Add parallelism support for ML tuning in pyspark. ## How was this patch tested? Test updated. You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark par-ml-tuning-py Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19122.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19122 commit 57cf53473e5bfb75095b0e519457dbdc973f3300 Author: WeichenXu Date: 2017-09-04T16:03:55Z init pr
[GitHub] spark issue #16611: [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16611 Merged build finished. Test PASSed.
[GitHub] spark issue #16611: [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16611 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81385/ Test PASSed.
[GitHub] spark issue #16611: [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16611 **[Test build #81385 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81385/testReport)** for PR 16611 at commit [`4c1a012`](https://github.com/apache/spark/commit/4c1a012e5cad648e81797ec494f44392189560ce). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19117: [SPARK-21904] [SQL] Rename tempTables to tempViews in Se...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19117 Merged build finished. Test FAILed.
[GitHub] spark issue #19117: [SPARK-21904] [SQL] Rename tempTables to tempViews in Se...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19117 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81384/ Test FAILed.
[GitHub] spark issue #19117: [SPARK-21904] [SQL] Rename tempTables to tempViews in Se...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19117 **[Test build #81384 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81384/testReport)** for PR 19117 at commit [`02815e7`](https://github.com/apache/spark/commit/02815e7faae23a32b04c7af08c826f4428c60f5c). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19119: [SPARK-21845] [SQL] Make codegen fallback of expressions...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19119 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81383/ Test PASSed.
[GitHub] spark issue #19119: [SPARK-21845] [SQL] Make codegen fallback of expressions...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19119 Merged build finished. Test PASSed.
[GitHub] spark issue #19119: [SPARK-21845] [SQL] Make codegen fallback of expressions...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19119 **[Test build #81383 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81383/testReport)** for PR 19119 at commit [`b96da49`](https://github.com/apache/spark/commit/b96da49aa0893f8bf34da2a2c111499fdbad7b5a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18875 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81382/ Test PASSed.
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18875 Merged build finished. Test PASSed.
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18875 **[Test build #81382 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81382/testReport)** for PR 18875 at commit [`3ebbe67`](https://github.com/apache/spark/commit/3ebbe67e059dfb6a004ff50f3c661f6319d616b8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16611: [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16611 **[Test build #81385 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81385/testReport)** for PR 16611 at commit [`4c1a012`](https://github.com/apache/spark/commit/4c1a012e5cad648e81797ec494f44392189560ce).