[GitHub] spark issue #20330: [SPARK-23121][core] Fix for ui becoming unaccessible for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20330 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20330: [SPARK-23121][core] Fix for ui becoming unaccessible for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20330 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86417/ Test PASSed.
[GitHub] spark issue #20330: [SPARK-23121][core] Fix for ui becoming unaccessible for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20330 **[Test build #86417 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86417/testReport)** for PR 20330 at commit [`c733ac9`](https://github.com/apache/spark/commit/c733ac90c29b54c52142f787fbeb91648d8dc698).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20120: [SPARK-22926] [SQL] Respect table-level conf comp...
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/20120
[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20087
[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20087 LGTM Thanks! Merged to master/2.3
[GitHub] spark pull request #20336: [SPARK-23165][DOC] Spelling mistake fix in quick-...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20336
[GitHub] spark issue #20336: [SPARK-23165][DOC] Spelling mistake fix in quick-start d...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20336 LGTM Thanks! Merged to master/2.3
[GitHub] spark issue #20336: [SPARK-23165][DOC] Spelling mistake fix in quick-start d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20336 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86419/ Test PASSed.
[GitHub] spark issue #20336: [SPARK-23165][DOC] Spelling mistake fix in quick-start d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20336 Merged build finished. Test PASSed.
[GitHub] spark issue #20336: [SPARK-23165][DOC] Spelling mistake fix in quick-start d...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20336 **[Test build #86419 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86419/testReport)** for PR 20336 at commit [`a7471e4`](https://github.com/apache/spark/commit/a7471e4acb7d8967fef37a8055e9b329dfbbee04).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20249: [SPARK-23057][SPARK-19235][SQL] SET LOCATION should chan...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20249 **[Test build #86420 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86420/testReport)** for PR 20249 at commit [`90c4980`](https://github.com/apache/spark/commit/90c49809886e2f487dc4c4dc6ba45aa16bae8933).
[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20087 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86415/ Test PASSed.
[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20087 Merged build finished. Test PASSed.
[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20087 **[Test build #86415 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86415/testReport)** for PR 20087 at commit [`118f788`](https://github.com/apache/spark/commit/118f7880bdcf26ba7394a2cc7fac2e0eae707d6f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20249: [SPARK-23057][SPARK-19235][SQL] SET LOCATION should chan...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20249 add to whitelist
[GitHub] spark issue #20249: [SPARK-23057][SPARK-19235][SQL] SET LOCATION should chan...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20249 ok to test
[GitHub] spark issue #20336: [SPARK-23165][DOC] Spelling mistake fix in quick-start d...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20336 **[Test build #86419 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86419/testReport)** for PR 20336 at commit [`a7471e4`](https://github.com/apache/spark/commit/a7471e4acb7d8967fef37a8055e9b329dfbbee04).
[GitHub] spark issue #20336: [SPARK-23165][DOC] Spelling mistake fix in quick-start d...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20336 retest this please
[GitHub] spark issue #20333: [SPARK-23087][SQL] CheckCartesianProduct too restrictive...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20333 LGTM except one minor comment.
[GitHub] spark pull request #20333: [SPARK-23087][SQL] CheckCartesianProduct too rest...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20333#discussion_r162794983

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala ---
@@ -274,4 +274,18 @@ class DataFrameJoinSuite extends QueryTest with SharedSQLContext {
     checkAnswer(innerJoin, Row(1) :: Nil)
   }

+  test("SPARK-23087: don't throw Analysis Exception in CheckCartesianProduct when join condition " +
+    "is false or null") {
+    val df = spark.range(10)
--- End diff --

> `withSQLConf(CROSS_JOINS_ENABLED.key -> "true") {`
[GitHub] spark pull request #20333: [SPARK-23087][SQL] CheckCartesianProduct too rest...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20333#discussion_r162794939

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1108,15 +1108,19 @@ object CheckCartesianProducts extends Rule[LogicalPlan] with PredicateHelper {
    */
   def isCartesianProduct(join: Join): Boolean = {
     val conditions = join.condition.map(splitConjunctivePredicates).getOrElse(Nil)
-    !conditions.map(_.references).exists(refs => refs.exists(join.left.outputSet.contains)
-      && refs.exists(join.right.outputSet.contains))
+
+    conditions match {
+      case Seq(Literal.FalseLiteral) | Seq(Literal(null, BooleanType)) => false
+      case _ => !conditions.map(_.references).exists(refs =>
+        refs.exists(join.left.outputSet.contains) && refs.exists(join.right.outputSet.contains))
+    }
   }

   def apply(plan: LogicalPlan): LogicalPlan = if (SQLConf.get.crossJoinEnabled) {
     plan
   } else plan transform {
-    case j @ Join(left, right, Inner | LeftOuter | RightOuter | FullOuter, condition)
+    case j @ Join(left, right, Inner | LeftOuter | RightOuter | FullOuter, _)
--- End diff --

Yeah. For outer join, it makes sense to remove this check
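The check in the diff above can be modeled outside Spark in a few lines of plain Python (illustrative only: Catalyst's `Literal` and attribute-reference machinery is reduced here to hypothetical strings and sets, and `is_cartesian_product` is not a real Spark API):

```python
# Simplified model of the patched isCartesianProduct check: a join is flagged
# as a cartesian product only if no conjunct of the condition references both
# sides, EXCEPT when the whole condition is literally false or null (such a
# join cannot blow up, so it is allowed through).

def is_cartesian_product(conjuncts, left_cols, right_cols):
    # A conjunct is either the sentinel string "false"/"null" (a literal
    # condition) or a set of column names it references.
    if conjuncts == ["false"] or conjuncts == ["null"]:
        return False  # literal false/null condition: not a cartesian product

    def touches_both(refs):
        # True when one conjunct constrains columns from both join sides.
        return any(c in left_cols for c in refs) and any(c in right_cols for c in refs)

    return not any(touches_both(refs) for refs in conjuncts
                   if not isinstance(refs, str))

# No condition at all -> cartesian product; a cross-side equi-join -> not.
assert is_cartesian_product([], {"a"}, {"b"})
assert not is_cartesian_product([{"a", "b"}], {"a"}, {"b"})
assert not is_cartesian_product(["false"], {"a"}, {"b"})
```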
[GitHub] spark issue #20333: [SPARK-23087][SQL] CheckCartesianProduct too restrictive...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20333 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/59/ Test PASSed.
[GitHub] spark issue #20333: [SPARK-23087][SQL] CheckCartesianProduct too restrictive...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20333 Merged build finished. Test PASSed.
[GitHub] spark issue #20333: [SPARK-23087][SQL] CheckCartesianProduct too restrictive...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20333 **[Test build #86418 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86418/testReport)** for PR 20333 at commit [`a4a6ac8`](https://github.com/apache/spark/commit/a4a6ac89e44c743a0471b01a0c499accec71cf73).
[GitHub] spark pull request #20333: [SPARK-23087][SQL] CheckCartesianProduct too rest...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20333#discussion_r162793942

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1108,15 +1108,19 @@ object CheckCartesianProducts extends Rule[LogicalPlan] with PredicateHelper {
    */
   def isCartesianProduct(join: Join): Boolean = {
     val conditions = join.condition.map(splitConjunctivePredicates).getOrElse(Nil)
-    !conditions.map(_.references).exists(refs => refs.exists(join.left.outputSet.contains)
-      && refs.exists(join.right.outputSet.contains))
+
+    conditions match {
+      case Seq(Literal.FalseLiteral) | Seq(Literal(null, BooleanType)) => false
+      case _ => !conditions.map(_.references).exists(refs =>
+        refs.exists(join.left.outputSet.contains) && refs.exists(join.right.outputSet.contains))
+    }
   }

   def apply(plan: LogicalPlan): LogicalPlan = if (SQLConf.get.crossJoinEnabled) {
     plan
   } else plan transform {
-    case j @ Join(left, right, Inner | LeftOuter | RightOuter | FullOuter, condition)
+    case j @ Join(left, right, Inner | LeftOuter | RightOuter | FullOuter, _)
--- End diff --

Why are you saying that the size of the result set is the same? Suppose a relation A (of size n, say 1M rows) is in an outer join with a relation B (of size m, say 1M rows). If the condition is true, the output relation has 1M * 1M rows (i.e. n * m); if the condition is false, the result is 1M rows (n) for a left join, 1M (m) for a right join, and 1M + 1M (n + m) for a full outer join. Therefore the size is not the same at all. But maybe you meant something different, am I missing something?
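The row counts argued in that comment can be sanity-checked with a small pure-Python model of a nested-loop join (not Spark code; `outer_join_size` is a hypothetical helper written just for this illustration):

```python
def outer_join_size(n, m, cond_true, how):
    """Row count of joining relations of sizes n and m under an always-true
    or always-false join condition, using nested-loop join semantics."""
    matches = n * m if cond_true else 0  # pairs satisfying the condition
    if how == "inner":
        return matches
    if how == "left":   # unmatched left rows survive, padded with nulls
        return matches if cond_true else n
    if how == "right":  # unmatched right rows survive, padded with nulls
        return matches if cond_true else m
    if how == "full":   # unmatched rows from both sides survive
        return matches if cond_true else n + m
    raise ValueError(f"unknown join type: {how}")

# The 1M x 1M example from the comment above:
n = m = 1_000_000
assert outer_join_size(n, m, True, "inner") == n * m   # condition true: n * m
assert outer_join_size(n, m, False, "left") == n       # condition false: n
assert outer_join_size(n, m, False, "right") == m      # condition false: m
assert outer_join_size(n, m, False, "full") == n + m   # condition false: n + m
```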
[GitHub] spark issue #20330: [SPARK-23121][core] Fix for ui becoming unaccessible for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20330 **[Test build #86417 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86417/testReport)** for PR 20330 at commit [`c733ac9`](https://github.com/apache/spark/commit/c733ac90c29b54c52142f787fbeb91648d8dc698).
[GitHub] spark pull request #20330: [SPARK-23121][core] Fix for ui becoming unaccessi...
Github user smurakozi commented on a diff in the pull request: https://github.com/apache/spark/pull/20330#discussion_r162792383

--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala ---
@@ -1002,4 +1000,12 @@ private object ApiHelper {
   }
 }

+  def lastStageNameAndDescription(store: AppStatusStore, job: JobData): (String, String) = {
+    store.asOption(store.lastStageAttempt(job.stageIds.max)) match {
+      case Some(lastStageAttempt) =>
+        (lastStageAttempt.name, lastStageAttempt.description.getOrElse(job.name))
+      case None => ("", "")
--- End diff --

Fixed, thanks for catching.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86414/ Test PASSed.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Merged build finished. Test PASSed.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20208 **[Test build #86414 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86414/testReport)** for PR 20208 at commit [`29c281d`](https://github.com/apache/spark/commit/29c281dbe3c6f63614d9abc286c68e283786649b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20331: [SPARK-23158] [SQL] Move HadoopFsRelationTest test suite...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20331 cc @cloud-fan
[GitHub] spark issue #20331: [SPARK-23158] [SQL] Move HadoopFsRelationTest test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20331 Merged build finished. Test PASSed.
[GitHub] spark issue #20331: [SPARK-23158] [SQL] Move HadoopFsRelationTest test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20331 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86412/ Test PASSed.
[GitHub] spark issue #20331: [SPARK-23158] [SQL] Move HadoopFsRelationTest test suite...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20331 **[Test build #86412 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86412/testReport)** for PR 20331 at commit [`9c85b18`](https://github.com/apache/spark/commit/9c85b18c059e4ab3b4b25a5b2e414b4f0c67072f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Merged build finished. Test PASSed.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86413/ Test PASSed.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20208 **[Test build #86413 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86413/testReport)** for PR 20208 at commit [`e1d6f2a`](https://github.com/apache/spark/commit/e1d6f2a5ba0cae28b0ce4ed3612429a593828c0f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20325: [SPARK-22808][DOCS] add insertInto when save hive...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20325#discussion_r162790779

--- Diff: docs/sql-programming-guide.md ---
@@ -580,6 +580,9 @@ default local Hive metastore (using Derby) for you. Unlike the `createOrReplaceT
 Hive metastore. Persistent tables will still exist even after your Spark program has restarted, as long as you maintain your connection to the same metastore. A DataFrame for a persistent table can be created by calling the `table` method on a `SparkSession` with the name of the table.

+Notice that for `DataFrames` is built on Hive table, `insertInto` should be used instead of `saveAsTable`.
--- End diff --

This limitation is lifted in Spark 2.2. See https://issues.apache.org/jira/browse/SPARK-19152
[GitHub] spark issue #19992: [SPARK-22805][CORE] Use StorageLevel aliases in event lo...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/19992 @squito I think it's fine to just close the PR/JIRA issue.
[GitHub] spark issue #20336: [SPARK-23165][DOC] Spelling mistake fix in quick-start d...
Github user ashashwat commented on the issue: https://github.com/apache/spark/pull/20336 Retest this please.
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19993 **[Test build #86416 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86416/testReport)** for PR 19993 at commit [`d9d25b0`](https://github.com/apache/spark/commit/d9d25b0f0bcf365366c0c13daf882cbea86d3835).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19993 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86416/ Test FAILed.
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19993 Merged build finished. Test FAILed.
[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20087 **[Test build #86415 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86415/testReport)** for PR 20087 at commit [`118f788`](https://github.com/apache/spark/commit/118f7880bdcf26ba7394a2cc7fac2e0eae707d6f).
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19993 **[Test build #86416 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86416/testReport)** for PR 19993 at commit [`d9d25b0`](https://github.com/apache/spark/commit/d9d25b0f0bcf365366c0c13daf882cbea86d3835).
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19993 Merged build finished. Test PASSed.
[GitHub] spark issue #19993: [SPARK-22799][ML] Bucketizer should throw exception if s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19993 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/58/ Test PASSed.
[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20087 retest this please
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20208 **[Test build #86414 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86414/testReport)** for PR 20208 at commit [`29c281d`](https://github.com/apache/spark/commit/29c281dbe3c6f63614d9abc286c68e283786649b).
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/57/ Test PASSed.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Merged build finished. Test PASSed.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Merged build finished. Test PASSed.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/56/ Test PASSed.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20208 **[Test build #86413 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86413/testReport)** for PR 20208 at commit [`e1d6f2a`](https://github.com/apache/spark/commit/e1d6f2a5ba0cae28b0ce4ed3612429a593828c0f).
[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r162786270

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala ---
@@ -0,0 +1,436 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.File
+
+import org.apache.spark.sql.{QueryTest, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
+
+/**
+ * Schema can evolve in several ways and the followings are supported in file-based data sources.
+ *
+ * 1. Add a column
+ * 2. Remove a column
+ * 3. Change a column position
+ * 4. Change a column type
+ *
+ * Here, we consider safe evolution without data loss. For example, data type evolution should be
+ * from small types to larger types like `int`-to-`long`, not vice versa.
+ *
+ * So far, file-based data sources have schema evolution coverages like the followings.
+ *
+ * | File Format | Coverage   | Note                                                  |
+ * | ----------- | ---------- | ----------------------------------------------------- |
+ * | TEXT        | N/A        | Schema consists of a single string column.            |
+ * | CSV         | 1, 2, 4    |                                                       |
+ * | JSON        | 1, 2, 3, 4 |                                                       |
+ * | ORC         | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage. |
+ * | PARQUET     | 1, 2, 3    |                                                       |
+ *
+ * This aims to provide an explicit test coverage for schema evolution on file-based data sources.
+ * Since a file format has its own coverage of schema evolution, we need a test suite
+ * for each file-based data source with corresponding supported test case traits.
+ *
+ * The following is a hierarchy of test traits.
+ *
+ * SchemaEvolutionTest
+ *   -> AddColumnEvolutionTest
+ *   -> RemoveColumnEvolutionTest
+ *   -> ChangePositionEvolutionTest
+ *   -> BooleanTypeEvolutionTest
+ *   -> IntegralTypeEvolutionTest
+ *   -> ToDoubleTypeEvolutionTest
+ *   -> ToDecimalTypeEvolutionTest
+ */
+
+trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext {
+  val format: String
+  val options: Map[String, String] = Map.empty[String, String]
+}
+
+/**
+ * Add column.
+ * This test suite assumes that the missing column should be `null`.
+ */
+trait AddColumnEvolutionTest extends SchemaEvolutionTest {
+  import testImplicits._
+
+  test("append column at the end") {
+    withTempDir { dir =>
+      val path = dir.getCanonicalPath
+
+      val df1 = Seq("a", "b").toDF("col1")
+      val df2 = df1.withColumn("col2", lit("x"))
+      val df3 = df2.withColumn("col3", lit("y"))
+
+      val dir1 = s"$path${File.separator}part=one"
+      val dir2 = s"$path${File.separator}part=two"
+      val dir3 = s"$path${File.separator}part=three"
+
+      df1.write.mode("overwrite").format(format).options(options).save(dir1)
+      df2.write.mode("overwrite").format(format).options(options).save(dir2)
+      df3.write.mode("overwrite").format(format).options(options).save(dir3)
+
+      val df = spark.read
+        .schema(df3.schema)
+        .format(format)
+        .options(options)
+        .load(path)
+
+      checkAnswer(df, Seq(
+        Row("a", null, null, "one"),
+        Row("b", null, null, "one"),
+        Row("a", "x", null, "two"),
+        Row("b", "x", null, "two"),
+        Row("a", "x", "y", "three"),
+        Row("b", "x", "y", "three")))
+    }
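The add-column semantics exercised by the suite above — older files read back through the newest schema, with missing columns coming back as null — can be modeled in plain Python (illustrative only, not Spark code; `read_with_schema` is a hypothetical helper, and the partition column is omitted for brevity):

```python
# Minimal model of "add a column" schema evolution: three partitions were
# written as the schema grew, then everything is read back with the newest
# schema; columns absent from older data come back as None (null).

def read_with_schema(rows, schema):
    """Project each record (a dict) onto `schema`, filling absent columns with None."""
    return [tuple(row.get(col) for col in schema) for row in rows]

# Three "partitions" written with progressively wider schemas, as in the test.
part_one   = [{"col1": "a"}, {"col1": "b"}]
part_two   = [{"col1": "a", "col2": "x"}, {"col1": "b", "col2": "x"}]
part_three = [{"col1": "a", "col2": "x", "col3": "y"},
              {"col1": "b", "col2": "x", "col3": "y"}]

schema = ["col1", "col2", "col3"]  # the newest (widest) schema
merged = read_with_schema(part_one + part_two + part_three, schema)

assert merged[0] == ("a", None, None)  # oldest data: col2/col3 are null
assert merged[2] == ("a", "x", None)   # middle data: col3 is null
assert merged[4] == ("a", "x", "y")    # newest data: fully populated
```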
[GitHub] spark issue #20331: [SPARK-23158] [SQL] Move HadoopFsRelationTest test suite...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20331 **[Test build #86412 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86412/testReport)** for PR 20331 at commit [`9c85b18`](https://github.com/apache/spark/commit/9c85b18c059e4ab3b4b25a5b2e414b4f0c67072f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20331: [SPARK-23158] [SQL] Move HadoopFsRelationTest test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20331 Merged build finished. Test PASSed.
[GitHub] spark issue #20331: [SPARK-23158] [SQL] Move HadoopFsRelationTest test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20331 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/55/ Test PASSed.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20208 Thank you for the review, @HyukjinKwon. I'll update it like that.
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20204 Will merge this one if there are no more comments in a few days.
[GitHub] spark pull request #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20204#discussion_r162785336 --- Diff: python/run-tests-with-coverage ---
@@ -0,0 +1,69 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+set -o pipefail
+set -e
+
+# This variable indicates which coverage executable to run to combine coverage data
+# and generate HTML reports, for example, 'coverage3' in Python 3.
+COV_EXEC="${COV_EXEC:-coverage}"
+FWDIR="$(cd "`dirname $0`"; pwd)"
+pushd "$FWDIR" > /dev/null
+
+# Ensure that the coverage executable is installed.
+if ! hash $COV_EXEC 2>/dev/null; then
+  echo "Missing coverage executable in your path, skipping PySpark coverage"
+  exit 1
+fi
+
+# Set up the directories for coverage results.
+export COVERAGE_DIR="$FWDIR/test_coverage"
+rm -fr "$COVERAGE_DIR/coverage_data"
+rm -fr "$COVERAGE_DIR/htmlcov"
+mkdir -p "$COVERAGE_DIR/coverage_data"
+
+# The current directory is added to the Python path so that it doesn't refer to our
+# built PySpark zip library first.
+export PYTHONPATH="$FWDIR:$PYTHONPATH"
+# Also, our sitecustomize.py and coverage_daemon.py are included in the path.
+export PYTHONPATH="$COVERAGE_DIR:$PYTHONPATH"
+
+# We use the 'spark.python.daemon.module' configuration to insert the coverage-enabled workers.
+export SPARK_CONF_DIR="$COVERAGE_DIR/conf"
+
+# This environment variable enables the coverage.
+export COVERAGE_PROCESS_START="$FWDIR/.coveragerc"
+
+# If you'd like to run a specific unittest class, you can do so as follows:
+# SPARK_TESTING=1 ../bin/pyspark pyspark.sql.tests VectorizedUDFTests
+./run-tests "$@"
--- End diff --
Another tip: if we use `../bin/pyspark` here, run some simple tests, and then exit, it still seems to produce the coverage correctly.
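The script above relies on `COVERAGE_PROCESS_START` together with a `sitecustomize.py` placed on `PYTHONPATH` so that coverage starts in every Python process, including the PySpark worker daemons. A minimal sketch of that hook is below — `coverage.process_startup()` is the real coverage.py entry point for this, but the helper name `maybe_start_coverage` and its structure are illustrative assumptions, not the actual Spark file:

```python
import os


def maybe_start_coverage(environ=os.environ):
    """Start coverage measurement in this process if configured.

    Returns True if coverage was started, False otherwise. A real
    sitecustomize.py would simply call this at import time, so that
    any process inheriting the environment gets measured.
    """
    # coverage.py only auto-starts when COVERAGE_PROCESS_START points
    # at a .coveragerc file, so bail out early if it is unset.
    if "COVERAGE_PROCESS_START" not in environ:
        return False
    try:
        import coverage
    except ImportError:
        # coverage is not installed; measurement is simply skipped.
        return False
    coverage.process_startup()
    return True


# In a process launched without the env var, this is a no-op.
started = maybe_start_coverage({})
```

Because the hook checks the environment before importing anything, leaving `COVERAGE_PROCESS_START` unset keeps normal runs completely unaffected.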
[GitHub] spark issue #20336: [SPARK-23165][DOC] Spelling mistake fix in quick-start d...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20336 **[Test build #4069 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4069/testReport)** for PR 20336 at commit [`785fccf`](https://github.com/apache/spark/commit/785fccff1c35f93fc479d460b527bbb6fcfc00a7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20336: [SPARK-23165][DOC] Spelling mistake fix in quick-start d...
Github user ashashwat commented on the issue: https://github.com/apache/spark/pull/20336 @srowen Let me go ahead and do that.
[GitHub] spark issue #20336: [SPARK-23165][DOC] Spelling mistake fix in quick-start d...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20336 **[Test build #4069 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4069/testReport)** for PR 20336 at commit [`785fccf`](https://github.com/apache/spark/commit/785fccff1c35f93fc479d460b527bbb6fcfc00a7).
[GitHub] spark issue #20336: [SPARK-23165][DOC] Spelling mistake fix in quick-start d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20336 Can one of the admins verify this patch?
[GitHub] spark pull request #20336: [SPARK-23165][DOC] Spelling mistake fix in quick-...
GitHub user ashashwat opened a pull request: https://github.com/apache/spark/pull/20336 [SPARK-23165][DOC] Spelling mistake fix in quick-start doc. ## What changes were proposed in this pull request? Fix spelling in quick-start doc. ## How was this patch tested? Doc only. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ashashwat/spark SPARK-23165 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20336.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20336 commit 785fccff1c35f93fc479d460b527bbb6fcfc00a7 Author: Shashwat Anand Date: 2018-01-20T14:50:44Z [SPARK-23165][DOC] Spelling mistake fix in quick-start doc.
[GitHub] spark issue #20335: [SPARK-23088][CORE] History server not showing incomplet...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20335 Can one of the admins verify this patch?
[GitHub] spark pull request #20335: [SPARK-23088][CORE] History server not showing in...
GitHub user pmackles opened a pull request: https://github.com/apache/spark/pull/20335 [SPARK-23088][CORE] History server not showing incomplete/running applications ## What changes were proposed in this pull request? History server not showing incomplete/running applications when the spark.history.ui.maxApplications property is set to a value that is smaller than the total number of applications. ## How was this patch tested? Verified manually against master and 2.2.2 branch. You can merge this pull request into a Git repository by running: $ git pull https://github.com/pmackles/spark SPARK-23088 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20335.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20335 commit d94042d81d9f982ee58aab1b3296d33b10d50a75 Author: Paul Mackles Date: 2018-01-20T13:53:46Z [SPARK-23088][CORE] History server not showing incomplete/running applications
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20295 To me, this seems roughly fine. > Alternatively, we can implement a new serialization protocol for the GROUP_MAP eval type, i.e., instead of sending an arrow batch, we could send a group row and then an arrow batch. I don't have a strong preference on this.
[GitHub] spark pull request #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18277#discussion_r162782791 --- Diff: python/pyspark/rdd.py ---
@@ -751,7 +751,7 @@ def func(iterator):
         def pipe_objs(out):
             for obj in iterator:
-                s = str(obj).rstrip('\n') + '\n'
+                s = unicode(obj).rstrip('\n') + '\n'
--- End diff --
@chaoslawful, if you are active, we could change `'\n'` to `u'\n'` to reduce the conversion and not rely on the implicit conversion between `str` and `unicode`.
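The one-line change above swaps `str(obj)` for `unicode(obj)` so that non-ASCII objects survive the conversion when piped to an external process on Python 2. A Python 3 sketch of the same `pipe_objs` idea, with the encoding made explicit instead of implicit — the `BytesIO` stand-in for the subprocess pipe is an assumption for illustration:

```python
import io


def pipe_objs(iterator, out):
    # Convert each object via str() (unicode-safe in Python 3) and encode
    # explicitly to UTF-8 before writing to the binary pipe, instead of
    # relying on the implicit str/unicode coercion the Python 2 code used.
    for obj in iterator:
        s = str(obj).rstrip('\n') + '\n'
        out.write(s.encode('utf-8'))


# Simulate the pipe with an in-memory binary buffer; a non-ASCII string
# and a non-string object both round-trip without encode errors.
buf = io.BytesIO()
pipe_objs(["caf\u00e9", 42], buf)
```

Making the encoding explicit at the write site is the Python 3 idiom that removes the whole class of `UnicodeEncodeError` bugs this PR fixes.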
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18277 Let me merge this one in a few days if there are no more comments.
[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20087 Merged build finished. Test FAILed.
[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20087 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86411/ Test FAILed.
[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20087 **[Test build #86411 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86411/testReport)** for PR 20087 at commit [`118f788`](https://github.com/apache/spark/commit/118f7880bdcf26ba7394a2cc7fac2e0eae707d6f). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20334: How to check registered table name.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20334 Hey @AtulKumVerma, questions should usually go to the mailing list. See http://spark.apache.org/community.html. I believe you can get a better answer there. A pull request from one branch to another actually causes a slight visual problem. Would you mind closing this pull request, please?
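As an aside on the question itself — listing DataFrames registered as temporary views — the Catalog API is the usual answer. A hedged PySpark sketch: `spark.catalog.listTables()` is the real API, but the session setup and view name here are illustrative assumptions, and this requires a local Spark installation to run:

```python
from pyspark.sql import SparkSession

# Build a local session purely for illustration.
spark = SparkSession.builder.master("local[1]").appName("list-views").getOrCreate()

# Register a DataFrame as a temporary view, then enumerate what the
# catalog knows about it.
spark.range(3).createOrReplaceTempView("my_temp_view")
for table in spark.catalog.listTables():
    # Each entry reports its name and whether it is a temporary view.
    print(table.name, table.isTemporary)

spark.stop()
```

The catalog entries also carry the database and table type, so the same call distinguishes temporary views from persistent tables.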
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20208 cc @sameeragarwal for reviewing too. I vaguely remember we had a talk about this before.
[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r162781286 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala ---
[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r162781551 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala ---
+/**
+ * Add column.
--- End diff --
Shall we leave the number given above in this comment like `(case 1.)`.
[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r162781325 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala ---
[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r162781308 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala ---
[GitHub] spark issue #20334: How to check registered table name.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20334 Can one of the admins verify this patch?
[GitHub] spark pull request #20334: How to check registered table name.
GitHub user AtulKumVerma opened a pull request: https://github.com/apache/spark/pull/20334 How to check registered table name.

Dear fellows, I want to know how I can see all the datasets or DataFrames registered as temporary tables or views in the SQL context. I have read that catalyst is responsible for maintaining a one-to-one mapping between a DataFrame and its temporary table name, and I just want to list them all from catalyst. Your response is highly appreciated. Thanks all in advance.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/apache/spark branch-2.3 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20334.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20334

commit 5244aafc2d7945c11c96398b8d5b752b45fd148c Author: Xianjin YE Date: 2018-01-02T15:30:38Z [SPARK-22897][CORE] Expose stageAttemptId in TaskContext ## What changes were proposed in this pull request? stageAttemptId is added in TaskContext, with the corresponding construction modifications. ## How was this patch tested? Added a new test in TaskContextSuite; two cases are tested: 1. Normal case without failure 2. Exception case with resubmitted stages Link to [SPARK-22897](https://issues.apache.org/jira/browse/SPARK-22897) Author: Xianjin YE Closes #20082 from advancedxy/SPARK-22897. (cherry picked from commit a6fc300e91273230e7134ac6db95ccb4436c6f8f) Signed-off-by: Wenchen Fan

commit b96a2132413937c013e1099be3ec4bc420c947fd Author: Juliusz Sompolski Date: 2018-01-03T13:40:51Z [SPARK-22938] Assert that SQLConf.get is accessed only on the driver. ## What changes were proposed in this pull request? Assert if code tries to access SQLConf.get on an executor. This can lead to hard-to-detect bugs, where the executor will read fallbackConf, falling back to default config values and ignoring potentially changed non-default configs.
If a config is to be passed to executor code, it needs to be read on the driver, and passed explicitly. ## How was this patch tested? Check in existing tests. Author: Juliusz Sompolski Closes #20136 from juliuszsompolski/SPARK-22938. (cherry picked from commit 247a08939d58405aef39b2a4e7773aa45474ad12) Signed-off-by: Wenchen Fan commit a05e85ecb76091567a26a3a14ad0879b4728addc Author: gatorsmile Date: 2018-01-03T14:09:30Z [SPARK-22934][SQL] Make optional clauses order insensitive for CREATE TABLE SQL statement ## What changes were proposed in this pull request? Currently, our CREATE TABLE syntax require the EXACT order of clauses. It is pretty hard to remember the exact order. Thus, this PR is to make optional clauses order insensitive for `CREATE TABLE` SQL statement. ``` CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name1 col_type1 [COMMENT col_comment1], ...)] USING datasource [OPTIONS (key1=val1, key2=val2, ...)] [PARTITIONED BY (col_name1, col_name2, ...)] [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS] [LOCATION path] [COMMENT table_comment] [TBLPROPERTIES (key1=val1, key2=val2, ...)] [AS select_statement] ``` The proposal is to make the following clauses order insensitive. ``` [OPTIONS (key1=val1, key2=val2, ...)] [PARTITIONED BY (col_name1, col_name2, ...)] [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS] [LOCATION path] [COMMENT table_comment] [TBLPROPERTIES (key1=val1, key2=val2, ...)] ``` The same idea is also applicable to Create Hive Table. ``` CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name1[:] col_type1 [COMMENT col_comment1], ...)] [COMMENT table_comment] [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)] [ROW FORMAT row_format] [STORED AS file_format] [LOCATION path] [TBLPROPERTIES (key1=val1, key2=val2, ...)] [AS select_statement] ``` The proposal is to make the following clauses order insensitive. 
``` [COMMENT table_comment] [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)] [ROW FORMAT row_format] [STORED AS file_format] [LOCATION path] [TBLPROPERTIES (key1=val1, key2=val2, ...)] ``` ## How was this patch tested? Added test cases Author: gatorsmile Closes #20133 from gatorsmile/createDataSourceTableDDL. (cherry picked from commit 1a87a1609c4d2c9027a2cf669ea3337b89f61fb6) Signed-off-by: gatorsmile commit b96
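To answer the question that opens this message: in Spark 2.x, registered temp views can be listed with `spark.catalog.listTables()` or the SQL statement `SHOW TABLES`; no pull request is needed for this. The name-to-DataFrame mapping the catalog maintains can be pictured with a toy, plain-Python model (illustrative only; `TempViewCatalog` is a made-up name, not Spark code):

```python
# Toy, plain-Python model of the catalog's temp-view registry; NOT Spark
# code. In real PySpark you would call spark.catalog.listTables() or run
# the SQL statement "SHOW TABLES" to list registered temp views.

class TempViewCatalog:
    """Keeps the one-to-one mapping between a view name and its DataFrame."""

    def __init__(self):
        self._views = {}

    def create_or_replace_temp_view(self, name, dataframe):
        # Spark treats temp-view names case-insensitively.
        self._views[name.lower()] = dataframe

    def list_tables(self):
        return sorted(self._views)


catalog = TempViewCatalog()
catalog.create_or_replace_temp_view("people", object())
catalog.create_or_replace_temp_view("Sales", object())
print(catalog.list_tables())  # ['people', 'sales']
```

In an actual Spark 2.x session, the equivalent would be `df.createOrReplaceTempView("people")` followed by `spark.catalog.listTables()`.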
[GitHub] spark pull request #20325: [SPARK-22808][DOCS] add insertInto when save hive...
Github user brandonJY commented on a diff in the pull request: https://github.com/apache/spark/pull/20325#discussion_r162781167

--- Diff: docs/sql-programming-guide.md ---
@@ -580,6 +580,9 @@
 default local Hive metastore (using Derby) for you. Unlike the `createOrReplaceT
 Hive metastore. Persistent tables will still exist even after your Spark program has restarted, as
 long as you maintain your connection to the same metastore. A DataFrame for a persistent table can
 be created by calling the `table` method on a `SparkSession` with the name of the table.
+Notice that for `DataFrames` is built on Hive table, `insertInto` should be used instead of `saveAsTable`.
--- End diff --

@gatorsmile Could you elaborate on your comment? The purpose of this sentence was to warn users to use `insertInto` when they are dealing with DataFrames created from a Hive table, since, due to https://issues.apache.org/jira/browse/SPARK-16803, `saveAsTable` will not work in that special case. Or do you have any suggestions to make it clearer?

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18906: [SPARK-21692][PYSPARK][SQL] Add nullability suppo...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18906#discussion_r162779803

--- Diff: python/pyspark/sql/tests.py ---
@@ -597,10 +597,29 @@
     def test_non_existed_udf(self):
         self.assertRaisesRegexp(AnalysisException, "Can not load class non_existed_udf",
                                 lambda: spark.udf.registerJavaFunction("udf1", "non_existed_udf"))
-        # This is to check if a deprecated 'SQLContext.registerJavaFunction' can call its alias.
--- End diff --

Seems this test is gone ...
[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20087 **[Test build #86411 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86411/testReport)** for PR 20087 at commit [`118f788`](https://github.com/apache/spark/commit/118f7880bdcf26ba7394a2cc7fac2e0eae707d6f).
[GitHub] spark pull request #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user fjh100456 commented on a diff in the pull request: https://github.com/apache/spark/pull/20087#discussion_r162779218

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala ---
@@ -0,0 +1,354 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+
+import scala.collection.JavaConverters._
+
+import org.apache.hadoop.fs.Path
+import org.apache.orc.OrcConf.COMPRESS
+import org.apache.parquet.hadoop.ParquetOutputFormat
+import org.scalatest.BeforeAndAfterAll
+
+import org.apache.spark.sql.execution.datasources.orc.OrcOptions
+import org.apache.spark.sql.execution.datasources.parquet.{ParquetOptions, ParquetTest}
+import org.apache.spark.sql.hive.orc.OrcFileOperator
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+
+class CompressionCodecSuite extends TestHiveSingleton with ParquetTest with BeforeAndAfterAll {
+  import spark.implicits._
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    (0 until maxRecordNum).toDF("a").createOrReplaceTempView("table_source")
+  }
+
+  override def afterAll(): Unit = {
+    try {
+      spark.catalog.dropTempView("table_source")
+    } finally {
+      super.afterAll()
+    }
+  }
+
+  private val maxRecordNum = 50
+
+  private def getConvertMetastoreConfName(format: String): String = format.toLowerCase match {
+    case "parquet" => HiveUtils.CONVERT_METASTORE_PARQUET.key
+    case "orc" => HiveUtils.CONVERT_METASTORE_ORC.key
+  }
+
+  private def getSparkCompressionConfName(format: String): String = format.toLowerCase match {
+    case "parquet" => SQLConf.PARQUET_COMPRESSION.key
+    case "orc" => SQLConf.ORC_COMPRESSION.key
+  }
+
+  private def getHiveCompressPropName(format: String): String = format.toLowerCase match {
+    case "parquet" => ParquetOutputFormat.COMPRESSION
+    case "orc" => COMPRESS.getAttribute
+  }
+
+  private def normalizeCodecName(format: String, name: String): String = {
+    format.toLowerCase match {
+      case "parquet" => ParquetOptions.getParquetCompressionCodecName(name)
+      case "orc" => OrcOptions.getORCCompressionCodecName(name)
+    }
+  }
+
+  private def getTableCompressionCodec(path: String, format: String): Seq[String] = {
+    val hadoopConf = spark.sessionState.newHadoopConf()
+    val codecs = format.toLowerCase match {
+      case "parquet" => for {
+        footer <- readAllFootersWithoutSummaryFiles(new Path(path), hadoopConf)
+        block <- footer.getParquetMetadata.getBlocks.asScala
+        column <- block.getColumns.asScala
+      } yield column.getCodec.name()
+      case "orc" => new File(path).listFiles().filter { file =>
+        file.isFile && !file.getName.endsWith(".crc") && file.getName != "_SUCCESS"
+      }.map { orcFile =>
+        OrcFileOperator.getFileReader(orcFile.toPath.toString).get.getCompression.toString
+      }.toSeq
+    }
+    codecs.distinct
+  }
+
+  private def createTable(
+      rootDir: File,
+      tableName: String,
+      isPartitioned: Boolean,
+      format: String,
+      compressionCodec: Option[String]): Unit = {
+    val tblProperties = compressionCodec match {
+      case Some(prop) => s"TBLPROPERTIES('${getHiveCompressPropName(format)}'='$prop')"
+      case _ => ""
+    }
+    val partitionCreate = if (isPartitioned) "PARTITIONED BY (p string)" else ""
+    sql(
+      s"""
+        |CREATE TABLE $tableName(a int)
+        |$partitionCreate
+        |STORED AS $format
+        |LOCATION '${rootDir.toURI.toString.stripSuffix("/")}/$tableName'
+        |$tblProperties
+      """.stripMargin)
+  }
+
+  private def writeDataToTable(
+      tableName: String,
+      partitionVa
[GitHub] spark issue #20203: [SPARK-22577] [core] executor page blacklist status shou...
Github user attilapiros commented on the issue: https://github.com/apache/spark/pull/20203 One more reason to run the tests in sbt/maven: in IntelliJ the complete suite was somehow successful. But the current failure seems unrelated to me, as org.apache.spark.deploy.history has 0 failures.
[GitHub] spark pull request #20091: [SPARK-22465][FOLLOWUP] Update the number of part...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/20091#discussion_r162778292

--- Diff: core/src/test/scala/org/apache/spark/rdd/PairRDDFunctionsSuite.scala ---
@@ -332,6 +331,48 @@
 class PairRDDFunctionsSuite extends SparkFunSuite with SharedSparkContext {
     assert(joined.getNumPartitions == rdd2.getNumPartitions)
   }
+
+  test("cogroup between multiple RDD when defaultParallelism is set without proper partitioner") {
+    assert(!sc.conf.contains("spark.default.parallelism"))
+    try {
+      sc.conf.set("spark.default.parallelism", "4")
+      val rdd1 = sc.parallelize((1 to 1000).map(x => (x, x)), 20)
+      val rdd2 = sc.parallelize(Array((1, 1), (1, 2), (2, 1), (3, 1)), 10)
+      val joined = rdd1.cogroup(rdd2)
+      assert(joined.getNumPartitions == sc.defaultParallelism)
+    } finally {
+      sc.conf.remove("spark.default.parallelism")
+    }
+  }
+
+  test("cogroup between multiple RDD when defaultParallelism is set with proper partitioner") {
+    assert(!sc.conf.contains("spark.default.parallelism"))
+    try {
+      sc.conf.set("spark.default.parallelism", "4")
+      val rdd1 = sc.parallelize((1 to 1000).map(x => (x, x)), 20)
+      val rdd2 = sc.parallelize(Array((1, 1), (1, 2), (2, 1), (3, 1)))
+        .partitionBy(new HashPartitioner(10))
+      val joined = rdd1.cogroup(rdd2)
+      assert(joined.getNumPartitions == rdd2.getNumPartitions)
+    } finally {
+      sc.conf.remove("spark.default.parallelism")
+    }
+  }
+
+  test("cogroup between multiple RDD when defaultParallelism is set with huge number of " +
--- End diff --

nit: "set; with huge number of partitions in upstream RDDs"
[GitHub] spark pull request #20091: [SPARK-22465][FOLLOWUP] Update the number of part...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/20091#discussion_r162778240

--- Diff: core/src/test/scala/org/apache/spark/PartitioningSuite.scala ---
@@ -284,7 +284,38 @@
 class PartitioningSuite extends SparkFunSuite with SharedSparkContext with Priva
     assert(partitioner3.numPartitions == rdd3.getNumPartitions)
     assert(partitioner4.numPartitions == rdd3.getNumPartitions)
     assert(partitioner5.numPartitions == rdd4.getNumPartitions)
   }
+
+  test("defaultPartitioner when defaultParallelism is set") {
+    assert(!sc.conf.contains("spark.default.parallelism"))
+    try {
+      sc.conf.set("spark.default.parallelism", "4")
+
+      val rdd1 = sc.parallelize((1 to 1000).map(x => (x, x)), 150)
+      val rdd2 = sc.parallelize(Array((1, 2), (2, 3), (2, 4), (3, 4)))
+        .partitionBy(new HashPartitioner(10))
+      val rdd3 = sc.parallelize(Array((1, 6), (7, 8), (3, 10), (5, 12), (13, 14)))
+        .partitionBy(new HashPartitioner(100))
+      val rdd4 = sc.parallelize(Array((1, 2), (2, 3), (2, 4), (3, 4)))
+        .partitionBy(new HashPartitioner(9))
+      val rdd5 = sc.parallelize((1 to 10).map(x => (x, x)), 11)
--- End diff --

Can we add a case where a partitioner is not used and the default (from spark.default.parallelism) gets used? For example, something like the following pseudo-code:

```
val rdd6 = sc.parallelize(Array((1, 2), (2, 3), (2, 4), (3, 4))).partitionBy(new HashPartitioner(3))
...
Partitioner.defaultPartitioner(rdd1, rdd6).numPartitions == sc.conf.get("spark.default.parallelism").toInt
```
[GitHub] spark pull request #20091: [SPARK-22465][FOLLOWUP] Update the number of part...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/20091#discussion_r162778187

--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -43,17 +43,19 @@
 object Partitioner {
   /**
    * Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
    *
-   * If any of the RDDs already has a partitioner, and the number of partitions of the
-   * partitioner is either greater than or is less than and within a single order of
-   * magnitude of the max number of upstream partitions, choose that one.
+   * If spark.default.parallelism is set, we'll use the value of SparkContext defaultParallelism
+   * as the default partitions number, otherwise we'll use the max number of upstream partitions.
    *
-   * Otherwise, we use a default HashPartitioner. For the number of partitions, if
-   * spark.default.parallelism is set, then we'll use the value from SparkContext
-   * defaultParallelism, otherwise we'll use the max number of upstream partitions.
+   * If any of the RDDs already has a partitioner, and the partitioner is an eligible one (with a
+   * partitions number that is not less than the max number of upstream partitions by an order of
+   * magnitude), or the number of partitions is larger than the default one, we'll choose the
+   * exsiting partitioner.
--- End diff --

We should rephrase this for clarity. How about: "When available, we choose the partitioner from the rdd with the maximum number of partitions. If this partitioner is eligible (its number of partitions is within an order of magnitude of the maximum number of partitions in the rdds), or it has a partition number higher than the default number of partitions, we use this partitioner."
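The rule under discussion can be sketched in plain Python. This is an illustrative model of the logic described in the quoted scaladoc, not Spark's actual `Partitioner.defaultPartitioner` implementation; `Rdd` and `default_partitioner` are hypothetical names:

```python
# Plain-Python sketch (NOT Spark code) of the partitioner-selection rule:
# prefer an existing partitioner if it is "eligible", otherwise fall back
# to a hash partitioner with the default number of partitions.

from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Rdd:
    num_partitions: int
    partitioner_partitions: Optional[int] = None  # None: no partitioner set

def default_partitioner(rdds: Sequence[Rdd],
                        default_parallelism: Optional[int] = None) -> int:
    """Return the number of partitions of the partitioner that would be chosen."""
    max_upstream = max(r.num_partitions for r in rdds)
    # Default partition number: spark.default.parallelism if set,
    # otherwise the max number of upstream partitions.
    default_num = default_parallelism if default_parallelism else max_upstream
    with_partitioner = [r for r in rdds if r.partitioner_partitions is not None]
    if with_partitioner:
        # Candidate: the existing partitioner with the most partitions.
        candidate = max(r.partitioner_partitions for r in with_partitioner)
        # Eligible if within one order of magnitude of the max upstream
        # partition count, or larger than the default partition number.
        if candidate * 10 >= max_upstream or candidate > default_num:
            return candidate
    # Otherwise, a fresh HashPartitioner with the default partition number.
    return default_num
```

With `spark.default.parallelism` modeled as `default_parallelism=4`, this reproduces the behavior the PairRDDFunctionsSuite tests above assert: a cogroup with a 10-partition partitioner keeps 10 partitions, while a cogroup with no partitioner falls back to 4.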
[GitHub] spark pull request #20146: [SPARK-11215][ML] Add multiple columns support to...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20146#discussion_r162777500

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -249,6 +249,16 @@
 object ParamValidators {
   def arrayLengthGt[T](lowerBound: Double): Array[T] => Boolean = {
     (value: Array[T]) => value.length > lowerBound
   }
+
+  /** Check if more than one param in a set of exclusive params are set. */
+  def checkExclusiveParams(model: Params, params: String*): Unit = {
+    if (params.filter(paramName => model.hasParam(paramName) &&
--- End diff --

The purpose of this method is to check whether more than one Param is set among a set of exclusive Params within a Model. Is it useful to put an irrelevant Param into the exclusive Params to check? Since we already know which Params the model has, it sounds like we would be checking an irrelevant Param that we already know does not exist.
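The check being reviewed can be sketched in plain Python. This is a hypothetical model of the idea, not the MLlib code; `check_exclusive_params`, `Model`, and the param names are made-up stand-ins. An irrelevant name (one the model does not have) is silently skipped, which is the behavior viirya is questioning:

```python
# Plain-Python sketch (NOT MLlib code) of an exclusive-param check:
# raise if more than one of a set of mutually exclusive params is set.

def check_exclusive_params(model, *param_names):
    """Raise ValueError if more than one of the named params is set on model."""
    set_params = [name for name in param_names
                  # mirrors model.hasParam(name) && model.isSet(name)
                  if hasattr(model, name) and getattr(model, name) is not None]
    if len(set_params) > 1:
        raise ValueError("Only one of the exclusive params may be set, got: "
                         + ", ".join(set_params))

class Model:
    """Toy model with two mutually exclusive params."""
    def __init__(self, inputCol=None, inputCols=None):
        self.inputCol = inputCol
        self.inputCols = inputCols
```

For example, `check_exclusive_params(Model(inputCol="a"), "inputCol", "inputCols")` passes, while setting both `inputCol` and `inputCols` raises; a name the model lacks, like `"threshold"`, never trips the check.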
[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20087 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86409/ Test FAILed.
[GitHub] spark issue #20203: [SPARK-22577] [core] executor page blacklist status shou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20203 Merged build finished. Test FAILed.
[GitHub] spark issue #20203: [SPARK-22577] [core] executor page blacklist status shou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20203 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86408/ Test FAILed.
[GitHub] spark issue #20087: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20087 Merged build finished. Test FAILed.
[GitHub] spark issue #19528: [SPARK-20393][WEBU UI][1.6] Strengthen Spark to prevent ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19528 Merged build finished. Test FAILed.