[GitHub] spark issue #20720: [SPARK-23570] [SQL] Add Spark 2.3.0 in HiveExternalCatal...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20720 **[Test build #87900 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87900/testReport)** for PR 20720 at commit [`e99e425`](https://github.com/apache/spark/commit/e99e42516c8255d07990ff05f0bfadc4fa8c4244). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20720: [SPARK-23570] [SQL] Add Spark 2.3.0 in HiveExternalCatal...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20720 Merged build finished. Test PASSed.
[GitHub] spark pull request #20722: [SPARK-23571][K8S] Delete auxiliary Kubernetes re...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/20722#discussion_r171975765 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala --- @@ -159,6 +159,19 @@ private[spark] class Client( logInfo(s"Waiting for application $appName to finish...") watcher.awaitCompletion() logInfo(s"Application $appName finished.") +try { --- End diff -- Per the discussion offline, I think the right solution is to move resource management to the driver pod. This way, resource cleanup is guaranteed regardless of the deployment mode.
[GitHub] spark issue #20721: [DO-NOT-MERGE] [TEST] Try different spark.sql.sources.de...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20721 @dongjoon-hyun I am just trying to see how many queries fail and see whether they are all included.
[GitHub] spark issue #20718: [SPARK-23514][FOLLOW-UP] Remove more places using sparkC...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20718 **[Test build #87902 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87902/testReport)** for PR 20718 at commit [`358a76a`](https://github.com/apache/spark/commit/358a76ae781016586a892aa077d88f0c27876d76). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20718: [SPARK-23514][FOLLOW-UP] Remove more places using sparkC...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20718 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87902/ Test PASSed.
[GitHub] spark issue #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFactory.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20710 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87909/ Test FAILed.
[GitHub] spark issue #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFactory.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20710 **[Test build #87909 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87909/testReport)** for PR 20710 at commit [`cb6b2cf`](https://github.com/apache/spark/commit/cb6b2cfeab71abc443f82cc4b016bfc572d45050). * This patch **fails to generate documentation**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20698: [SPARK-23541][SS] Allow Kafka source to read data with g...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20698 Merged build finished. Test PASSed.
[GitHub] spark issue #20698: [SPARK-23541][SS] Allow Kafka source to read data with g...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20698 **[Test build #87911 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87911/testReport)** for PR 20698 at commit [`1e244e0`](https://github.com/apache/spark/commit/1e244e0d484d13d08b5afeed45bcbd9805254fe1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20632: [SPARK-3159][ML] Add decision tree pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20632 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87908/ Test PASSed.
[GitHub] spark issue #20632: [SPARK-3159][ML] Add decision tree pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20632 Merged build finished. Test PASSed.
[GitHub] spark issue #20632: [SPARK-3159][ML] Add decision tree pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20632 **[Test build #87908 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87908/testReport)** for PR 20632 at commit [`0183678`](https://github.com/apache/spark/commit/01836782c828db8ddd928726114d464e273b0a86). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20698: [SPARK-23541][SS] Allow Kafka source to read data with g...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20698 **[Test build #87913 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87913/testReport)** for PR 20698 at commit [`602ab36`](https://github.com/apache/spark/commit/602ab36490a692080682867f98a8a5d8f7b2390d).
[GitHub] spark issue #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFactory.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20710 **[Test build #87912 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87912/testReport)** for PR 20710 at commit [`544eb1b`](https://github.com/apache/spark/commit/544eb1b296bceb213965bf3c5dc1a6264c5b7acd).
[GitHub] spark pull request #20618: [SPARK-23329][SQL] Fix documentation of trigonome...
Github user misutoth commented on a diff in the pull request: https://github.com/apache/spark/pull/20618#discussion_r171991788 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -1500,31 +1534,35 @@ object functions { } /** - * Computes the cosine of the given value. Units in radians. + * @param e angle in radians + * @return cosine of the angle, as if computed by [[java.lang.Math#cos]] --- End diff -- Replaced all of them as suggested.
[GitHub] spark pull request #20618: [SPARK-23329][SQL] Fix documentation of trigonome...
Github user misutoth commented on a diff in the pull request: https://github.com/apache/spark/pull/20618#discussion_r171991647 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala --- @@ -512,7 +529,11 @@ case class Rint(child: Expression) extends UnaryMathExpression(math.rint, "ROUND case class Signum(child: Expression) extends UnaryMathExpression(math.signum, "SIGNUM") @ExpressionDescription( - usage = "_FUNC_(expr) - Returns the sine of `expr`.", + usage = "_FUNC_(expr) - Returns the sine of `expr`, as if computed by `java.lang.Math._FUNC_`.", --- End diff -- As far as the trigonometric functions are concerned, they should now be in sync, I think.
[GitHub] spark issue #20698: [SPARK-23541][SS] Allow Kafka source to read data with g...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20698 Merged build finished. Test PASSed.
[GitHub] spark issue #20618: [SPARK-23329][SQL] Fix documentation of trigonometric fu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20618 **[Test build #87914 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87914/testReport)** for PR 20618 at commit [`627e204`](https://github.com/apache/spark/commit/627e204ed03cfd6caa06e8f64dc605b62f4d2e5e).
[GitHub] spark issue #20698: [SPARK-23541][SS] Allow Kafka source to read data with g...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20698 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1246/ Test PASSed.
[GitHub] spark pull request #20618: [SPARK-23329][SQL] Fix documentation of trigonome...
Github user misutoth commented on a diff in the pull request: https://github.com/apache/spark/pull/20618#discussion_r171991691 --- Diff: python/pyspark/sql/functions.py --- @@ -173,16 +172,26 @@ def _(): _functions_2_1 = { # unary math functions -'degrees': 'Converts an angle measured in radians to an approximately equivalent angle ' + - 'measured in degrees.', -'radians': 'Converts an angle measured in degrees to an approximately equivalent angle ' + - 'measured in radians.', +'degrees': """Converts an angle measured in radians to an approximately equivalent angle + measured in degrees. --- End diff -- added
[GitHub] spark pull request #20618: [SPARK-23329][SQL] Fix documentation of trigonome...
Github user misutoth commented on a diff in the pull request: https://github.com/apache/spark/pull/20618#discussion_r171991733 --- Diff: python/pyspark/sql/functions.py --- @@ -173,16 +172,26 @@ def _(): _functions_2_1 = { # unary math functions -'degrees': 'Converts an angle measured in radians to an approximately equivalent angle ' + - 'measured in degrees.', -'radians': 'Converts an angle measured in degrees to an approximately equivalent angle ' + - 'measured in radians.', +'degrees': """Converts an angle measured in radians to an approximately equivalent angle + measured in degrees. + :param col: angle in radians + :return: angle in degrees, as if computed by `java.lang.Math.toDegrees()`""", +'radians': """Converts an angle measured in degrees to an approximately equivalent angle measured in radians. --- End diff -- Yes. I wrapped the text.
[GitHub] spark issue #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFactory.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20710 **[Test build #87915 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87915/testReport)** for PR 20710 at commit [`79495b1`](https://github.com/apache/spark/commit/79495b1f9e994f77ccf40c47eb2fb0baf5873f66).
[GitHub] spark issue #20705: [SPARK-23553][TESTS] Tests should not assume the default...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20705 **[Test build #87916 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87916/testReport)** for PR 20705 at commit [`3ec9309`](https://github.com/apache/spark/commit/3ec9309f923405873d73a6e5d9376bba08e050d0).
[GitHub] spark issue #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml.featur...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20686 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87904/ Test PASSed.
[GitHub] spark issue #20686: [SPARK-22915][MLlib] Streaming tests for spark.ml.featur...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20686 Merged build finished. Test PASSed.
[GitHub] spark pull request #20722: [SPARK-23571][K8S] Delete auxiliary Kubernetes re...
Github user foxish commented on a diff in the pull request: https://github.com/apache/spark/pull/20722#discussion_r171968886 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala --- @@ -159,6 +159,19 @@ private[spark] class Client( logInfo(s"Waiting for application $appName to finish...") watcher.awaitCompletion() logInfo(s"Application $appName finished.") +try { --- End diff -- (also because this code path isn't invoked in fire-and-forget mode IIUC)
[GitHub] spark issue #20714: [SPARK-23457][SQL][BRANCH-2.3] Register task completion ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20714 Merged build finished. Test PASSed.
[GitHub] spark issue #20714: [SPARK-23457][SQL][BRANCH-2.3] Register task completion ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20714 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1242/ Test PASSed.
[GitHub] spark issue #20700: [SPARK-23546][SQL] Refactor stateless methods/values in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20700 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87899/ Test PASSed.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Merged build finished. Test PASSed.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1243/ Test PASSed.
[GitHub] spark issue #20700: [SPARK-23546][SQL] Refactor stateless methods/values in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20700 Merged build finished. Test PASSed.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20208 **[Test build #87906 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87906/testReport)** for PR 20208 at commit [`6ae471c`](https://github.com/apache/spark/commit/6ae471c8ecaae3eb3888eecaac1c4e7552bedcc6).
[GitHub] spark pull request #20722: [SPARK-23571][K8S] Delete auxiliary Kubernetes re...
Github user foxish commented on a diff in the pull request: https://github.com/apache/spark/pull/20722#discussion_r171972187 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala --- @@ -159,6 +159,19 @@ private[spark] class Client( logInfo(s"Waiting for application $appName to finish...") watcher.awaitCompletion() logInfo(s"Application $appName finished.") +try { --- End diff -- I see! It was the RBAC rules and downscoping them that led us here. I'm concerned not all jobs will actually use this interactive mode of launching. What do you think of just granting more permissions to the driver and allowing cleanup there?
[GitHub] spark issue #20720: [SPARK-23570] [SQL] Add Spark 2.3.0 in HiveExternalCatal...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20720 Thanks! Also cc @cloud-fan Merged to master/2.3
[GitHub] spark pull request #20632: [SPARK-3159][ML] Add decision tree pruning
Github user asolimando commented on a diff in the pull request: https://github.com/apache/spark/pull/20632#discussion_r171983753 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -703,4 +707,16 @@ private object RandomForestSuite { val (indices, values) = map.toSeq.sortBy(_._1).unzip Vectors.sparse(size, indices.toArray, values.toArray) } + + @tailrec + private def getSumLeafCounters(nodes: List[Node], acc: Long = 0): Long = --- End diff -- Sorry, I have added them
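The `@tailrec` helper quoted in the diff above sums per-leaf counters over a list of tree nodes. A minimal Python sketch of the same idea (hypothetical, not Spark's code: `Node` and `count` are illustrative names, and an explicit worklist plus accumulator stands in for Scala's tail-recursive call, since Python has no tail-call optimization):

```python
# Hypothetical sketch: sum a per-leaf counter over a binary tree.
# The accumulator + worklist pattern mirrors a @tailrec accumulator.

class Node:
    def __init__(self, count=0, left=None, right=None):
        self.count = count          # meaningful at leaves
        self.left = left
        self.right = right

def sum_leaf_counters(roots):
    acc, stack = 0, list(roots)
    while stack:
        node = stack.pop()
        if node.left is None and node.right is None:
            acc += node.count       # leaf: add its counter
        else:
            # internal node: visit existing children
            stack.extend(c for c in (node.left, node.right) if c is not None)
    return acc
```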
[GitHub] spark pull request #20698: [SPARK-23541][SS] Allow Kafka source to read data...
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/20698#discussion_r171988510 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaOffsetRangeCalculatorSuite.scala --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.kafka010 + +import scala.collection.JavaConverters._ + +import org.apache.kafka.common.TopicPartition + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.sources.v2.DataSourceOptions + +class KafkaOffsetRangeCalculatorSuite extends SparkFunSuite { + + def testWithMinPartitions(name: String, minPartition: Int) + (f: KafkaOffsetRangeCalculator => Unit): Unit = { +val options = new DataSourceOptions(Map("minPartitions" -> minPartition.toString).asJava) +test(s"with minPartition = $minPartition: $name") { + f(KafkaOffsetRangeCalculator(options)) +} + } + + + test("with no minPartition: N TopicPartitions to N offset ranges") { +val calc = KafkaOffsetRangeCalculator(DataSourceOptions.empty()) +assert( + calc.getRanges( +fromOffsets = Map(tp1 -> 1), +untilOffsets = Map(tp1 -> 2)) == + Seq(KafkaOffsetRange(tp1, 1, 2, None))) + +assert( + calc.getRanges( +fromOffsets = Map(tp1 -> 1), +untilOffsets = Map(tp1 -> 2, tp2 -> 1), Seq.empty) == + Seq(KafkaOffsetRange(tp1, 1, 2, None))) + +assert( + calc.getRanges( +fromOffsets = Map(tp1 -> 1, tp2 -> 1), +untilOffsets = Map(tp1 -> 2)) == + Seq(KafkaOffsetRange(tp1, 1, 2, None))) + +assert( + calc.getRanges( +fromOffsets = Map(tp1 -> 1, tp2 -> 1), +untilOffsets = Map(tp1 -> 2), +executorLocations = Seq("location")) == + Seq(KafkaOffsetRange(tp1, 1, 2, Some("location" + } + + test("with no minPartition: empty ranges ignored") { +val calc = KafkaOffsetRangeCalculator(DataSourceOptions.empty()) +assert( + calc.getRanges( +fromOffsets = Map(tp1 -> 1, tp2 -> 1), +untilOffsets = Map(tp1 -> 2, tp2 -> 1)) == + Seq(KafkaOffsetRange(tp1, 1, 2, None))) + } + + testWithMinPartitions("N TopicPartitions to N offset ranges", 3) { calc => +assert( + calc.getRanges( +fromOffsets = Map(tp1 -> 1, tp2 -> 1, tp3 -> 1), +untilOffsets = Map(tp1 -> 2, tp2 -> 2, tp3 -> 2)) == + Seq( +KafkaOffsetRange(tp1, 1, 2, None), +KafkaOffsetRange(tp2, 1, 2, None), +KafkaOffsetRange(tp3, 1, 2, None))) + } + + 
testWithMinPartitions("1 TopicPartition to N offset ranges", 4) { calc => +assert( + calc.getRanges( +fromOffsets = Map(tp1 -> 1), +untilOffsets = Map(tp1 -> 5)) == + Seq( +KafkaOffsetRange(tp1, 1, 2, None), +KafkaOffsetRange(tp1, 2, 3, None), +KafkaOffsetRange(tp1, 3, 4, None), +KafkaOffsetRange(tp1, 4, 5, None))) + +assert( + calc.getRanges( +fromOffsets = Map(tp1 -> 1), +untilOffsets = Map(tp1 -> 5), +executorLocations = Seq("location")) == +Seq( + KafkaOffsetRange(tp1, 1, 2, None), + KafkaOffsetRange(tp1, 2, 3, None), + KafkaOffsetRange(tp1, 3, 4, None), + KafkaOffsetRange(tp1, 4, 5, None))) // location pref not set when minPartition is set + } + + testWithMinPartitions("N skewed TopicPartitions to M offset ranges", 3) { calc => --- End diff -- can you also add a test: ``` fromOffsets = Map(tp1 -> 1), untilOffsets = Map(tp1 -> 10) minPartitions = 3 ```
[GitHub] spark pull request #20698: [SPARK-23541][SS] Allow Kafka source to read data...
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/20698#discussion_r17198 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetRangeCalculator.scala --- @@ -0,0 +1,106 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.kafka010 + +import org.apache.kafka.common.TopicPartition + +import org.apache.spark.sql.sources.v2.DataSourceOptions + + +/** + * Class to calculate offset ranges to process based on the the from and until offsets, and + * the configured `minPartitions`. + */ +private[kafka010] class KafkaOffsetRangeCalculator(val minPartitions: Option[Int]) { + require(minPartitions.isEmpty || minPartitions.get > 0) + + import KafkaOffsetRangeCalculator._ + /** + * Calculate the offset ranges that we are going to process this batch. If `minPartitions` + * is not set or is set less than or equal the number of `topicPartitions` that we're going to + * consume, then we fall back to a 1-1 mapping of Spark tasks to Kafka partitions. If + * `numPartitions` is set higher than the number of our `topicPartitions`, then we will split up + * the read tasks of the skewed partitions to multiple Spark tasks. 
+ * The number of Spark tasks will be *approximately* `numPartitions`. It can be less or more + * depending on rounding errors or Kafka partitions that didn't receive any new data. + */ + def getRanges( + fromOffsets: PartitionOffsetMap, + untilOffsets: PartitionOffsetMap, + executorLocations: Seq[String] = Seq.empty): Seq[KafkaOffsetRange] = { +val partitionsToRead = untilOffsets.keySet.intersect(fromOffsets.keySet) + +val offsetRanges = partitionsToRead.toSeq.map { tp => + KafkaOffsetRange(tp, fromOffsets(tp), untilOffsets(tp), preferredLoc = None) +}.filter(_.size > 0) + +// If minPartitions not set or there are enough partitions to satisfy minPartitions +if (minPartitions.isEmpty || offsetRanges.size > minPartitions.get) { + // Assign preferred executor locations to each range such that the same topic-partition is + // preferentially read from the same executor and the KafkaConsumer can be reused. + offsetRanges.map { range => +range.copy(preferredLoc = getLocation(range.topicPartition, executorLocations)) + } +} else { + + // Splits offset ranges with relatively large amount of data to smaller ones. + val totalSize = offsetRanges.map(o => o.untilOffset - o.fromOffset).sum --- End diff -- nit: `map(_.size).sum`
[GitHub] spark pull request #20698: [SPARK-23541][SS] Allow Kafka source to read data...
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/20698#discussion_r171987888 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetRangeCalculator.scala --- @@ -0,0 +1,106 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.kafka010 + +import org.apache.kafka.common.TopicPartition + +import org.apache.spark.sql.sources.v2.DataSourceOptions + + +/** + * Class to calculate offset ranges to process based on the the from and until offsets, and + * the configured `minPartitions`. + */ +private[kafka010] class KafkaOffsetRangeCalculator(val minPartitions: Option[Int]) { + require(minPartitions.isEmpty || minPartitions.get > 0) + + import KafkaOffsetRangeCalculator._ + /** + * Calculate the offset ranges that we are going to process this batch. If `minPartitions` + * is not set or is set less than or equal the number of `topicPartitions` that we're going to + * consume, then we fall back to a 1-1 mapping of Spark tasks to Kafka partitions. If + * `numPartitions` is set higher than the number of our `topicPartitions`, then we will split up + * the read tasks of the skewed partitions to multiple Spark tasks. 
+ * The number of Spark tasks will be *approximately* `numPartitions`. It can be less or more + * depending on rounding errors or Kafka partitions that didn't receive any new data. + */ + def getRanges( + fromOffsets: PartitionOffsetMap, + untilOffsets: PartitionOffsetMap, + executorLocations: Seq[String] = Seq.empty): Seq[KafkaOffsetRange] = { +val partitionsToRead = untilOffsets.keySet.intersect(fromOffsets.keySet) + +val offsetRanges = partitionsToRead.toSeq.map { tp => + KafkaOffsetRange(tp, fromOffsets(tp), untilOffsets(tp), preferredLoc = None) +}.filter(_.size > 0) + +// If minPartitions not set or there are enough partitions to satisfy minPartitions +if (minPartitions.isEmpty || offsetRanges.size > minPartitions.get) { + // Assign preferred executor locations to each range such that the same topic-partition is + // preferentially read from the same executor and the KafkaConsumer can be reused. + offsetRanges.map { range => +range.copy(preferredLoc = getLocation(range.topicPartition, executorLocations)) + } +} else { + + // Splits offset ranges with relatively large amount of data to smaller ones. + val totalSize = offsetRanges.map(o => o.untilOffset - o.fromOffset).sum + val idealRangeSize = totalSize.toDouble / minPartitions.get + + offsetRanges.flatMap { range => +// Split the current range into subranges as close to the ideal range size +val rangeSize = range.untilOffset - range.fromOffset +val numSplitsInRange = math.round(rangeSize.toDouble / idealRangeSize).toInt --- End diff -- nit: `range.size`, you may remove `rangeSize` above
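The splitting logic under review above can be paraphrased as: compute an ideal range size from the total number of offsets divided by `minPartitions`, then split each partition's range into roughly that many equal pieces. A minimal Python sketch of that math (hypothetical: `split_ranges` and the tuple representation are illustrative, not Spark's API; rounding details may differ from the actual patch):

```python
# Hypothetical sketch of the range-splitting math discussed above:
# split each (topic_partition, from, until) offset range so the total
# number of subranges approximates min_partitions.

def split_ranges(ranges, min_partitions):
    """ranges: list of (topic_partition, from_offset, until_offset)."""
    total_size = sum(until - frm for _, frm, until in ranges)
    ideal = total_size / min_partitions
    out = []
    for tp, frm, until in ranges:
        size = until - frm
        # Round to the nearest number of splits; keep at least one per range.
        n = max(1, round(size / ideal))
        step = size / n
        for i in range(n):
            lo = frm + int(i * step)
            # The last subrange ends exactly at `until` to avoid losing offsets.
            hi = frm + int((i + 1) * step) if i < n - 1 else until
            out.append((tp, lo, hi))
    return out
```

With the skewed case requested in the review (tp1 from offset 1 to 10, `minPartitions = 3`), this yields three subranges of three offsets each; with evenly sized ranges already matching `minPartitions`, the ranges pass through unchanged.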
[GitHub] spark pull request #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFacto...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20710#discussion_r171993066 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriterFactory.java --- @@ -48,6 +48,9 @@ * same task id but different attempt number, which means there are multiple * tasks with the same task id running at the same time. Implementations can * use this attempt number to distinguish writers of different task attempts. + * @param epochId A monotonically increasing id for streaming queries that are split in to + *discrete periods of execution. For queries that execute as a single batch, this + *id will always be zero. */ - DataWriter createDataWriter(int partitionId, int attemptNumber); + DataWriter createDataWriter(int partitionId, int attemptNumber, long epochId); --- End diff -- Add clear lifecycle semantics.
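As a rough illustration of how an implementation might use the new `epochId` parameter, here is a hypothetical sketch with simplified stand-in types (this is not the real `DataWriterFactory` interface):

```scala
// Hypothetical sketch: a factory that encodes the epoch into the output
// path, so a retried epoch overwrites its own output rather than another's.
class EpochAwareWriter(val outputPath: String) {
  def write(record: String): Unit = () // writing elided in this sketch
}

class EpochAwareFactory {
  def createDataWriter(partitionId: Int, attemptNumber: Int, epochId: Long): EpochAwareWriter = {
    // For a query that executes as a single batch, epochId is always 0;
    // for a streaming query it increases with each period of execution.
    new EpochAwareWriter(s"out/epoch=$epochId/part=$partitionId")
  }
}
```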
[GitHub] spark pull request #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFacto...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20710#discussion_r171993101 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala --- @@ -198,7 +201,7 @@ object DataWritingSparkTask extends Logging { })(catchBlock = { // If there is an error, abort this writer logError(s"Writer for partition ${context.partitionId()} is aborting.") -dataWriter.abort() +if (dataWriter != null) dataWriter.abort() logError(s"Writer for partition ${context.partitionId()} aborted.") --- End diff -- nit: add comment that the exception will be rethrown.
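The pattern under discussion — creating the writer inside the guarded block and null-checking before `abort()` — can be sketched standalone. These are simplified stand-in types; the real code uses Spark's `Utils.tryWithSafeFinallyAndFailureCallbacks`:

```scala
// Simplified sketch of the null-guarded abort path. If createWriter()
// itself throws, dataWriter was never assigned, so the catch block must
// not call abort() on a null reference.
class SketchWriter {
  var abortedCount = 0
  def write(record: String): Unit = ()
  def abort(): Unit = abortedCount += 1
  def commit(): String = "committed"
}

def runWriteTask(createWriter: () => SketchWriter, records: Iterator[String]): String = {
  var dataWriter: SketchWriter = null
  try {
    dataWriter = createWriter()
    records.foreach(dataWriter.write)
    dataWriter.commit()
  } catch {
    case e: Throwable =>
      if (dataWriter != null) dataWriter.abort()
      throw e // the original exception is rethrown after the abort
  }
}
```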
[GitHub] spark issue #20698: [SPARK-23541][SS] Allow Kafka source to read data with g...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20698 Merged build finished. Test PASSed.
[GitHub] spark issue #20698: [SPARK-23541][SS] Allow Kafka source to read data with g...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20698 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87913/ Test PASSed.
[GitHub] spark issue #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFactory.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20710 **[Test build #87918 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87918/testReport)** for PR 20710 at commit [`215c225`](https://github.com/apache/spark/commit/215c225c5a1623cfa02f617201e21067bbf6088a).
[GitHub] spark pull request #20722: [SPARK-23571][K8S] Delete auxiliary Kubernetes re...
Github user foxish commented on a diff in the pull request: https://github.com/apache/spark/pull/20722#discussion_r171968410 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala --- @@ -159,6 +159,19 @@ private[spark] class Client( logInfo(s"Waiting for application $appName to finish...") watcher.awaitCompletion() logInfo(s"Application $appName finished.") +try { --- End diff -- Why are we doing this in client code? Driver shutdown is the right place to perform cleanup, right?
[GitHub] spark issue #20723: [SPARK-23538][core] Remove custom configuration for SSL ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20723 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1244/ Test PASSed.
[GitHub] spark issue #20723: [SPARK-23538][core] Remove custom configuration for SSL ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20723 Merged build finished. Test PASSed.
[GitHub] spark pull request #20632: [SPARK-3159][ML] Add decision tree pruning
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/20632#discussion_r171982559 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -703,4 +707,16 @@ private object RandomForestSuite { val (indices, values) = map.toSeq.sortBy(_._1).unzip Vectors.sparse(size, indices.toArray, values.toArray) } + + @tailrec + private def getSumLeafCounters(nodes: List[Node], acc: Long = 0): Long = --- End diff -- Need to enclose the function body in curly braces
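For context, a tail-recursive traversal of the shape being reviewed, with the body enclosed in braces as requested. The node type here is a simplified stand-in; the actual suite walks Spark's `ml.tree.Node`:

```scala
import scala.annotation.tailrec

// Simplified stand-in for the tree nodes summed by getSumLeafCounters.
sealed trait Node
case class Leaf(counter: Long) extends Node
case class Internal(left: Node, right: Node) extends Node

// Children are pushed onto the worklist so the recursion stays in tail
// position, which is what lets @tailrec compile.
@tailrec
def getSumLeafCounters(nodes: List[Node], acc: Long = 0): Long = {
  nodes match {
    case Nil => acc
    case Leaf(c) :: tail => getSumLeafCounters(tail, acc + c)
    case Internal(l, r) :: tail => getSumLeafCounters(l :: r :: tail, acc)
  }
}
```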
[GitHub] spark issue #20698: [SPARK-23541][SS] Allow Kafka source to read data with g...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20698 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87911/ Test PASSed.
[GitHub] spark issue #20632: [SPARK-3159][ML] Add decision tree pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20632 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87910/ Test PASSed.
[GitHub] spark issue #20632: [SPARK-3159][ML] Add decision tree pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20632 Merged build finished. Test PASSed.
[GitHub] spark pull request #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFacto...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20710#discussion_r171993276 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala --- @@ -172,17 +173,19 @@ object DataWritingSparkTask extends Logging { writeTask: DataWriterFactory[InternalRow], context: TaskContext, iter: Iterator[InternalRow]): WriterCommitMessage = { -val dataWriter = writeTask.createDataWriter(context.partitionId(), context.attemptNumber()) val epochCoordinator = EpochCoordinatorRef.get( context.getLocalProperty(ContinuousExecution.EPOCH_COORDINATOR_ID_KEY), SparkEnv.get) val currentMsg: WriterCommitMessage = null var currentEpoch = context.getLocalProperty(ContinuousExecution.START_EPOCH_KEY).toLong do { + var dataWriter: DataWriter[InternalRow] = null // write the data and commit this writer. Utils.tryWithSafeFinallyAndFailureCallbacks(block = { try { + dataWriter = writeTask.createDataWriter( +context.partitionId(), context.attemptNumber(), currentEpoch) iter.foreach(dataWriter.write) --- End diff -- fix this! don't use foreach.
[GitHub] spark pull request #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFacto...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20710#discussion_r171993467 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriter.java --- @@ -36,8 +36,9 @@ * {@link DataSourceWriter#commit(WriterCommitMessage[])} with commit messages from other data * writers. If this data writer fails(one record fails to write or {@link #commit()} fails), an * exception will be sent to the driver side, and Spark will retry this writing task for some times, --- End diff -- Spark may retry... (in continuous we don't retry the task)
[GitHub] spark issue #20721: [DO-NOT-MERGE] [TEST] Try different spark.sql.sources.de...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20721 Hi, @gatorsmile . Actually, this one is tested via #20705 . Do you want to get a result on `json`?
[GitHub] spark issue #20714: [SPARK-23457][SQL][BRANCH-2.3] Register task completion ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20714 **[Test build #87905 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87905/testReport)** for PR 20714 at commit [`d15eba7`](https://github.com/apache/spark/commit/d15eba754a59721bc7d9cdc7d374f2f323d21e41).
[GitHub] spark issue #20705: [SPARK-23553][TESTS] Tests should not assume the default...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20705 Could you change the default value to json? Can all the tests pass?
[GitHub] spark pull request #20723: [SPARK-23538][core] Remove custom configuration f...
GitHub user vanzin opened a pull request: https://github.com/apache/spark/pull/20723 [SPARK-23538][core] Remove custom configuration for SSL client. These options were used to configure the built-in JRE SSL libraries when downloading files from HTTPS servers. But because they were also used to set up the now (long) removed internal HTTPS file server, their default configuration chose convenience over security by having overly lenient settings. This change removes the configuration options that affect the JRE SSL libraries. The JRE trust store can still be configured via system properties (or globally in the JRE security config). The only lost functionality is not being able to disable the default hostname verifier when using spark-submit, which should be fine since Spark itself is not using https for any internal functionality anymore. I also removed the HTTP-related code from the REPL class loader, since we haven't had a HTTP server for REPL-generated classes for a while. You can merge this pull request into a Git repository by running: $ git pull https://github.com/vanzin/spark SPARK-23538 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20723.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20723 commit c83611eca573f3f460790f4fde7bea7ef7887839 Author: Marcelo Vanzin, Date: 2018-03-02T21:43:48Z [SPARK-23538][core] Remove custom configuration for SSL client. These options were used to configure the built-in JRE SSL libraries when downloading files from HTTPS servers. But because they were also used to set up the now (long) removed internal HTTPS file server, their default configuration chose convenience over security by having overly lenient settings. This change removes the configuration options that affect the JRE SSL libraries.
The JRE trust store can still be configured via system properties (or globally in the JRE security config). The only lost functionality is not being able to disable the default hostname verifier when using spark-submit, which should be fine since Spark itself is not using https for any internal functionality anymore. I also removed the HTTP-related code from the REPL class loader, since we haven't had a HTTP server for REPL-generated classes for a while.
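The remaining configuration path mentioned above — the standard JRE system properties — can be exercised like this (the store path and password are illustrative values, not Spark defaults):

```scala
// After this change, the JRE trust store is configured only through the
// standard javax.net.ssl system properties (or the global JRE security
// config), e.g. passed on the command line as -Djavax.net.ssl.trustStore=...
sys.props("javax.net.ssl.trustStore") = "/etc/pki/custom-truststore.jks"
sys.props("javax.net.ssl.trustStorePassword") = "changeit"

// The SSL stack reads these properties when it initializes its trust manager.
val trustStore = sys.props.get("javax.net.ssl.trustStore")
```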
[GitHub] spark pull request #20721: [DO-NOT-MERGE] [TEST] Try different spark.sql.sou...
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/20721
[GitHub] spark issue #20723: [SPARK-23538][core] Remove custom configuration for SSL ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20723 **[Test build #87907 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87907/testReport)** for PR 20723 at commit [`c83611e`](https://github.com/apache/spark/commit/c83611eca573f3f460790f4fde7bea7ef7887839).
[GitHub] spark pull request #20720: [SPARK-23570] [SQL] Add Spark 2.3.0 in HiveExtern...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20720
[GitHub] spark issue #20632: [SPARK-3159][ML] Add decision tree pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20632 **[Test build #87908 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87908/testReport)** for PR 20632 at commit [`0183678`](https://github.com/apache/spark/commit/01836782c828db8ddd928726114d464e273b0a86).
[GitHub] spark issue #20632: [SPARK-3159][ML] Add decision tree pruning
Github user asolimando commented on the issue: https://github.com/apache/spark/pull/20632 Thanks for the suggestion @sethah, I have updated the PR with the extra check (both tests).
[GitHub] spark issue #20632: [SPARK-3159][ML] Add decision tree pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20632 **[Test build #87910 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87910/testReport)** for PR 20632 at commit [`3dfe86c`](https://github.com/apache/spark/commit/3dfe86c8376c705137d3b2ae57565a3c30d4a96d).
[GitHub] spark issue #20698: [SPARK-23541][SS] Allow Kafka source to read data with g...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20698 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1245/ Test PASSed.
[GitHub] spark issue #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFactory.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20710 Merged build finished. Test FAILed.
[GitHub] spark issue #20718: [SPARK-23514][FOLLOW-UP] Remove more places using sparkC...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20718 Merged build finished. Test PASSed.
[GitHub] spark issue #20698: [SPARK-23541][SS] Allow Kafka source to read data with g...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20698 **[Test build #87911 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87911/testReport)** for PR 20698 at commit [`1e244e0`](https://github.com/apache/spark/commit/1e244e0d484d13d08b5afeed45bcbd9805254fe1).
[GitHub] spark pull request #20698: [SPARK-23541][SS] Allow Kafka source to read data...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20698#discussion_r171989658 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetRangeCalculator.scala --- @@ -0,0 +1,106 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.kafka010 + +import org.apache.kafka.common.TopicPartition + +import org.apache.spark.sql.sources.v2.DataSourceOptions + + +/** + * Class to calculate offset ranges to process based on the the from and until offsets, and + * the configured `minPartitions`. + */ +private[kafka010] class KafkaOffsetRangeCalculator(val minPartitions: Option[Int]) { + require(minPartitions.isEmpty || minPartitions.get > 0) + + import KafkaOffsetRangeCalculator._ + /** + * Calculate the offset ranges that we are going to process this batch. If `minPartitions` + * is not set or is set less than or equal the number of `topicPartitions` that we're going to + * consume, then we fall back to a 1-1 mapping of Spark tasks to Kafka partitions. If + * `numPartitions` is set higher than the number of our `topicPartitions`, then we will split up + * the read tasks of the skewed partitions to multiple Spark tasks. 
+ * The number of Spark tasks will be *approximately* `numPartitions`. It can be less or more + * depending on rounding errors or Kafka partitions that didn't receive any new data. + */ + def getRanges( + fromOffsets: PartitionOffsetMap, + untilOffsets: PartitionOffsetMap, + executorLocations: Seq[String] = Seq.empty): Seq[KafkaOffsetRange] = { +val partitionsToRead = untilOffsets.keySet.intersect(fromOffsets.keySet) + +val offsetRanges = partitionsToRead.toSeq.map { tp => + KafkaOffsetRange(tp, fromOffsets(tp), untilOffsets(tp), preferredLoc = None) +}.filter(_.size > 0) + +// If minPartitions not set or there are enough partitions to satisfy minPartitions +if (minPartitions.isEmpty || offsetRanges.size > minPartitions.get) { + // Assign preferred executor locations to each range such that the same topic-partition is + // preferentially read from the same executor and the KafkaConsumer can be reused. + offsetRanges.map { range => +range.copy(preferredLoc = getLocation(range.topicPartition, executorLocations)) + } +} else { + + // Splits offset ranges with relatively large amount of data to smaller ones. 
+ val totalSize = offsetRanges.map(o => o.untilOffset - o.fromOffset).sum + val idealRangeSize = totalSize.toDouble / minPartitions.get + + offsetRanges.flatMap { range => +// Split the current range into subranges as close to the ideal range size +val rangeSize = range.untilOffset - range.fromOffset +val numSplitsInRange = math.round(rangeSize.toDouble / idealRangeSize).toInt + +(0 until numSplitsInRange).map { i => + val splitStart = range.fromOffset + rangeSize * (i.toDouble / numSplitsInRange) + val splitEnd = range.fromOffset + rangeSize * ((i.toDouble + 1) / numSplitsInRange) + KafkaOffsetRange( +range.topicPartition, splitStart.toLong, splitEnd.toLong, preferredLoc = None) +} + } +} + } + + private def getLocation(tp: TopicPartition, executorLocations: Seq[String]): Option[String] = { +def floorMod(a: Long, b: Int): Int = ((a % b).toInt + b) % b + +val numExecutors = executorLocations.length +if (numExecutors > 0) { + // This allows cached KafkaConsumers in the executors to be re-used to read the same + // partition in every batch. + Some(executorLocations(floorMod(tp.hashCode, numExecutors))) +} else None + } +} + +private[kafka010] object KafkaOffsetRangeCalculator { + + def apply(options:
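The `floorMod` helper in the quoted code exists because `TopicPartition.hashCode` can be negative and Scala's `%` keeps the sign of the dividend; the executor-pinning logic can be sketched standalone (`preferredLocation` is an illustrative stand-in for the private `getLocation`):

```scala
// floorMod maps any Long (including negative hash codes) into [0, b).
def floorMod(a: Long, b: Int): Int = ((a % b).toInt + b) % b

// Illustrative stand-in for pinning a topic-partition to one executor so
// that executor's cached KafkaConsumer is reused across batches.
def preferredLocation(partitionHash: Long, executorLocations: Seq[String]): Option[String] = {
  if (executorLocations.nonEmpty) {
    Some(executorLocations(floorMod(partitionHash, executorLocations.length)))
  } else None
}
```

Because the mapping depends only on the partition's hash and the executor list, the same partition lands on the same executor in every batch as long as the executor set is stable.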
[GitHub] spark issue #20705: [SPARK-23553][TESTS] Tests should not assume the default...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20705 Sure, @gatorsmile .
[GitHub] spark pull request #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFacto...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20710#discussion_r171993622 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriterFactory.java --- @@ -48,6 +48,9 @@ * same task id but different attempt number, which means there are multiple * tasks with the same task id running at the same time. Implementations can * use this attempt number to distinguish writers of different task attempts. + * @param epochId A monotonically increasing id for streaming queries that are split in to + *discrete periods of execution. For queries that execute as a single batch, this --- End diff -- For non-streaming queries, this...
[GitHub] spark pull request #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFacto...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20710#discussion_r171993519 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriter.java --- @@ -36,8 +36,9 @@ * {@link DataSourceWriter#commit(WriterCommitMessage[])} with commit messages from other data * writers. If this data writer fails(one record fails to write or {@link #commit()} fails), an * exception will be sent to the driver side, and Spark will retry this writing task for some times, --- End diff -- for some times --> for a few times
[GitHub] spark pull request #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFacto...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20710#discussion_r171993559 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriter.java --- @@ -36,8 +36,9 @@ * {@link DataSourceWriter#commit(WriterCommitMessage[])} with commit messages from other data * writers. If this data writer fails(one record fails to write or {@link #commit()} fails), an * exception will be sent to the driver side, and Spark will retry this writing task for some times, --- End diff -- Break this sentence. very long.
[GitHub] spark issue #20698: [SPARK-23541][SS] Allow Kafka source to read data with g...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20698 **[Test build #87913 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87913/testReport)** for PR 20698 at commit [`602ab36`](https://github.com/apache/spark/commit/602ab36490a692080682867f98a8a5d8f7b2390d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20724: [SPARK-18630][PYTHON][ML] Move del method from JavaParam...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20724 Can one of the admins verify this patch?
[GitHub] spark issue #20725: [WIP][SPARK-23555][PYTHON] Add BinaryType support for Ar...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20725 Merged build finished. Test FAILed.
[GitHub] spark issue #20725: [WIP][SPARK-23555][PYTHON] Add BinaryType support for Ar...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20725 **[Test build #87919 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87919/testReport)** for PR 20725 at commit [`afcb0d5`](https://github.com/apache/spark/commit/afcb0d5608c17d2fc004a0b6d4af4573abca4e4b). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20725: [WIP][SPARK-23555][PYTHON] Add BinaryType support for Ar...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20725 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87919/ Test FAILed.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Merged build finished. Test PASSed.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20208 **[Test build #87906 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87906/testReport)** for PR 20208 at commit [`6ae471c`](https://github.com/apache/spark/commit/6ae471c8ecaae3eb3888eecaac1c4e7552bedcc6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87906/ Test PASSed.
[GitHub] spark issue #20718: [SPARK-23514][FOLLOW-UP] Remove more places using sparkC...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20718 thanks, merging to master!
[GitHub] spark issue #20723: [SPARK-23538][core] Remove custom configuration for SSL ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20723 **[Test build #87907 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87907/testReport)** for PR 20723 at commit [`c83611e`](https://github.com/apache/spark/commit/c83611eca573f3f460790f4fde7bea7ef7887839). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20618: [SPARK-23329][SQL] Fix documentation of trigonometric fu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20618 **[Test build #87914 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87914/testReport)** for PR 20618 at commit [`627e204`](https://github.com/apache/spark/commit/627e204ed03cfd6caa06e8f64dc605b62f4d2e5e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20618: [SPARK-23329][SQL] Fix documentation of trigonometric fu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20618 Merged build finished. Test PASSed.
[GitHub] spark issue #20618: [SPARK-23329][SQL] Fix documentation of trigonometric fu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20618 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87914/ Test PASSed.
[GitHub] spark issue #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFactory.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20710 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87917/ Test PASSed.
[GitHub] spark issue #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFactory.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20710 **[Test build #87917 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87917/testReport)** for PR 20710 at commit [`9fb74e2`](https://github.com/apache/spark/commit/9fb74e2ccbe668ac6a1f2d2240b67a04400ba78b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20725: [WIP][SPARK-23555][PYTHON] Add BinaryType support for Ar...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20725 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1248/
[GitHub] spark issue #20724: [SPARK-18630][PYTHON][ML] Move del method from JavaParam...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20724 Can one of the admins verify this patch?
[GitHub] spark issue #20725: [WIP][SPARK-23555][PYTHON] Add BinaryType support for Ar...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20725 **[Test build #87919 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87919/testReport)** for PR 20725 at commit [`afcb0d5`](https://github.com/apache/spark/commit/afcb0d5608c17d2fc004a0b6d4af4573abca4e4b).
[GitHub] spark issue #20725: [WIP][SPARK-23555][PYTHON] Add BinaryType support for Ar...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20725 Merged build finished. Test PASSed.
[GitHub] spark pull request #20725: [WIP][SPARK-23555][PYTHON] Add BinaryType support...
GitHub user BryanCutler opened a pull request: https://github.com/apache/spark/pull/20725

[WIP][SPARK-23555][PYTHON] Add BinaryType support for Arrow

## What changes were proposed in this pull request?

Adding `BinaryType` support for Arrow in pyspark.

## How was this patch tested?

Additional unit tests in pyspark for code paths that use Arrow

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/BryanCutler/spark arrow-binary-type-support-SPARK-23555

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20725.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20725

commit afcb0d5608c17d2fc004a0b6d4af4573abca4e4b
Author: Bryan Cutler
Date: 2018-03-03T00:32:45Z

    added support for binary type, not currently working due to arrow error
[GitHub] spark issue #20725: [WIP][SPARK-23555][PYTHON] Add BinaryType support for Ar...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20725

This is a WIP as some issues need to be worked out on the Arrow side, and tests are still needed for pandas_udfs. Currently the following error occurs when converting from pandas:

```
  File "/home/bryan/git/spark/python/pyspark/serializers.py", line 237, in create_array
    return pa.Array.from_pandas(s, mask=mask, type=t)
  File "array.pxi", line 335, in pyarrow.lib.Array.from_pandas
  File "array.pxi", line 170, in pyarrow.lib.array
  File "array.pxi", line 70, in pyarrow.lib._ndarray_to_array
  File "error.pxi", line 85, in pyarrow.lib.check_status
ArrowNotImplementedError: No cast implemented from binary to binary
```

The corresponding JIRA is https://issues.apache.org/jira/browse/ARROW-2141
[GitHub] spark issue #20724: [SPARK-18630][PYTHON][ML] Move del method from JavaParam...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20724 add to whitelist
[GitHub] spark issue #20698: [SPARK-23541][SS] Allow Kafka source to read data with g...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/20698 Thank you. Merging to master only as this is a new feature touching production code paths.
[GitHub] spark issue #20723: [SPARK-23538][core] Remove custom configuration for SSL ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20723 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87907/
[GitHub] spark issue #20458: changed scala example from java "style" to scala
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20458 Can one of the admins verify this patch?
[GitHub] spark issue #20723: [SPARK-23538][core] Remove custom configuration for SSL ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20723 Merged build finished. Test PASSed.