[GitHub] spark issue #17368: [SPARK-20039][ML] rename ChiSquare to ChiSquareTest
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17368 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14694: [SPARK-17121][SPARKSUBMIT] Support _HOST replacement for...
Github user wolf31o2 commented on the issue: https://github.com/apache/spark/pull/14694 This is useful for the Spark HistoryServer, especially if it's configured to store history in HDFS. That's where I've run into this issue.
[GitHub] spark issue #17324: [SPARK-19969] [ML] Imputer doc and example
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17324 Will take a look this week
[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14731 **[Test build #74990 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74990/testReport)** for PR 14731 at commit [`a3aaf26`](https://github.com/apache/spark/commit/a3aaf267d2ac30c012b4a71b7a80e28a49ff10be).
[GitHub] spark issue #16499: [SPARK-17204][CORE] Fix replicated off heap storage
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16499 > @mallman can you send a new PR for 2.0? thanks! Will do. Do I need to open a new JIRA ticket for that?
[GitHub] spark issue #17364: [SPARK-20038] [SQL]: FileFormatWriter.ExecuteWriteTask.r...
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/17364 Looking some more: yes, since `tryWithSafeFinallyAndFailureCallbacks` wraps task commit, it guarantees that the original cause doesn't get lost. The abortJob code isn't so well guarded, and it looks like a failure there may hide a previous one (like a commitJob failure).
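The exception-masking hazard steveloughran describes is worth making concrete. Below is an illustrative sketch (not Spark's actual utility; `SafeFinally.tryWithSafeFinally` is a hypothetical name) of the pattern `tryWithSafeFinallyAndFailureCallbacks` implements: if cleanup in the finally clause also throws, attach that exception to the original as a suppressed exception instead of letting it mask the real failure — exactly the guard the abortJob path lacks.

```scala
// Sketch of the "don't lose the original cause" pattern, assuming nothing
// beyond the JDK's Throwable.addSuppressed.
object SafeFinally {
  def tryWithSafeFinally[T](block: => T)(finallyBlock: => Unit): T = {
    var originalThrowable: Throwable = null
    try {
      block
    } catch {
      case t: Throwable =>
        originalThrowable = t
        throw t
    } finally {
      try {
        finallyBlock
      } catch {
        case t: Throwable if originalThrowable != null =>
          // Keep the original failure as the thrown exception; record the
          // cleanup failure as suppressed rather than letting it replace it.
          originalThrowable.addSuppressed(t)
      }
    }
  }
}
```

Without this wrapper, a plain `try { commit() } finally { abort() }` would surface only the abort failure when both throw, which is the hidden-cause problem the comment points at.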
[GitHub] spark issue #17295: [SPARK-19556][core] Do not encrypt block manager data in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17295 **[Test build #74991 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74991/testReport)** for PR 17295 at commit [`107e3e7`](https://github.com/apache/spark/commit/107e3e72e81d2c7813d832d3e9c2beab89e01379).
[GitHub] spark issue #16499: [SPARK-17204][CORE] Fix replicated off heap storage
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16499 You do not need to open a new JIRA. You can still use the same JIRA number.
[GitHub] spark issue #17368: [SPARK-20039][ML] rename ChiSquare to ChiSquareTest
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/17368 Yep, thanks for confirming that @srowen and checking it out @imatiach-msft and @MLnick ! Merging with master
[GitHub] spark pull request #17368: [SPARK-20039][ML] rename ChiSquare to ChiSquareTe...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17368
[GitHub] spark issue #17302: [SPARK-19959][SQL] Fix to throw NullPointerException in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17302 **[Test build #74988 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74988/testReport)** for PR 17302 at commit [`d7d0a36`](https://github.com/apache/spark/commit/d7d0a36f6b4fb78cc0a3a13f870a41b03adf882f).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #17302: [SPARK-19959][SQL] Fix to throw NullPointerException in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17302 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74988/ Test FAILed.
[GitHub] spark issue #17302: [SPARK-19959][SQL] Fix to throw NullPointerException in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17302 Merged build finished. Test FAILed.
[GitHub] spark issue #17350: [SPARK-20017][SQL] change the nullability of function 'S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17350 **[Test build #74986 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74986/testReport)** for PR 17350 at commit [`1260ef7`](https://github.com/apache/spark/commit/1260ef7baf3382fe3009302f37462e82d3550bb2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #17350: [SPARK-20017][SQL] change the nullability of function 'S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17350 Merged build finished. Test PASSed.
[GitHub] spark issue #17350: [SPARK-20017][SQL] change the nullability of function 'S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17350 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74986/ Test PASSed.
[GitHub] spark issue #17302: [SPARK-19959][SQL] Fix to throw NullPointerException in ...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/17302 Jenkins, retest this please
[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14731 **[Test build #74990 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74990/testReport)** for PR 14731 at commit [`a3aaf26`](https://github.com/apache/spark/commit/a3aaf267d2ac30c012b4a71b7a80e28a49ff10be).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14731 Merged build finished. Test PASSed.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user ericl commented on the issue: https://github.com/apache/spark/pull/17166 jenkins retest this please
[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14731 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74990/ Test PASSed.
[GitHub] spark issue #17302: [SPARK-19959][SQL] Fix to throw NullPointerException in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17302 **[Test build #74992 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74992/testReport)** for PR 17302 at commit [`d7d0a36`](https://github.com/apache/spark/commit/d7d0a36f6b4fb78cc0a3a13f870a41b03adf882f).
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17166 **[Test build #74993 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74993/testReport)** for PR 17166 at commit [`884a3ad`](https://github.com/apache/spark/commit/884a3ad7308e69c0ca010c344133bcce6582920d).
[GitHub] spark pull request #17377: [SPARK-19949][SQL][FOLLOW-UP] Make parse modes as...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17377#discussion_r107235501

--- Diff: python/pyspark/sql/readwriter.py ---
@@ -369,10 +369,10 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
         :param maxCharsPerColumn: defines the maximum number of characters allowed for any given
                                   value being read. If None is set, it uses the default value,
                                   ``-1`` meaning unlimited length.
-        :param maxMalformedLogPerPartition: sets the maximum number of malformed rows Spark will
-                                            log for each partition. Malformed records beyond this
-                                            number will be ignored. If None is set, it
-                                            uses the default value, ``10``.
+        :param maxMalformedLogPerPartition: previously sets the maximum number of malformed rows
+                                            Spark will log. However, it does not log them after
+                                            2.2.0. This parameter exists only for backwards
+                                            compatibility for positional arguments.
--- End diff --

Let us simplify it to

> This parameter is no longer used since Spark 2.2.0. If specified, it is ignored.
[GitHub] spark pull request #17377: [SPARK-19949][SQL][FOLLOW-UP] Make parse modes as...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17377#discussion_r107235659

--- Diff: python/pyspark/sql/streaming.py ---
@@ -625,6 +625,10 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
         :param maxCharsPerColumn: defines the maximum number of characters allowed for any given
                                   value being read. If None is set, it uses the default value,
                                   ``-1`` meaning unlimited length.
+        :param maxMalformedLogPerPartition: previously sets the maximum number of malformed rows
+                                            Spark will log. However, it does not log them after
+                                            2.2.0. This parameter exists only for backwards
+                                            compatibility for positional arguments.
--- End diff --

The same here.
[GitHub] spark issue #17350: [SPARK-20017][SQL] change the nullability of function 'S...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17350 LGTM. Since it is very close to code freeze, let me merge it to master and 2.1 first. You can address the remaining issues in a follow-up PR. Thanks!
[GitHub] spark pull request #17350: [SPARK-20017][SQL] change the nullability of func...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17350
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r107233146

--- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -302,12 +298,12 @@ private[spark] class Executor(
         // If this task has been killed before we deserialized it, let's quit now. Otherwise,
         // continue executing the task.
-        if (killed) {
+        if (maybeKillReason.isDefined) {
           // Throw an exception rather than returning, because returning within a try{} block
           // causes a NonLocalReturnControl exception to be thrown. The NonLocalReturnControl
           // exception will be caught by the catch block, leading to an incorrect ExceptionFailure
           // for the task.
-          throw new TaskKilledException
+          throw new TaskKilledException(maybeKillReason.get)
--- End diff --

Same as above here - atomic use of `maybeKillReason` required.
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r107239694

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -467,7 +474,7 @@ private[spark] class TaskSchedulerImpl private[scheduler](
       taskState: TaskState,
       reason: TaskFailedReason): Unit = synchronized {
     taskSetManager.handleFailedTask(tid, taskState, reason)
-    if (!taskSetManager.isZombie && taskState != TaskState.KILLED) {
+    if (!taskSetManager.isZombie) {
--- End diff --

@ericl Actually that is not correct. Killed tasks were not candidates for resubmission on failure, and hence there is no need to revive offers when task kills are detected. If they are to be made candidates, we need to make this expectation explicit elsewhere as well, to be consistent.
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r107235395

--- Diff: core/src/test/scala/org/apache/spark/SparkContextSuite.scala ---
@@ -569,8 +575,10 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext with Eventu
           Thread.sleep(999)
         }
         // second attempt succeeds immediately
+        SparkContextSuite.taskSucceeded = true
       }
     }
+    assert(SparkContextSuite.taskSucceeded)
--- End diff --

Are both the listener and the task setting `taskSucceeded`? That does not look right. I am assuming we need one failure to be raised with the appropriate message, and one task success - to ensure listener success. Additionally, re-execution of the task to indicate success of the task (though this aspect should be covered in some other test already).
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r107234055

--- Diff: core/src/main/scala/org/apache/spark/TaskContextImpl.scala ---
@@ -59,8 +59,8 @@ private[spark] class TaskContextImpl(
   /** List of callback functions to execute when the task fails. */
   @transient private val onFailureCallbacks = new ArrayBuffer[TaskFailureListener]

-  // Whether the corresponding task has been killed.
-  @volatile private var interrupted: Boolean = false
+  // If defined, the corresponding task has been killed for the contained reason.
+  @volatile private var maybeKillReason: Option[String] = None
--- End diff --

nit: Overloading `maybeKillReason` to indicate `interrupted` status smells a bit; but might be ok for now.
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r107234445

--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -160,15 +160,20 @@ private[spark] abstract class Task[T](
   // A flag to indicate whether the task is killed. This is used in case context is not yet
   // initialized when kill() is invoked.
-  @volatile @transient private var _killed = false
+  @volatile @transient private var _maybeKillReason: String = null
--- End diff --

Any reason to make this a String and not Option[String] - like the other places it is defined/used?
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r107237325

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
@@ -215,7 +215,8 @@ private[spark] class PythonRunner(
       case e: Exception if context.isInterrupted =>
         logDebug("Exception thrown after task interruption", e)
-        throw new TaskKilledException
+        context.killTaskIfInterrupted()
+        null  // not reached
--- End diff --

nit: It would be good if we could directly throw the exception here, instead of relying on `killTaskIfInterrupted` to do the right thing (the task is interrupted already according to the case check). Not only would it remove the unreachable `null`, it would also ensure that future changes to `killTaskIfInterrupted`, interrupt reset, etc. do not break this.
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r107235703

--- Diff: core/src/main/scala/org/apache/spark/TaskEndReason.scala ---
@@ -212,8 +212,8 @@ case object TaskResultLost extends TaskFailedReason {
  * Task was killed intentionally and needs to be rescheduled.
  */
@DeveloperApi
-case object TaskKilled extends TaskFailedReason {
-  override def toErrorString: String = "TaskKilled (killed intentionally)"
+case class TaskKilled(reason: String) extends TaskFailedReason {
+  override def toErrorString: String = s"TaskKilled ($reason)"
--- End diff --

That is unfortunate, but it looks like it can't be helped if we need this feature. Probably something to keep in mind with future use of case objects! Thanks for clarifying.
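The tradeoff mridulm calls "unfortunate" above is worth spelling out. A minimal sketch (a hypothetical, simplified hierarchy, not Spark's actual `TaskEndReason.scala`): a `case object` is a singleton, so it cannot carry a per-kill reason; turning it into a `case class` gains the field but changes how existing pattern matches must be written, which is the compatibility cost being discussed.

```scala
// Simplified illustration of the case object -> case class migration.
sealed trait TaskFailedReason { def toErrorString: String }

// After the change: each kill carries its own reason string.
case class TaskKilled(reason: String) extends TaskFailedReason {
  override def toErrorString: String = s"TaskKilled ($reason)"
}

// Code that previously matched the bare singleton (`case TaskKilled =>`)
// must now bind or ignore the field (`case TaskKilled(_) =>`) or use a
// type pattern (`case _: TaskKilled =>`).
def describe(r: TaskFailedReason): String = r match {
  case TaskKilled(why) => s"killed: $why"
}
```

This is why the change cannot be made source-compatible: every match on the old `case object` breaks at compile time, which is the "keep in mind with future use of case objects" point.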
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r107232894

--- Diff: core/src/main/scala/org/apache/spark/TaskContextImpl.scala ---
@@ -140,16 +140,22 @@ private[spark] class TaskContextImpl(
   }

   /** Marks the task for interruption, i.e. cancellation. */
-  private[spark] def markInterrupted(): Unit = {
-    interrupted = true
+  private[spark] def markInterrupted(reason: String): Unit = {
+    maybeKillReason = Some(reason)
+  }
+
+  private[spark] override def killTaskIfInterrupted(): Unit = {
+    if (maybeKillReason.isDefined) {
+      throw new TaskKilledException(maybeKillReason.get)
--- End diff --

This is not thread safe - while technically we do not allow the kill reason to be reset to None right now, so it might be fine, it can lead to future issues. Either make all access/updates to the kill reason synchronized, or capture `maybeKillReason` into a local variable and use that in both the `if` and the `throw`.
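The local-capture fix mridulm suggests can be sketched concretely. This is a hypothetical, simplified stand-in (not Spark's actual `TaskContextImpl`), assuming only a `@volatile Option[String]` field: reading the volatile field twice — once in the `if` and again in the `throw` — is racy if another thread could reset it in between; reading it once into a local makes the check-then-use consistent.

```scala
// Minimal stand-in for the exception type under discussion.
class TaskKilledException(val reason: String) extends RuntimeException(reason)

class TaskContextSketch {
  // If defined, the task has been killed for the contained reason.
  @volatile private var maybeKillReason: Option[String] = None

  def markInterrupted(reason: String): Unit = {
    maybeKillReason = Some(reason)
  }

  def killTaskIfInterrupted(): Unit = {
    // Single volatile read: the value tested in the `if` is guaranteed to be
    // the same value used in the `throw`, even if another thread updates the
    // field concurrently.
    val reason = maybeKillReason
    if (reason.isDefined) {
      throw new TaskKilledException(reason.get)
    }
  }
}
```

With the original two-read version, a concurrent reset between `maybeKillReason.isDefined` and `maybeKillReason.get` could throw with a stale or missing reason, which is exactly the future hazard the review flags.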
[GitHub] spark issue #17350: [SPARK-20017][SQL] change the nullability of function 'S...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17350 @zhaorongsheng Not sure whether you can help us check whether all the functions set their nullability correctly?
[GitHub] spark issue #17130: [SPARK-19791] [ML] Add doc and example for fpgrowth
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17130 **[Test build #74994 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74994/testReport)** for PR 17130 at commit [`9fef280`](https://github.com/apache/spark/commit/9fef280751378dbeaa843c673fd962192320a5b1).
[GitHub] spark issue #17290: [SPARK-16599][CORE] java.util.NoSuchElementException: No...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/17290 I agree with @srowen. I don't see how this change affects the test; `blocksWithReleasedLocks` should be unchanged w.r.t. this test.
[GitHub] spark pull request #17377: [SPARK-19949][SQL][FOLLOW-UP] Make parse modes as...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17377#discussion_r107243921

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala ---
```diff
@@ -17,25 +17,35 @@

 package org.apache.spark.sql.catalyst.util

-object ParseModes {
-  val PERMISSIVE_MODE = "PERMISSIVE"
-  val DROP_MALFORMED_MODE = "DROPMALFORMED"
-  val FAIL_FAST_MODE = "FAILFAST"
+import org.apache.spark.internal.Logging

-  val DEFAULT = PERMISSIVE_MODE
+object ParseMode extends Enumeration with Logging {
```
--- End diff --

Not sure whether we should use a Java enum instead. cc @cloud-fan
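For reference, the `Enumeration`-based shape under discussion can be sketched roughly as follows. This is a simplified stand-in for the real `ParseMode` object, not its actual implementation; the value names come from the old `ParseModes` string constants, and the lookup/fallback behavior shown here is illustrative:

```scala
// Rough sketch of an Enumeration-backed ParseMode. The alternative being
// raised — a Java enum — would give exhaustiveness checking for Java
// callers and avoid Enumeration.Value boxing, at the cost of leaving
// Scala's standard library idiom.
object ParseMode extends Enumeration {
  val Permissive = Value("PERMISSIVE")
  val DropMalformed = Value("DROPMALFORMED")
  val FailFast = Value("FAILFAST")

  // Case-insensitive lookup, falling back to PERMISSIVE for unknown modes
  // (a hypothetical policy for this sketch).
  def fromString(mode: String): Value =
    values.find(_.toString == mode.toUpperCase).getOrElse(Permissive)
}
```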
[GitHub] spark issue #17360: [SPARK-20029][ML] ML LinearRegression supports bound con...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/17360 @yanboliang Thanks for your feedback! The design of the optimizer interface, or even whether it should be included at all, is definitely open for discussion, and your suggestions are much appreciated. If SPARK-17136 proceeds as you suggest (an internal optimization API that allows users to register optimizers), then it is possible that this PR does not conflict with that JIRA (though I don't know the details, so even that I'm not sure of). However, that matter is far from settled. If we end up providing the external optimizer API as currently suggested in that JIRA, then these two _do_ conflict. If we add the ability to specify parameter bounds on the estimator and then add an optimizer API, we have added yet more optimizer parameters to the estimator that can conflict with parameters of the optimizer provided to it. My point is that these are two competing approaches, and we should settle on one over the other before we make API changes that cannot be undone. I'm open to potentially changing the design of SPARK-17136, but we need to decide on something first.
[GitHub] spark issue #17377: [SPARK-19949][SQL][FOLLOW-UP] Make parse modes as enum a...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17377 So far, the documentation of these data source options is missing. In the last release, we cleaned up the [JDBC options](http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases) in the documentation. Do you think you have the bandwidth to do the same for CSV and JSON?
[GitHub] spark issue #17336: [SPARK-20003] [ML] FPGrowthModel setMinConfidence should...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17336 **[Test build #74995 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74995/testReport)** for PR 17336 at commit [`9c046c3`](https://github.com/apache/spark/commit/9c046c3bb8dfd6dd0fa2799d434a4f92cbb1b802).
[GitHub] spark pull request #17378: [SPARK-20046][SQL] Facilitate loop optimizations ...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/17378 [SPARK-20046][SQL] Facilitate loop optimizations in a JIT compiler regarding sqlContext.read.parquet()

## What changes were proposed in this pull request?

This PR improves the performance of operations using `sqlContext.read.parquet()` by changing the Java code generated by Catalyst. It is inspired by [the blog article](https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html) and [this Stack Overflow entry](http://stackoverflow.com/questions/40629435/fast-parquet-row-count-in-spark). This PR changes the generated code in two ways:

1. Replace a while-loop over long instance variables with a for-loop over int local variables
2. Suppress generation of the `shouldStop()` method when it is unnecessary (e.g. `append()` is not generated)

These points facilitate optimizations in a JIT compiler by feeding it the simplified Java code. The performance of `sqlContext.read.parquet().count` is improved by 1.09x.
Benchmark program:

```scala
val dir = "/dev/shm/parquet"
val N = 1000 * 1000 * 40
val iters = 20
val benchmark = new Benchmark("Parquet", N * iters, minNumIters = 5, warmupTime = 30.seconds)
sparkSession.range(N).write.mode("overwrite").parquet(dir)
benchmark.addCase("count") { i: Int =>
  var n = 0
  var len = 0L
  while (n < iters) {
    len += sparkSession.read.parquet(dir).count
    n += 1
  }
}
benchmark.run
```

Performance result without this PR:

```
OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13 on Linux 4.4.0-47-generic
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
Parquet:        Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
-------------------------------------------------------------------------
w/o this PR           1152 / 1211        694.7            1.4        1.0X
```

Performance result with this PR:

```
OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13 on Linux 4.4.0-47-generic
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
Parquet:        Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
-------------------------------------------------------------------------
with this PR          1053 / 1121        760.0            1.3        1.0X
```

Here is a comparison between the generated code without and with this PR. Only the method `agg_doAggregateWithoutKey` is changed.
Generated code without this PR:

```java
/* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
/* 007 */   private scala.collection.Iterator[] inputs;
/* 008 */   private boolean agg_initAgg;
/* 009 */   private boolean agg_bufIsNull;
/* 010 */   private long agg_bufValue;
/* 011 */   private scala.collection.Iterator scan_input;
/* 012 */   private org.apache.spark.sql.execution.metric.SQLMetric scan_numOutputRows;
/* 013 */   private org.apache.spark.sql.execution.metric.SQLMetric scan_scanTime;
/* 014 */   private long scan_scanTime1;
/* 015 */   private org.apache.spark.sql.execution.vectorized.ColumnarBatch scan_batch;
/* 016 */   private int scan_batchIdx;
/* 017 */   private org.apache.spark.sql.execution.metric.SQLMetric agg_numOutputRows;
/* 018 */   private org.apache.spark.sql.execution.metric.SQLMetric agg_aggTime;
/* 019 */   private UnsafeRow agg_result;
/* 020 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder;
/* 021 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter;
/* 022 */
/* 023 */   public GeneratedIterator(Object[] references) {
/* 024 */     this.references = references;
/* 025 */   }
/* 026 */
/* 027 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 028 */     partitionIndex = index;
/* 029 */     this.inputs = inputs;
/* 030 */     agg_initAgg = false;
/* 031 */
/* 032 */     scan_input = inputs[0];
/* 033 */     this.scan_numOutputRows = (org.apache.spark.sql.execution.metric.SQLMetric) references[0];
/* 034 */     this.scan_scanTime = (org.apache.spark.sql.execution.metric.SQLMetric) references[1];
/* 035 */     scan_scanTime1 = 0;
/* 036 */     scan_batch = null;
/* 037 */     scan_batchIdx = 0;
/* 038 */     this.agg_numOutputRows = (org.apache.spark.sql.execution.metric.SQLMetric) references[2];
/* 039 */     this.agg_aggTime = (org.apache.spark.sql.execution.metric.SQLMetric) r
```
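The first change in the PR description — replacing a while-loop over long instance fields with a counted for-loop over int locals — can be illustrated with a standalone sketch. The class and variable names below are illustrative, not the actual generated names:

```scala
// Illustration of change 1. A JIT compiler can apply counted-loop
// optimizations (bounds-check elimination, unrolling) far more readily to
// a for-loop over an int local than to a while-loop whose induction
// variable is a long field it must reload and store on every iteration.

// Shape of the code before: induction state lives in a (long) field.
class Before {
  private var rowIdx: Long = 0L
  def consume(numRows: Long, visit: Long => Unit): Unit = {
    while (rowIdx < numRows) {
      visit(rowIdx)
      rowIdx += 1   // field write on every iteration
    }
  }
}

// Shape of the code after: a counted loop over an int local.
class After {
  def consume(numRows: Int, visit: Int => Unit): Unit = {
    for (rowIdx <- 0 until numRows) {  // int local induction variable
      visit(rowIdx)
    }
  }
}
```

Both shapes visit the same rows; the difference the PR relies on is purely in how the JIT compiler can treat the loop.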
[GitHub] spark issue #17378: [SPARK-20046][SQL] Facilitate loop optimizations in a JI...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17378 **[Test build #74996 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74996/testReport)** for PR 17378 at commit [`d74b6cf`](https://github.com/apache/spark/commit/d74b6cf5fb63479040e940e5797e0b226367b227).
[GitHub] spark issue #17130: [SPARK-19791] [ML] Add doc and example for fpgrowth
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17130 **[Test build #74994 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74994/testReport)** for PR 17130 at commit [`9fef280`](https://github.com/apache/spark/commit/9fef280751378dbeaa843c673fd962192320a5b1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17130: [SPARK-19791] [ML] Add doc and example for fpgrowth
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17130 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74994/ Test PASSed.
[GitHub] spark issue #17130: [SPARK-19791] [ML] Add doc and example for fpgrowth
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17130 Merged build finished. Test PASSed.
[GitHub] spark issue #17377: [SPARK-19949][SQL][FOLLOW-UP] Make parse modes as enum a...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17377 Definitely. Thanks for asking it. Let me open another PR soon for both.
[GitHub] spark issue #17295: [SPARK-19556][core] Do not encrypt block manager data in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17295 **[Test build #74991 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74991/testReport)** for PR 17295 at commit [`107e3e7`](https://github.com/apache/spark/commit/107e3e72e81d2c7813d832d3e9c2beab89e01379).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17295: [SPARK-19556][core] Do not encrypt block manager data in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17295 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74991/ Test FAILed.
[GitHub] spark issue #17295: [SPARK-19556][core] Do not encrypt block manager data in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17295 Merged build finished. Test FAILed.
[GitHub] spark issue #17336: [SPARK-20003] [ML] FPGrowthModel setMinConfidence should...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17336 **[Test build #74995 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74995/testReport)** for PR 17336 at commit [`9c046c3`](https://github.com/apache/spark/commit/9c046c3bb8dfd6dd0fa2799d434a4f92cbb1b802).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17336: [SPARK-20003] [ML] FPGrowthModel setMinConfidence should...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17336 Merged build finished. Test PASSed.
[GitHub] spark issue #17336: [SPARK-20003] [ML] FPGrowthModel setMinConfidence should...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17336 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74995/ Test PASSed.
[GitHub] spark issue #17295: [SPARK-19556][core] Do not encrypt block manager data in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17295 **[Test build #74997 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74997/testReport)** for PR 17295 at commit [`6bda670`](https://github.com/apache/spark/commit/6bda6701bf0c266047a5fa81fd29f4fb826728c7).
[GitHub] spark issue #17371: [SPARK-19903][PYSPARK][SS] window operator miss the `wat...
Github user marmbrus commented on the issue: https://github.com/apache/spark/pull/17371 I really think the core problem here is that we allow you to use resolved attributes at all in the user API. Unfortunately we are somewhat stuck with that bad decision. Personally, I never use `df['col']` and only ever use `col("col")` since that avoids the problem. However, I don't think that piecemeal switching to unresolved attributes is a good idea.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17166 **[Test build #74993 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74993/testReport)** for PR 17166 at commit [`884a3ad`](https://github.com/apache/spark/commit/884a3ad7308e69c0ca010c344133bcce6582920d).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class TaskKilledException(val reason: String) extends RuntimeException`
[GitHub] spark issue #14617: [SPARK-17019][Core] Expose on-heap and off-heap memory u...
Github user ajbozarth commented on the issue: https://github.com/apache/spark/pull/14617 After reading through the previous comments, I agree adding checkboxes to this page is a good idea. I would even suggest that we look at making checkboxes for a few of the current columns (defaulting to shown, to keep user compatibility). I'm not sure which would be best, but I know that on many apps a few columns are never filled (disk usage and shuffle read/write first come to mind).
[GitHub] spark issue #17302: [SPARK-19959][SQL] Fix to throw NullPointerException in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17302 **[Test build #74992 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74992/testReport)** for PR 17302 at commit [`d7d0a36`](https://github.com/apache/spark/commit/d7d0a36f6b4fb78cc0a3a13f870a41b03adf882f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Merged build finished. Test FAILed.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74993/ Test FAILed.
[GitHub] spark issue #17302: [SPARK-19959][SQL] Fix to throw NullPointerException in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17302 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74992/ Test PASSed.
[GitHub] spark issue #17302: [SPARK-19959][SQL] Fix to throw NullPointerException in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17302 Merged build finished. Test PASSed.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/17166 Hi @kayousterhout, can you take over reviewing this PR? I might be tied up with other things for the next couple of weeks, and I don't want @ericl's work to be blocked on me. Thanks
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r107271185

--- Diff: core/src/main/scala/org/apache/spark/TaskContextImpl.scala ---
```diff
@@ -59,8 +59,8 @@ private[spark] class TaskContextImpl(
   /** List of callback functions to execute when the task fails. */
   @transient private val onFailureCallbacks = new ArrayBuffer[TaskFailureListener]

-  // Whether the corresponding task has been killed.
-  @volatile private var interrupted: Boolean = false
+  // If defined, the corresponding task has been killed for the contained reason.
+  @volatile private var maybeKillReason: Option[String] = None
```
--- End diff --

Yeah, the reason here is to allow this to be set atomically.
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r107273296

--- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
```diff
@@ -302,12 +298,12 @@ private[spark] class Executor(
         // If this task has been killed before we deserialized it, let's quit now. Otherwise,
         // continue executing the task.
-        if (killed) {
+        if (maybeKillReason.isDefined) {
           // Throw an exception rather than returning, because returning within a try{} block
           // causes a NonLocalReturnControl exception to be thrown. The NonLocalReturnControl
           // exception will be caught by the catch block, leading to an incorrect ExceptionFailure
           // for the task.
-          throw new TaskKilledException
+          throw new TaskKilledException(maybeKillReason.get)
```
--- End diff --

Fixed
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r107272852

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
```diff
@@ -215,7 +215,8 @@ private[spark] class PythonRunner(
           case e: Exception if context.isInterrupted =>
             logDebug("Exception thrown after task interruption", e)
-            throw new TaskKilledException
+            context.killTaskIfInterrupted()
+            null  // not reached
```
--- End diff --

Done
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r107274054

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -467,7 +474,7 @@ private[spark] class TaskSchedulerImpl private[scheduler](
      taskState: TaskState,
      reason: TaskFailedReason): Unit = synchronized {
    taskSetManager.handleFailedTask(tid, taskState, reason)
-   if (!taskSetManager.isZombie && taskState != TaskState.KILLED) {
+   if (!taskSetManager.isZombie) {
--- End diff --

There is no need, but reviving offers has no effect either way. Those tasks will not be resubmitted even if reviveOffers() is called (in fact, reviveOffers() is called periodically on a timer thread, so if this was an issue we should have already seen it).
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r107271262

--- Diff: core/src/main/scala/org/apache/spark/TaskContextImpl.scala ---
@@ -140,16 +140,22 @@ private[spark] class TaskContextImpl(
   }

   /** Marks the task for interruption, i.e. cancellation. */
-  private[spark] def markInterrupted(): Unit = {
-    interrupted = true
+  private[spark] def markInterrupted(reason: String): Unit = {
+    maybeKillReason = Some(reason)
+  }
+
+  private[spark] override def killTaskIfInterrupted(): Unit = {
+    if (maybeKillReason.isDefined) {
+      throw new TaskKilledException(maybeKillReason.get)
--- End diff --

Done
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r107273498

--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -160,15 +160,20 @@ private[spark] abstract class Task[T](
   // A flag to indicate whether the task is killed. This is used in case context is not yet
   // initialized when kill() is invoked.
-  @volatile @transient private var _killed = false
+  @volatile @transient private var _maybeKillReason: String = null
--- End diff --

This one gets deserialized to null sometimes, so it seemed cleaner to use a bare string.
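The "deserialized to null" remark refers to Java serialization semantics: a `@transient` field is skipped during `writeObject`, so on the reader side it comes back as the JVM default (`null` for reference types), not as the field's initializer. With an `Option[String]` that means the field can be `null` rather than `None` after deserialization, and any `.isDefined` call NPEs; a bare nullable `String` keeps the null-ness explicit. A minimal sketch (class and field names hypothetical):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

class KillFlag extends Serializable {
  // Skipped by serialization; restored as null on the reader side,
  // NOT as None -- so callers would NPE on maybeReason.isDefined.
  @transient var maybeReason: Option[String] = None
}

object TransientDemo {
  // Round-trip an object through Java serialization.
  def roundTrip[T <: AnyRef](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
      .readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    val copy = roundTrip(new KillFlag)
    println(copy.maybeReason == null) // true: the Option came back as null
  }
}
```

Field initializers run in the constructor, which Java deserialization bypasses for serializable classes, which is why the `None` initializer does not help here.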
[GitHub] spark issue #17355: [SPARK-19955][WIP][PySpark] Jenkins Python Conda based t...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17355 Jenkins retest this please
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17166 **[Test build #74999 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74999/testReport)** for PR 17166 at commit [`203a900`](https://github.com/apache/spark/commit/203a90020031b71d976f60491d757c4d78b85517).
[GitHub] spark issue #17355: [SPARK-19955][WIP][PySpark] Jenkins Python Conda based t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17355 **[Test build #74998 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74998/testReport)** for PR 17355 at commit [`267837c`](https://github.com/apache/spark/commit/267837cd741b9a1d50842e485c20033aa9b77f8f).
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17166 **[Test build #75000 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75000/testReport)** for PR 17166 at commit [`6e8593b`](https://github.com/apache/spark/commit/6e8593b9bb88a2b0bf90e39887368cc4535480b6).
[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/16596 Ah we are only changing this on Windows - I see. This is a lower risk change then. LGTM. Merging this to master, branch-2.1 cc @HyukjinKwon
[GitHub] spark issue #17361: [SPARK-20030][SS] Event-time-based timeout for MapGroups...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17361 **[Test build #75001 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75001/testReport)** for PR 17361 at commit [`6759165`](https://github.com/apache/spark/commit/6759165f9b6d26c87b94e7acc40914ae4ca37a89).
[GitHub] spark pull request #17379: [SPARK-20048][SQL] Cloning SessionState does not ...
GitHub user kunalkhamar opened a pull request: https://github.com/apache/spark/pull/17379

[SPARK-20048][SQL] Cloning SessionState does not clone query execution listeners

## What changes were proposed in this pull request?

Bugfix from SPARK-19540. Cloning SessionState does not clone query execution listeners, so cloned session is unable to listen to events on queries.

## How was this patch tested?

- Unit test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kunalkhamar/spark clone-bugfix

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17379.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17379

commit ad77fe9ad258eac224f069bbc89294818ee6b549
Author: Kunal Khamar
Date: 2017-03-21T21:16:04Z

Fix cloning of listener manager. Remove redundant comments.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Merged build finished. Test FAILed.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74999/ Test FAILed.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17166 **[Test build #74999 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74999/testReport)** for PR 17166 at commit [`203a900`](https://github.com/apache/spark/commit/203a90020031b71d976f60491d757c4d78b85517).
* This patch **fails MiMa tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17379: [SPARK-20048][SQL] Cloning SessionState does not clone q...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17379 **[Test build #75002 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75002/testReport)** for PR 17379 at commit [`ad77fe9`](https://github.com/apache/spark/commit/ad77fe9ad258eac224f069bbc89294818ee6b549).
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17166 **[Test build #75000 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75000/testReport)** for PR 17166 at commit [`6e8593b`](https://github.com/apache/spark/commit/6e8593b9bb88a2b0bf90e39887368cc4535480b6).
* This patch **fails MiMa tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75000/ Test FAILed.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Merged build finished. Test FAILed.
[GitHub] spark pull request #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-subm...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16596
[GitHub] spark issue #17361: [SPARK-20030][SS] Event-time-based timeout for MapGroups...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17361 **[Test build #75003 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75003/testReport)** for PR 17361 at commit [`d0758eb`](https://github.com/apache/spark/commit/d0758ebd6b78c6cde97e9750275a0fbba93da764).
[GitHub] spark issue #15899: [SPARK-18466] added withFilter method to RDD
Github user danielyli commented on the issue: https://github.com/apache/spark/pull/15899

Hello, I found this issue after encountering the error `'withFilter' method does not yet exist on RDD[(Int, Double)], using 'filter' method instead` in my code. I'm writing a somewhat complicated `flatMap`-`flatMap`-`map` expression involving pair RDDs, and the code is becoming busy enough that sugaring them into a `for` expression is warranted for readability. Since I'm not using any `filter`s or `if`s in the `for` expression, I found the above error message puzzling. After some tinkering, I think I've found a minimal reproducible case:

```scala
for ((k, v) <- pairRdd) yield ...  // pairRdd is of type RDD[(_, _)]
```

Curiously, the `withFilter` error doesn't occur if I write `for (x <- pairRdd) yield ...`. @rxin, do you have any insight into this?
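The behavior danielyli describes comes from how the Scala compiler desugars for-comprehensions: a tuple pattern like `(k, v)` in a generator is treated as potentially refutable, so the compiler inserts a `withFilter` step to discard non-matching elements before the `map`, while a bare variable `x` always matches and needs no filter. A sketch of the desugaring, with a plain `List` standing in for the RDD:

```scala
object ForDesugarDemo {
  def main(args: Array[String]): Unit = {
    val pairs = List(1 -> "a", 2 -> "b")

    // for ((k, v) <- pairs) yield s"$k$v"   desugars to roughly:
    val withPattern = pairs
      .withFilter { case (_, _) => true } // inserted for the tuple pattern
      .map { case (k, v) => s"$k$v" }

    // for (x <- pairs) yield ...   has no pattern to check, so no withFilter:
    val withVariable = pairs.map(x => s"${x._1}${x._2}")

    println(withPattern)  // List(1a, 2b)
    println(withVariable) // List(1a, 2b)
  }
}
```

Since `RDD` (before this PR) defines `filter` but not `withFilter`, the compiler falls back to `filter` and emits the warning quoted above even though no `if` guard appears in the source.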
[GitHub] spark issue #17378: [SPARK-20046][SQL] Facilitate loop optimizations in a JI...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17378 **[Test build #74996 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74996/testReport)** for PR 17378 at commit [`d74b6cf`](https://github.com/apache/spark/commit/d74b6cf5fb63479040e940e5797e0b226367b227).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17378: [SPARK-20046][SQL] Facilitate loop optimizations in a JI...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17378 Merged build finished. Test PASSed.
[GitHub] spark issue #17378: [SPARK-20046][SQL] Facilitate loop optimizations in a JI...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17378 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74996/ Test PASSed.
[GitHub] spark issue #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/11119 **[Test build #75004 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75004/consoleFull)** for PR 11119 at commit [`6f169eb`](https://github.com/apache/spark/commit/6f169ebf8c0c832010d2dbd8f971cfabff7870f2).
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r107281316 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,153 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth +#' +#' A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in +#' Li et al., PFP: Parallel FP-Growth for Query +#' Recommendation <\url{http://dx.doi.org/10.1145/1454008.1454027}>. +#' PFP distributes computation in such a way that each worker executes an +#' independent group of mining tasks. The FP-Growth algorithm is described in +#' Han et al., Mining frequent patterns without +#' candidate generation <\url{http://dx.doi.org/10.1145/335191.335372}>. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param itemsCol Items column name. 
+#' @param numPartitions Number of partitions used for fitting. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' raw_data <- read.df( +#' "data/mllib/sample_fpgrowth.txt", +#' source = "csv", +#' schema = structType(structField("raw_items", "string"))) +#' +#' data <- selectExpr(raw_data, "split(raw_items, ' ') as items") +#' model <- spark.fpGrowth(data) +#' +#' # Show frequent itemsets +#' frequent_itemsets <- spark.freqItemsets(model) +#' showDF(frequent_itemsets) +#' +#' # Show association rules +#' association_rules <- spark.associationRules(model) +#' showDF(association_rules) +#' +#' # Predict on new data +#' new_itemsets <- data.frame(items = c("t", "t,s")) +#' new_data <- selectExpr(createDataFrame(new_itemsets), "split(items, ',') as items") +#' predict(model, new_data) +#' +#' # Save and load model +#' path <- "/path/to/model" +#' write.ml(model, path) +#' read.ml(path) +#' +#' # Optional arguments +#' baskets_data <- selectExpr(createDataFrame(itemsets), "split(items, ',') as baskets") +#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5 +#' itemsCol = "baskets", numPartitions = 10) +#' } +#' @references \url{http://en.wikipedia.org/wiki/Association_rule_learning} +#' @note spark.fpGrowth since 2.2.0 +setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"), + function(data, minSupport = 0.3, minConfidence = 0.8, + itemsCol = "items", numPartitions = -1) {

--- End diff --

Correct if I am wrong but this cannot be done like this. If we want to default to `NULL` (I am not fond of this idea) we have to pass argument as a `character` / `String` and parse it once in JVM.
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r107281460

--- Diff: R/pkg/R/mllib_fpm.R --- (quoting the same hunk as the previous comment)
+#' @references \url{http://en.wikipedia.org/wiki/Association_rule_learning}
--- End diff --

I'll remove it completely and just link to the docs.
[GitHub] spark pull request #17088: [SPARK-19753][CORE] Un-register all shuffle outpu...
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/17088#discussion_r107283229

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -1331,7 +1328,20 @@ class DAGScheduler(
        // TODO: mark the executor as failed only if there were lots of fetch failures on it
        if (bmAddress != null) {
-         handleExecutorLost(bmAddress.executorId, filesLost = true, Some(task.epoch))
+         val hostToUnregisterOutputs = if (env.blockManager.externalShuffleServiceEnabled) {
+           // We had a fetch failure with the external shuffle service, so we
+           // assume all shuffle data on the node is bad.
+           Some(bmAddress.host)
+         } else {
+           // Deregister shuffle data just for one executor (we don't have any
--- End diff --

nit: "Unregister" is used elsewhere (function names, etc.), not "deregister".
[GitHub] spark pull request #16826: [SPARK-19540][SQL] Add ability to clone SparkSess...
Github user kunalkhamar commented on a diff in the pull request: https://github.com/apache/spark/pull/16826#discussion_r107283640 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala --- @@ -17,43 +17,70 @@ package org.apache.spark.sql.internal -import java.io.File - import org.apache.hadoop.conf.Configuration -import org.apache.hadoop.fs.Path +import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql._ -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.{Analyzer, FunctionRegistry} import org.apache.spark.sql.catalyst.catalog._ import org.apache.spark.sql.catalyst.optimizer.Optimizer import org.apache.spark.sql.catalyst.parser.ParserInterface import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.rules.Rule import org.apache.spark.sql.execution._ -import org.apache.spark.sql.execution.command.AnalyzeTableCommand import org.apache.spark.sql.execution.datasources._ -import org.apache.spark.sql.streaming.{StreamingQuery, StreamingQueryManager} +import org.apache.spark.sql.streaming.StreamingQueryManager import org.apache.spark.sql.util.ExecutionListenerManager /** * A class that holds all session-specific state in a given [[SparkSession]]. + * @param functionRegistry Internal catalog for managing functions registered by the user. + * @param catalog Internal catalog for managing table and database states. + * @param sqlParser Parser that extracts expressions, plans, table identifiers etc. from SQL texts. + * @param analyzer Logical query plan analyzer for resolving unresolved attributes and relations. + * @param streamingQueryManager Interface to start and stop + * [[org.apache.spark.sql.streaming.StreamingQuery]]s. + * @param queryExecutionCreator Lambda to create a [[QueryExecution]] from a [[LogicalPlan]] --- End diff -- @rxin Removing the redundant comments in [SPARK-20048](https://github.com/apache/spark/pull/17379). 
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17088: [SPARK-19753][CORE] Un-register all shuffle outpu...
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/17088#discussion_r107284085

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---

@@ -1389,8 +1423,7 @@ class DAGScheduler(
         clearCacheLocs()
       }
     } else {
-      logDebug("Additional executor lost message for " + execId +
-        "(epoch " + currentEpoch + ")")
+      logDebug("Additional executor lost message for %s (epoch %d)".format(execId, currentEpoch))

--- End diff --

nit: prefer string interpolation over `format`.
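For reference, the three styles in play here produce the same string (the original concatenation was actually missing a space before "(epoch"). A minimal plain-Scala sketch, with made-up values for `execId` and `currentEpoch`:

```scala
// Three ways to build the same log message; the identifiers mirror the
// DAGScheduler diff above, but the values here are illustrative.
val execId = "exec-3"
val currentEpoch = 7

// Concatenation (the original style, with the missing space restored).
val viaConcat = "Additional executor lost message for " + execId + " (epoch " + currentEpoch + ")"

// Positional format specifiers (the diff's replacement).
val viaFormat = "Additional executor lost message for %s (epoch %d)".format(execId, currentEpoch)

// The reviewer's preference: the s-interpolator, which checks that the
// referenced names exist at compile time and avoids positional %s/%d.
val viaInterp = s"Additional executor lost message for $execId (epoch $currentEpoch)"
```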
[GitHub] spark pull request #17088: [SPARK-19753][CORE] Un-register all shuffle outpu...
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/17088#discussion_r107284202

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---

@@ -1683,11 +1716,12 @@ private[scheduler] class DAGSchedulerEventProcessLoop(dagScheduler: DAGScheduler
       dagScheduler.handleExecutorAdded(execId, host)

     case ExecutorLost(execId, reason) =>
-      val filesLost = reason match {
-        case SlaveLost(_, true) => true
+      val workerLost = reason match {
+        case SlaveLost(_, true) =>
+          true

--- End diff --

nit: prefer it without the line break for something this simple
[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r107283840

--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlations.scala ---

@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.linalg.{SQLDataTypes, Vector}
+import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
+import org.apache.spark.mllib.stat.{Statistics => OldStatistics}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * API for statistical functions in MLlib, compatible with Dataframes and Datasets.
+ *
+ * The functions in this package generalize the functions in [[org.apache.spark.sql.Dataset.stat]]
+ * to spark.ml's Vector types.
+ */
+@Since("2.2.0")
+@Experimental
+object Correlations {
+
+  /**
+   * Compute the correlation matrix for the input RDD of Vectors using the specified method.
+   * Methods currently supported: `pearson` (default), `spearman`.
+   *
+   * @param dataset A dataset or a dataframe
+   * @param column The name of the column of vectors for which the correlation coefficient needs
+   *               to be computed. This must be a column of the dataset, and it must contain
+   *               Vector objects.
+   * @param method String specifying the method to use for computing correlation.
+   *               Supported: `pearson` (default), `spearman`
+   * @return A dataframe that contains the correlation matrix of the column of vectors. This
+   *         dataframe contains a single row and a single column of name '$METHODNAME($COLUMN)'.
+   * @throws IllegalArgumentException if the column is not a valid column in the dataset, or if
+   *                                  the content of this column is not of type Vector.
+   *
+   * Here is how to access the correlation coefficient:
+   * {{{
+   *   val data: Dataset[Vector] = ...
+   *   val Row(coeff: Matrix) = Statistics.corr(data, "value").head
+   *   // coeff now contains the Pearson correlation matrix.
+   * }}}
+   *
+   * @note For Spearman, a rank correlation, we need to create an RDD[Double] for each column
+   * and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector],
+   * which is fairly costly. Cache the input RDD before calling corr with `method = "spearman"` to
+   * avoid recomputing the common lineage.
+   */
+  @Since("2.2.0")
+  def corr(dataset: Dataset[_], column: String, method: String): DataFrame = {
+    val rdd = dataset.select(column).rdd.map {
+      case Row(v: Vector) => OldVectors.fromML(v)
+    }
+    val oldM = OldStatistics.corr(rdd, method)
+    val name = s"$method($column)"
+    val schema = StructType(Array(StructField(name, SQLDataTypes.MatrixType, nullable = true)))
+    dataset.sparkSession.createDataFrame(Seq(Row(oldM.asML)).asJava, schema)
+  }

+  /**
+   * Compute the correlation matrix for the input Dataset of Vectors.

--- End diff --

Just say that this is a version of corr which defaults to "pearson" for the method. Don't document params or return value.
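The `@note` above boils down to: Spearman is Pearson computed on ranks, and producing the ranks requires a per-column sort (hence the advice to cache the input). A minimal plain-Scala sketch of both steps, without Spark; `pearson`, `ranks`, and `spearman` are illustrative helper names, not part of the API under review:

```scala
// Pearson correlation of two equal-length sequences.
def pearson(x: Seq[Double], y: Seq[Double]): Double = {
  val n = x.length
  val mx = x.sum / n
  val my = y.sum / n
  val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
  val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
  val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
  cov / (sx * sy)
}

// Replace each value by its 1-based rank (ties ignored for brevity).
// This sort-and-rank step is what makes method = "spearman" costly on
// an RDD: each column must be sorted, then joined back into vectors.
def ranks(x: Seq[Double]): Seq[Double] = {
  val order = x.zipWithIndex.sortBy(_._1).map(_._2)
  val r = new Array[Double](x.length)
  order.zipWithIndex.foreach { case (origIdx, rank) => r(origIdx) = rank + 1.0 }
  r.toSeq
}

// Spearman correlation = Pearson correlation of the ranks.
def spearman(x: Seq[Double], y: Seq[Double]): Double =
  pearson(ranks(x), ranks(y))
```

For a monotone but nonlinear relationship (e.g. y = x^2 on positive x), Spearman returns exactly 1.0 while Pearson returns slightly less, which is the behavioral difference between the two supported methods.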
[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r107074472

--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlations.scala ---

@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.linalg.{SQLDataTypes, Vector}
+import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
+import org.apache.spark.mllib.stat.{Statistics => OldStatistics}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * API for statistical functions in MLlib, compatible with Dataframes and Datasets.

--- End diff --

This should be limited to correlations
[GitHub] spark pull request #17108: [SPARK-19636][ML] Feature parity for correlation ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17108#discussion_r107075473

--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlations.scala ---

@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.linalg.{SQLDataTypes, Vector}
+import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
+import org.apache.spark.mllib.stat.{Statistics => OldStatistics}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * API for statistical functions in MLlib, compatible with Dataframes and Datasets.
+ *
+ * The functions in this package generalize the functions in [[org.apache.spark.sql.Dataset.stat]]
+ * to spark.ml's Vector types.
+ */
+@Since("2.2.0")
+@Experimental
+object Correlations {

--- End diff --

How about calling it "Correlation" (singular)? Especially if we add a builder pattern, then I feel like ```new Correlation().set...``` seems more natural.
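A hypothetical sketch of the builder shape this comment alludes to; the class body, setter names, and `describe` method are purely illustrative, not the API that was eventually merged:

```scala
// Illustrative builder-style correlation config; not Spark's actual API.
// A real version would expose a method taking a Dataset and returning a
// DataFrame, as in the `corr` function under review.
class Correlation {
  private var method: String = "pearson"
  private var column: String = "features"

  // Setters return this.type so calls can be chained fluently.
  def setMethod(value: String): this.type = {
    require(Set("pearson", "spearman").contains(value), s"Unsupported method: $value")
    method = value
    this
  }

  def setColumn(value: String): this.type = { column = value; this }

  // Stand-in for the real computation; mirrors the '$METHODNAME($COLUMN)'
  // output column naming described in the scaladoc above.
  def describe: String = s"$method($column)"
}
```

Usage would then read `new Correlation().setMethod("spearman").setColumn("value")`, which is the "more natural" chaining the singular name supports.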