[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r226874142 --- Diff: R/pkg/tests/fulltests/test_eager_execution.R --- @@ -0,0 +1,52 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +library(testthat) + +context("Show SparkDataFrame when eager execution is enabled.") + +test_that("eager execution is not enabled", { + # Start Spark session without eager execution enabled + sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE) + + df <- createDataFrame(faithful) + expect_is(df, "SparkDataFrame") + expected <- "eruptions:double, waiting:double" + expect_output(show(df), expected) + + # Stop Spark session + sparkR.session.stop() +}) + +test_that("eager execution is enabled", { + # Start Spark session with eager execution enabled + sparkConfig <- list(spark.sql.repl.eagerEval.enabled = "true", + spark.sql.repl.eagerEval.maxNumRows = as.integer(10)) --- End diff -- add a test with `spark.sql.repl.eagerEval.maxNumRows` not set. change this test to add `spark.sql.repl.eagerEval.truncate` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
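A minimal sketch (not part of the PR) of what the two suggested cases might look like, reusing the `sparkRTestMaster` helper already defined for these tests; the expected-output patterns are illustrative rather than exact:
```
library(testthat)
library(SparkR)

test_that("eager execution is enabled with maxNumRows and truncate set", {
  # same session helpers as the existing tests in this file
  sparkConfig <- list(spark.sql.repl.eagerEval.enabled = "true",
                      spark.sql.repl.eagerEval.maxNumRows = as.integer(10),
                      spark.sql.repl.eagerEval.truncate = as.integer(7))
  sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE,
                 sparkConfig = sparkConfig)
  df <- createDataFrame(faithful)
  # "waiting" is only 7 characters, so it survives truncation either way
  expect_output(show(df), "waiting")
  sparkR.session.stop()
})

test_that("eager execution is enabled, maxNumRows not set", {
  # only the enable flag is set, so the default row limit should apply
  sparkConfig <- list(spark.sql.repl.eagerEval.enabled = "true")
  sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE,
                 sparkConfig = sparkConfig)
  df <- createDataFrame(faithful)
  expect_output(show(df), "eruptions")
  sparkR.session.stop()
})
```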
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r226874309 --- Diff: R/pkg/tests/fulltests/test_eager_execution.R --- @@ -0,0 +1,52 @@ +# --- End diff -- one thing to note, since tests run alphabetically, this test will run before sparkSQL - I think we should rename this to test_sparkSQL_eager.R to ensure it runs after --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r226874251 --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R --- @@ -3731,6 +3731,7 @@ test_that("catalog APIs, listTables, listColumns, listFunctions", { dropTempView("cars") }) + --- End diff -- remove empty line --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22668: [SPARK-25675] [Spark Job History] Job UI page does not s...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22668 @shivusondur what's your account name on https://issues.apache.org? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22668: [SPARK-25675] [Spark Job History] Job UI page does not s...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22668 merged to master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-25675][SPARK JOB HISTORY] Job UI page does not show pagination with one page
Repository: spark Updated Branches: refs/heads/master ffe256ce1 -> 4c6c6711d [SPARK-25675][SPARK JOB HISTORY] Job UI page does not show pagination with one page ## What changes were proposed in this pull request? Currently in PagedTable.scala pageNavigation() method, if it is having only one page, they were not using the pagination. Now it is made to use the pagination, even if it is having one page. ## How was this patch tested? This tested with Spark webUI and History page in spark local setup. ![pagination](https://user-images.githubusercontent.com/7912929/46592799-93bfaf00-cae3-11e8-881a-ca2e93f17818.png) Author: shivusondur Closes #22668 from shivusondur/pagination. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4c6c6711 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4c6c6711 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4c6c6711 Branch: refs/heads/master Commit: 4c6c6711d5e94378c7bb5f672314f4db751372ea Parents: ffe256c Author: shivusondur Authored: Sun Oct 21 11:44:48 2018 -0700 Committer: Felix Cheung Committed: Sun Oct 21 11:44:48 2018 -0700 -- .../scala/org/apache/spark/ui/PagedTable.scala | 210 +-- 1 file changed, 101 insertions(+), 109 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4c6c6711/core/src/main/scala/org/apache/spark/ui/PagedTable.scala -- diff --git a/core/src/main/scala/org/apache/spark/ui/PagedTable.scala b/core/src/main/scala/org/apache/spark/ui/PagedTable.scala index 2fc0259..0bbb10a 100644 --- a/core/src/main/scala/org/apache/spark/ui/PagedTable.scala +++ b/core/src/main/scala/org/apache/spark/ui/PagedTable.scala @@ -122,13 +122,9 @@ private[spark] trait PagedTable[T] { /** * Return a page navigation. - * - * If the totalPages is 1, the page navigation will be empty - * - * If the totalPages is more than 1, it will create a page navigation including a group of - * page numbers and a form to submit the page number. - * - * + * + * It will create a page navigation including a group of page numbers and a form + * to submit the page number. * * Here are some examples of the page navigation: * {{{ @@ -154,120 +150,116 @@ private[spark] trait PagedTable[T] { * }}} */ private[ui] def pageNavigation(page: Int, pageSize: Int, totalPages: Int): Seq[Node] = { -if (totalPages == 1) { - Nil -} else { - // A group includes all page numbers will be shown in the page navigation. - // The size of group is 10 means there are 10 page numbers will be shown. - // The first group is 1 to 10, the second is 2 to 20, and so on - val groupSize = 10 - val firstGroup = 0 - val lastGroup = (totalPages - 1) / groupSize - val currentGroup = (page - 1) / groupSize - val startPage = currentGroup * groupSize + 1 - val endPage = totalPages.min(startPage + groupSize - 1) - val pageTags = (startPage to endPage).map { p => -if (p == page) { - // The current page should be disabled so that it cannot be clicked. - {p} -} else { - {p} -} +// A group includes all page numbers will be shown in the page navigation. +// The size of group is 10 means there are 10 page numbers will be shown. 
+// The first group is 1 to 10, the second is 2 to 20, and so on +val groupSize = 10 +val firstGroup = 0 +val lastGroup = (totalPages - 1) / groupSize +val currentGroup = (page - 1) / groupSize +val startPage = currentGroup * groupSize + 1 +val endPage = totalPages.min(startPage + groupSize - 1) +val pageTags = (startPage to endPage).map { p => + if (p == page) { +// The current page should be disabled so that it cannot be clicked. +{p} + } else { +{p} } +} - val hiddenFormFields = { -if (goButtonFormPath.contains('?')) { - val queryString = goButtonFormPath.split("\\?", 2)(1) - val search = queryString.split("#")(0) - Splitter -.on('&') -.trimResults() -.omitEmptyStrings() -.withKeyValueSeparator("=") -.split(search) -.asScala -.filterKeys(_ != pageSizeFormField) -.filterKeys(_ != prevPageSizeFormField) -.filterKeys(_ != pageNumberFormField) -.mapValues(URLDecoder.decode(_, "UTF-8")) -.map { case (k, v) => - -} -} else { - Seq.empty -} +val hiddenFormFields = { + if (goButtonFormPath.contains('?')) { +val queryString = goButtonFormPath.split("\\?", 2)(1) +val search =
[GitHub] spark issue #22720: [SPARK-25730][K8S] Delete executor pods from kubernetes ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22720 merged to master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-25730][K8S] Delete executor pods from kubernetes after figuring out why they died
Repository: spark Updated Branches: refs/heads/master c77aa42f5 -> ffe256ce1 [SPARK-25730][K8S] Delete executor pods from kubernetes after figuring out why they died ## What changes were proposed in this pull request? `removeExecutorFromSpark` tries to fetch the reason the executor exited from Kubernetes, which may be useful if the pod was OOMKilled. However, the code previously deleted the pod from Kubernetes first which made retrieving this status impossible. This fixes the ordering. On a separate but related note, it would be nice to wait some time before removing the pod - to let the operator examine logs and such. ## How was this patch tested? Running on my local cluster. Author: Mike Kaplinskiy Closes #22720 from mikekap/patch-1. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ffe256ce Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ffe256ce Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ffe256ce Branch: refs/heads/master Commit: ffe256ce161884f0a1304b4925d51d39a9bfa5df Parents: c77aa42 Author: Mike Kaplinskiy Authored: Sun Oct 21 11:32:33 2018 -0700 Committer: Felix Cheung Committed: Sun Oct 21 11:32:33 2018 -0700 -- .../spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ffe256ce/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala -- diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala index cc254b8..1a75ae0 100644 --- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala +++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala @@ -112,8 +112,8 @@ private[spark] class ExecutorPodsLifecycleManager( execId: Long, schedulerBackend: KubernetesClusterSchedulerBackend, execIdsRemovedInRound: mutable.Set[Long]): Unit = { -removeExecutorFromK8s(podState.pod) removeExecutorFromSpark(schedulerBackend, podState, execId) +removeExecutorFromK8s(podState.pod) execIdsRemovedInRound += execId } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] spark pull request #22771: [SPARK-25773][Core]Cancel zombie tasks in a resul...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22771#discussion_r226818782 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1364,6 +1385,16 @@ private[spark] class DAGScheduler( if (job.numFinished == job.numPartitions) { markStageAsFinished(resultStage) cleanupStateForJobAndIndependentStages(job) +try { // cancelTasks will fail if a SchedulerBackend does not implement killTask + logInfo( +s"Job ${job.jobId} is finished. Killing speculative tasks for this job") + // ResultStage is only used by this job. It's safe to kill speculative or + // zombie tasks in this stage. + taskScheduler.cancelTasks(stageId, shouldInterruptTaskThread(job)) +} catch { + case e: UnsupportedOperationException => +logInfo(s"Could not cancel tasks for stage $stageId", e) --- End diff -- logWarn? aren't we leaking tasks? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] zeppelin issue #3206: [WIP][ZEPPELIN-3810] Support Spark 2.4
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3206 is this going to break, say spark 2.3 with scala 2.11.8? ---
[GitHub] spark pull request #22753: [SPARK-25754][DOC] Change CDN for MathJax
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22753#discussion_r225792435 --- Diff: docs/_layouts/global.html --- @@ -184,7 +184,8 @@ {{ page.title }} }); }; script.src = ('https:' == document.location.protocol ? 'https://' : 'http://') + - 'cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; + 'cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js' + --- End diff -- should we avoid hardcoding the version inline? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] zeppelin issue #3013: [ZEPPELIN-3511] remove old button "Download Data as CS...
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3013 ok, let's go ahead then? ---
[GitHub] spark issue #22608: [SPARK-23257][K8S][TESTS] Kerberos Support Integration T...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22608 I think the suggestion is to go to master & 2.4 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22608: [SPARK-23257][K8S][TESTS] Kerberos Support Integr...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22608#discussion_r225615864 --- Diff: bin/docker-image-tool.sh --- @@ -71,18 +71,29 @@ function build { --build-arg base_img=$(image_ref spark) ) - local BASEDOCKERFILE=${BASEDOCKERFILE:-"$IMG_PATH/spark/Dockerfile"} - local PYDOCKERFILE=${PYDOCKERFILE:-"$IMG_PATH/spark/bindings/python/Dockerfile"} - local RDOCKERFILE=${RDOCKERFILE:-"$IMG_PATH/spark/bindings/R/Dockerfile"} + local BASEDOCKERFILE=${BASEDOCKERFILE:-"$IMG_PATH/main/dockerfiles/spark/Dockerfile"} + local PYDOCKERFILE=${PYDOCKERFILE:-"$IMG_PATH/main/dockerfiles/spark/bindings/python/Dockerfile"} + local KDOCKERFILE=${KDOCKERFILE:-"$IMG_PATH/test/dockerfiles/spark/kerberos/Dockerfile"} + local RDOCKERFILE=${RDOCKERFILE:-"$IMG_PATH/main/dockerfiles/spark/bindings/R/Dockerfile"} + # Spark Base docker build $NOCACHEARG "${BUILD_ARGS[@]}" \ -t $(image_ref spark) \ -f "$BASEDOCKERFILE" . + # PySpark docker build $NOCACHEARG "${BINDING_BUILD_ARGS[@]}" \ -t $(image_ref spark-py) \ -f "$PYDOCKERFILE" . + # The following are optional docker builds for Kerberos Testing + docker pull ifilonenko/hadoop-base:latest --- End diff -- I comment on this earlier... https://github.com/apache/spark/pull/22608#issuecomment-427617701 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22730: [SPARK-16775][CORE] Remove deprecated accumulator v1 API...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22730 no accumulator in R --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22721: [SPARK-25403][SQL] Refreshes the table after inse...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22721#discussion_r225418234 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala --- @@ -189,6 +189,7 @@ case class InsertIntoHadoopFsRelationCommand( sparkSession.catalog.refreshByPath(outputPath.toString) if (catalogTable.nonEmpty) { + sparkSession.catalog.refreshTable(catalogTable.get.identifier.quotedString) --- End diff -- btw, I thought the cost of refreshing the table can be quite high? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22720: [SPARK-25730][K8S] Delete executor pods from kubernetes ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22720 cool, any more comments? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22727: [SPARK-25735][CORE][MINOR]Improve start-thriftser...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22727#discussion_r225417566 --- Diff: sbin/start-thriftserver.sh --- @@ -39,6 +39,10 @@ function usage { pattern+="\|Spark Command: " pattern+="\|===" pattern+="\|--help" + pattern+="\|Using Spark's default log4j profile:" + pattern+="\|^log4j:" + pattern+="\|Started daemon with process name" + pattern+="\|Registered signal handler for" "${SPARK_HOME}"/bin/spark-submit --help 2>&1 | grep -v Usage 1>&2 --- End diff -- it's kinda weird that `--help` will try to create the context? `ERROR SparkContext:91 - Error initializing SparkContext.`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] zeppelin issue #3013: [ZEPPELIN-3511] remove old button "Download Data as CS...
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3013 possibly? any objection from reviewers of this change/PR? ---
[GitHub] spark issue #22720: [K8S] Delete executor pods from kubernetes after figurin...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22720 can you open a JIRA --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22720: [K8S] Delete executor pods from kubernetes after figurin...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22720 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22714: [SPARK-25720][WEBUI] Support auto refresh page for the W...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22714 can this be passed in as a param in the url? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] zeppelin issue #3013: [ZEPPELIN-3511] remove old button "Download Data as CS...
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3013 any update? ---
[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22379#discussion_r224949068 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -3854,6 +3854,38 @@ object functions { @scala.annotation.varargs def map_concat(cols: Column*): Column = withExpr { MapConcat(cols.map(_.expr)) } + /** + * Parses a column containing a CSV string into a `StructType` with the specified schema. + * Returns `null`, in the case of an unparseable string. + * + * @param e a string column containing CSV data. + * @param schema the schema to use when parsing the CSV string + * @param options options to control how the CSV is parsed. accepts the same options and the + *CSV data source. + * + * @group collection_funcs + * @since 3.0.0 + */ + def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column = withExpr { +CsvToStructs(schema, options, e.expr) + } + + /** + * (Java-specific) Parses a column containing a CSV string into a `StructType` + * with the specified schema. Returns `null`, in the case of an unparseable string. + * + * @param e a string column containing CSV data. + * @param schema the schema to use when parsing the CSV string + * @param options options to control how the CSV is parsed. accepts the same options and the + *CSV data source. + * + * @group collection_funcs + * @since 3.0.0 + */ + def from_csv(e: Column, schema: String, options: java.util.Map[String, String]): Column = { --- End diff -- you can create a new type characterOrstructOrColumn? (letter casing is weird) but I wasn't following - what's the case for schema is a Column? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
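One hedged sketch of what such a dispatch could look like on the R side, in the style of the class unions SparkR already uses (e.g. `characterOrColumn`); the union name and the `toDDL` conversion here are illustrative assumptions, not the actual SparkR implementation:
```
# hypothetical union covering all three schema forms
setClassUnion("characterOrstructTypeOrColumn", c("character", "structType", "Column"))

setMethod("from_csv", signature(x = "Column", schema = "characterOrstructTypeOrColumn"),
          function(x, schema, ...) {
            if (inherits(schema, "structType")) {
              # assumption: convert the structType to its DDL string before passing to the JVM
              schema <- callJMethod(schema$jobj, "toDDL")
            }
            # dispatch for the remaining character / Column cases elided
          })
```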
[GitHub] spark pull request #22668: [SPARK-25675] [Spark Job History] Job UI page doe...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22668#discussion_r224316034 --- Diff: core/src/main/scala/org/apache/spark/ui/PagedTable.scala --- @@ -123,10 +123,9 @@ private[ui] trait PagedTable[T] { /** * Return a page navigation. * - * If the totalPages is 1, the page navigation will be empty * - * If the totalPages is more than 1, it will create a page navigation including a group of - * page numbers and a form to submit the page number. + * It will create a page navigation including a group of page numbers and a form --- End diff -- true. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22689: [SPARK-25697][CORE]When zstd compression enabled, InProg...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22689 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22681: [SPARK-25682][k8s] Package example jars in same t...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22681#discussion_r224314585 --- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile --- @@ -18,6 +18,7 @@ FROM openjdk:8-alpine ARG spark_jars=jars +ARG example_jars=examples/jars --- End diff -- could we make this optional? if someone wants to build a smaller image without example --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r223574367 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -154,3 +160,74 @@ setMethod("write.ml", signature(object = "FPGrowthModel", path = "character"), function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + +#' PrefixSpan +#' +#' A parallel PrefixSpan algorithm to mine frequent sequential patterns. +#' \code{spark.prefixSpan} returns an instance of PrefixSpan. +#' \code{spark.findFrequentSequentialPatterns} returns a complete set of frequent sequential +#' patterns. +#' For more details, see +#' \href{https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html#prefixspan}{ +#' PrefixSpan}. +#' +#' @param minSupport Minimal support level. +#' @param maxPatternLength Maximal pattern length. +#' @param maxLocalProjDBSize Maximum number of items (including delimiters used in the internal +#' storage format) allowed in a projected database before local +#' processing. +#' @param sequenceCol name of the sequence column in dataset. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.prefixSpan} returns an instance of PrefixSpan +#' @rdname spark.prefixSpan +#' @name spark.prefixSpan +#' @aliases spark.prefixSpan,ANY-method +#' @examples +#' \dontrun{ +#' df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))), +#' list(list(list(1L), list(3L, 2L), list(1L, 2L))), +#' list(list(list(1L, 2L), list(5L))), +#' list(list(list(6L, schema = c("sequence")) +#' prefix_Span <- spark.prefixSpan(minSupport = 0.5, maxPatternLength = 5L, +#' maxLocalProjDBSize = 3200L) +#' frequency <- spark.findFrequentSequentialPatterns(prefix_Span, df) +#' showDF(frequency) +#' } +#' @note spark.prefixSpan since 3.0.0 +setMethod("spark.prefixSpan", signature(), + function(minSupport=0.1, maxPatternLength=10L, + maxLocalProjDBSize=3200L, sequenceCol="sequence") { +if (!is.numeric(minSupport) || minSupport < 0) { + stop("minSupport should be a number with value >= 0.") +} +if (!is.integer(maxPatternLength) || maxPatternLength <= 0) { + stop("maxPatternLength should be a number with value > 0.") +} +if (!is.numeric(maxLocalProjDBSize) || maxLocalProjDBSize <= 0) { + stop("maxLocalProjDBSize should be a number with value > 0.") +} + +jobj <- callJStatic("org.apache.spark.ml.r.PrefixSpanWrapper", "getPrefixSpan", +as.numeric(minSupport), as.integer(maxPatternLength), +as.numeric(maxLocalProjDBSize), as.character(sequenceCol)) +new("PrefixSpan", jobj = jobj) + }) + +# Find frequent sequential patterns. + +#' @param object a PrefixSpan object. +#' @param data A SparkDataFrame. +#' @return A complete set of frequent sequential patterns in the input sequences of itemsets. +#' The returned \code{SparkDataFrame} contains columns of sequence and corresponding +#' frequency. The schema of it will be: +#' \code{sequence: ArrayType(ArrayType(T))} (T is the item type) +#' \code{freq: Long} +#' @rdname spark.prefixSpan +#' @aliases findFrequentSequentialPatterns,PrefixSpan,SparkDataFrame-method +#' @note spark.findFrequentSequentialPatterns(PrefixSpan, SparkDataFrame) since 3.0.0 + +setMethod("spark.findFrequentSequentialPatterns", + signature(object = "PrefixSpan", data = "SparkDataFrame"), --- End diff -- nm, see the other comment instead --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r223574019 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -154,3 +160,74 @@ setMethod("write.ml", signature(object = "FPGrowthModel", path = "character"), function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + +#' PrefixSpan +#' +#' A parallel PrefixSpan algorithm to mine frequent sequential patterns. +#' \code{spark.prefixSpan} returns an instance of PrefixSpan. +#' \code{spark.findFrequentSequentialPatterns} returns a complete set of frequent sequential +#' patterns. +#' For more details, see +#' \href{https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html#prefixspan}{ +#' PrefixSpan}. +#' +#' @param minSupport Minimal support level. +#' @param maxPatternLength Maximal pattern length. +#' @param maxLocalProjDBSize Maximum number of items (including delimiters used in the internal +#' storage format) allowed in a projected database before local +#' processing. +#' @param sequenceCol name of the sequence column in dataset. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.prefixSpan} returns an instance of PrefixSpan +#' @rdname spark.prefixSpan +#' @name spark.prefixSpan +#' @aliases spark.prefixSpan,ANY-method +#' @examples +#' \dontrun{ +#' df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))), +#' list(list(list(1L), list(3L, 2L), list(1L, 2L))), +#' list(list(list(1L, 2L), list(5L))), +#' list(list(list(6L, schema = c("sequence")) +#' prefix_Span <- spark.prefixSpan(minSupport = 0.5, maxPatternLength = 5L, +#' maxLocalProjDBSize = 3200L) +#' frequency <- spark.findFrequentSequentialPatterns(prefix_Span, df) +#' showDF(frequency) +#' } +#' @note spark.prefixSpan since 3.0.0 +setMethod("spark.prefixSpan", signature(), + function(minSupport=0.1, maxPatternLength=10L, + maxLocalProjDBSize=3200L, sequenceCol="sequence") { +if (!is.numeric(minSupport) || minSupport < 0) { + stop("minSupport should be a number with value >= 0.") +} +if (!is.integer(maxPatternLength) || maxPatternLength <= 0) { + stop("maxPatternLength should be a number with value > 0.") +} +if (!is.numeric(maxLocalProjDBSize) || maxLocalProjDBSize <= 0) { + stop("maxLocalProjDBSize should be a number with value > 0.") +} + +jobj <- callJStatic("org.apache.spark.ml.r.PrefixSpanWrapper", "getPrefixSpan", +as.numeric(minSupport), as.integer(maxPatternLength), +as.numeric(maxLocalProjDBSize), as.character(sequenceCol)) +new("PrefixSpan", jobj = jobj) + }) + +# Find frequent sequential patterns. + +#' @param object a PrefixSpan object. +#' @param data A SparkDataFrame. +#' @return A complete set of frequent sequential patterns in the input sequences of itemsets. +#' The returned \code{SparkDataFrame} contains columns of sequence and corresponding +#' frequency. The schema of it will be: +#' \code{sequence: ArrayType(ArrayType(T))} (T is the item type) +#' \code{freq: Long} +#' @rdname spark.prefixSpan +#' @aliases findFrequentSequentialPatterns,PrefixSpan,SparkDataFrame-method +#' @note spark.findFrequentSequentialPatterns(PrefixSpan, SparkDataFrame) since 3.0.0 + +setMethod("spark.findFrequentSequentialPatterns", + signature(object = "PrefixSpan", data = "SparkDataFrame"), --- End diff -- hmm, another one - why isn't `data` the first param? should also be called `x` (the convention in R API) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r223573856 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -154,3 +160,74 @@ setMethod("write.ml", signature(object = "FPGrowthModel", path = "character"), function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + +#' PrefixSpan +#' +#' A parallel PrefixSpan algorithm to mine frequent sequential patterns. +#' \code{spark.prefixSpan} returns an instance of PrefixSpan. +#' \code{spark.findFrequentSequentialPatterns} returns a complete set of frequent sequential +#' patterns. +#' For more details, see +#' \href{https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html#prefixspan}{ +#' PrefixSpan}. +#' +#' @param minSupport Minimal support level. +#' @param maxPatternLength Maximal pattern length. +#' @param maxLocalProjDBSize Maximum number of items (including delimiters used in the internal +#' storage format) allowed in a projected database before local +#' processing. +#' @param sequenceCol name of the sequence column in dataset. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.prefixSpan} returns an instance of PrefixSpan +#' @rdname spark.prefixSpan +#' @name spark.prefixSpan +#' @aliases spark.prefixSpan,ANY-method +#' @examples +#' \dontrun{ +#' df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))), +#' list(list(list(1L), list(3L, 2L), list(1L, 2L))), +#' list(list(list(1L, 2L), list(5L))), +#' list(list(list(6L, schema = c("sequence")) +#' prefix_Span <- spark.prefixSpan(minSupport = 0.5, maxPatternLength = 5L, +#' maxLocalProjDBSize = 3200L) +#' frequency <- spark.findFrequentSequentialPatterns(prefix_Span, df) +#' showDF(frequency) +#' } +#' @note spark.prefixSpan since 3.0.0 +setMethod("spark.prefixSpan", signature(), --- End diff -- sorry, I didn't see this before - an empty `signature()` is a bit usual should we either 1) put one of these param as required (if they are required?) and its type into the signature, eg. `signature(minSupport = "numeric")` 2) instead of having `spark.prefixSpan`, take all of these param into `spark.findFrequentSequentialPatterns` ie. ``` setMethod("spark.findFrequentSequentialPatterns", signature(object = "PrefixSpan", data = "SparkDataFrame"), function(object, data, minSupport = 0.1, maxPatternLength = 10L, maxLocalProjDBSize = 3200L, sequenceCol = "sequence") { ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
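Fleshing out option 2 a little, a hedged sketch (parameter validation elided; the wrapper call mirrors the one already in the PR, but this is not the merged code):
```
setMethod("spark.findFrequentSequentialPatterns",
          signature(data = "SparkDataFrame"),
          function(data, minSupport = 0.1, maxPatternLength = 10L,
                   maxLocalProjDBSize = 3200L, sequenceCol = "sequence") {
            # build the JVM-side PrefixSpan with the user-supplied parameters
            jobj <- callJStatic("org.apache.spark.ml.r.PrefixSpanWrapper", "getPrefixSpan",
                                as.numeric(minSupport), as.integer(maxPatternLength),
                                as.numeric(maxLocalProjDBSize), as.character(sequenceCol))
            # run the mining directly and hand back a SparkDataFrame of (sequence, freq)
            dataFrame(callJMethod(jobj, "findFrequentSequentialPatterns", data@sdf))
          })
```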
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r223572974 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -154,3 +160,74 @@ setMethod("write.ml", signature(object = "FPGrowthModel", path = "character"), function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + +#' PrefixSpan +#' +#' A parallel PrefixSpan algorithm to mine frequent sequential patterns. +#' \code{spark.prefixSpan} returns an instance of PrefixSpan. +#' \code{spark.findFrequentSequentialPatterns} returns a complete set of frequent sequential +#' patterns. +#' For more details, see +#' \href{https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html#prefixspan}{ +#' PrefixSpan}. +#' +#' @param minSupport Minimal support level. +#' @param maxPatternLength Maximal pattern length. +#' @param maxLocalProjDBSize Maximum number of items (including delimiters used in the internal +#' storage format) allowed in a projected database before local +#' processing. +#' @param sequenceCol name of the sequence column in dataset. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.prefixSpan} returns an instance of PrefixSpan +#' @rdname spark.prefixSpan +#' @name spark.prefixSpan +#' @aliases spark.prefixSpan,ANY-method +#' @examples +#' \dontrun{ +#' df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))), +#' list(list(list(1L), list(3L, 2L), list(1L, 2L))), +#' list(list(list(1L, 2L), list(5L))), +#' list(list(list(6L, schema = c("sequence")) +#' prefix_Span <- spark.prefixSpan(minSupport = 0.5, maxPatternLength = 5L, +#' maxLocalProjDBSize = 3200L) +#' frequency <- spark.findFrequentSequentialPatterns(prefix_Span, df) +#' showDF(frequency) +#' } +#' @note spark.prefixSpan since 3.0.0 +setMethod("spark.prefixSpan", signature(), + function(minSupport=0.1, maxPatternLength=10L, + maxLocalProjDBSize=3200L, sequenceCol="sequence") { --- End diff -- nit: I think R style we put space around `=` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r223572694 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/PrefixSpanWrapper.scala --- @@ -0,0 +1,34 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.r + +import org.apache.spark.ml.fpm.PrefixSpan + +private[r] object PrefixSpanWrapper { + def getPrefixSpan( + minSupport: Double, + maxPatternLength: Int, + maxLocalProjDBSize: Double, --- End diff -- no native 64 bit int support in R. double is ok --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22645: [SPARK-25566][SPARK-25567][WEBUI][SQL]Support pagination...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22645 @shahidki31 ^^ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22615: [SPARK-25016][BUILD][CORE] Remove support for Had...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22615#discussion_r223571584 --- Diff: docs/index.md --- @@ -30,9 +30,6 @@ Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark {{s uses Scala {{site.SCALA_BINARY_VERSION}}. You will need to use a compatible Scala version ({{site.SCALA_BINARY_VERSION}}.x). -Note that support for Java 7, Python 2.6 and old Hadoop versions before 2.6.5 were removed as of Spark 2.2.0. -Support for Scala 2.10 was removed as of 2.3.0. --- End diff -- so we are not going to mention supported hadoop version? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22615: [SPARK-25016][BUILD][CORE] Remove support for Had...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22615#discussion_r223571106 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -71,7 +71,7 @@ class HadoopTableReader( // Hadoop honors "mapreduce.job.maps" as hint, // but will ignore when mapreduce.jobtracker.address is "local". - // https://hadoop.apache.org/docs/r2.6.5/hadoop-mapreduce-client/hadoop-mapreduce-client-core/ + // https://hadoop.apache.org/docs/r2.7.6/hadoop-mapreduce-client/hadoop-mapreduce-client-core/ --- End diff -- how come this is 2.7.6 and not 2.7.3 like others? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22615: [SPARK-25016][BUILD][CORE] Remove support for Had...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22615#discussion_r223570946 --- Diff: dev/create-release/release-build.sh --- @@ -191,9 +191,17 @@ if [[ "$1" == "package" ]]; then make_binary_release() { NAME=$1 FLAGS="$MVN_EXTRA_OPTS -B $BASE_RELEASE_PROFILES $2" +# BUILD_PACKAGE can be "withpip", "withr", or both as "withpip,withr" BUILD_PACKAGE=$3 SCALA_VERSION=$4 +if [[ $BUILD_PACKAGE == *"withpip"* ]]; then --- End diff -- one caveat is I'm not sure we have tested building both python and R in "one build". this could be a good thing but if I recall the R build changes some of the binary files under R that gets shipped in the "source release" (these are required R object file) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22615: [SPARK-25016][BUILD][CORE] Remove support for Hadoop 2.6
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22615 I think we all know enough not to make changes (merge changes) to these configs, so this should be safe. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22645: [SPARK-25566][SPARK-25567][WEBUI][SQL]Support pagination...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22645 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22675: [SPARK-25347][ML][DOC] Spark datasource for image/libsvm...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22675 we need this for 2.4? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r223568243 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -154,3 +160,74 @@ setMethod("write.ml", signature(object = "FPGrowthModel", path = "character"), function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + +#' PrefixSpan +#' +#' A parallel PrefixSpan algorithm to mine frequent sequential patterns. +#' \code{spark.prefixSpan} returns an instance of PrefixSpan. +#' \code{spark.findFrequentSequentialPatterns} returns a complete set of frequent sequential +#' patterns. +#' For more details, see +#' \href{https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html#prefixspan}{ +#' PrefixSpan}. +#' +#' @param minSupport Minimal support level. +#' @param maxPatternLength Maximal pattern length. +#' @param maxLocalProjDBSize Maximum number of items (including delimiters used in the internal +#' storage format) allowed in a projected database before local +#' processing. +#' @param sequenceCol name of the sequence column in dataset. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.prefixSpan} returns an instance of PrefixSpan +#' @rdname spark.prefixSpan +#' @name spark.prefixSpan +#' @aliases spark.prefixSpan,ANY-method +#' @examples +#' \dontrun{ +#' df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))), +#' list(list(list(1L), list(3L, 2L), list(1L, 2L))), +#' list(list(list(1L, 2L), list(5L))), +#' list(list(list(6L, schema = c("sequence")) +#' prefix_Span <- spark.prefixSpan(minSupport = 0.5, maxPatternLength = 5L, +#' maxLocalProjDBSize = 3200L) +#' frequency <- spark.findFrequentSequentialPatterns(prefix_Span, df) +#' showDF(frequency) +#' } +#' @note spark.prefixSpan since 3.0.0 +setMethod("spark.prefixSpan", signature(), + function(minSupport=0.1, maxPatternLength=10L, + maxLocalProjDBSize=3200L, sequenceCol="sequence") { +if (!is.numeric(minSupport) || minSupport < 0) { + stop("minSupport should be a number with value >= 0.") +} +if (!is.integer(maxPatternLength) || maxPatternLength <= 0) { + stop("maxPatternLength should be a number with value > 0.") +} +if (!is.numeric(maxLocalProjDBSize) || maxLocalProjDBSize <= 0) { + stop("maxLocalProjDBSize should be a number with value > 0.") +} + +jobj <- callJStatic("org.apache.spark.ml.r.PrefixSpanWrapper", "getPrefixSpan", +as.numeric(minSupport), as.integer(maxPatternLength), +as.numeric(maxLocalProjDBSize), as.character(sequenceCol)) +new("PrefixSpan", jobj = jobj) + }) + +# Find frequent sequential patterns. + +#' @param object a PrefixSpan object. +#' @param data A SparkDataFrame. +#' @return A complete set of frequent sequential patterns in the input sequences of itemsets. +#' The returned \code{SparkDataFrame} contains columns of sequence and corresponding +#' frequency. The schema of it will be: +#' \code{sequence: ArrayType(ArrayType(T))} (T is the item type) +#' \code{freq: Long} +#' @rdname spark.prefixSpan +#' @aliases findFrequentSequentialPatterns,PrefixSpan,SparkDataFrame-method +#' @note spark.findFrequentSequentialPatterns(PrefixSpan, SparkDataFrame) since 3.0.0 + --- End diff -- remove extra line - note this might break doc generation --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r223568592 --- Diff: docs/ml-frequent-pattern-mining.md --- @@ -85,3 +85,56 @@ Refer to the [R API docs](api/R/spark.fpGrowth.html) for more details. + +## PrefixSpan --- End diff -- is this not in the doc for branch-2.4? we might need a quick fix on the branch then before the next RC, I think --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r223568708 --- Diff: examples/src/main/python/ml/prefixspan_example.py --- @@ -0,0 +1,48 @@ +# --- End diff -- ditto for java and scala and python example --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22645: [SPARK-25566][SPARK-25567][WEBUI][SQL]Support pag...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22645#discussion_r223198079 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala --- @@ -121,74 +122,257 @@ private[ui] class AllExecutionsPage(parent: SQLTab) extends WebUIPage("") with L { if (running.nonEmpty) { -Running Queries: +Running Queries: {running.size} } } { if (completed.nonEmpty) { -Completed Queries: +Completed Queries: {completed.size} } } { if (failed.nonEmpty) { -Failed Queries: +Failed Queries: {failed.size} } } + UIUtils.headerSparkPage(request, "SQL", summary ++ content, parent, Some(5000)) } + + private def executionsTable( +request: HttpServletRequest, +executionTag: String, +executionData: Seq[SQLExecutionUIData], +currentTime: Long, +showRunningJobs: Boolean, +showSucceededJobs: Boolean, +showFailedJobs: Boolean): Seq[Node] = { + +// stripXSS is called to remove suspicious characters used in XSS attacks +val allParameters = request.getParameterMap.asScala.toMap.map { case (k, v) => + UIUtils.stripXSS(k) -> v.map(UIUtils.stripXSS).toSeq +} +val parameterOtherTable = allParameters.filterNot(_._1.startsWith(executionTag)) + .map(para => para._1 + "=" + para._2(0)) + +val parameterExecutionPage = UIUtils.stripXSS(request.getParameter(s"$executionTag.page")) +val parameterExecutionSortColumn = UIUtils.stripXSS(request. + getParameter(s"$executionTag.sort")) +val parameterExecutionSortDesc = UIUtils.stripXSS(request.getParameter(s"$executionTag.desc")) +val parameterExecutionPageSize = UIUtils.stripXSS(request. --- End diff -- could you either have `.` last or first of the next line. for example L172 is at the end and L166 is in front - let's do this consistently with other code in the file --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22608: [SPARK-23257][K8S][TESTS] Kerberos Support Integration T...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22608
> calling an external docker-image like: ifilonenko/hadoop-base:latest for now
for now it's probably ok, but is there a solution before the next release? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22275: [SPARK-25274][PYTHON][SQL] In toPandas with Arrow...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22275#discussion_r223197940 --- Diff: python/pyspark/sql/tests.py --- @@ -4434,6 +4434,12 @@ def test_timestamp_dst(self): self.assertPandasEqual(pdf, df_from_python.toPandas()) self.assertPandasEqual(pdf, df_from_pandas.toPandas()) +def test_toPandas_batch_order(self): +df = self.spark.range(64, numPartitions=8).toDF("a") +with self.sql_conf({"spark.sql.execution.arrow.maxRecordsPerBatch": 4}): +pdf, pdf_arrow = self._toPandas_arrow_toggle(df) +self.assertPandasEqual(pdf, pdf_arrow) --- End diff -- that sounds good --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r223197863 --- Diff: R/pkg/R/DataFrame.R --- @@ -246,30 +248,38 @@ setMethod("showDF", #' @note show(SparkDataFrame) since 1.4.0 setMethod("show", "SparkDataFrame", function(object) { -allConf <- sparkR.conf() -if (!is.null(allConf[["spark.sql.repl.eagerEval.enabled"]]) && -identical(allConf[["spark.sql.repl.eagerEval.enabled"]], "true")) { - argsList <- list() - argsList$x <- object - if (!is.null(allConf[["spark.sql.repl.eagerEval.maxNumRows"]])) { -numRows <- as.numeric(allConf[["spark.sql.repl.eagerEval.maxNumRows"]]) -if (numRows > 0) { - argsList$numRows <- numRows +showFunc <- getOption("sparkr.SparkDataFrame.base_show_func") --- End diff -- IMO pretty print should plug in to something more R standard like [printr](https://yihui.name/printr/) or [lemon](https://cran.r-project.org/web/packages/lemon/vignettes/lemon_print.html) or [print.x](https://stat.ethz.ch/R-manual/R-devel/library/base/html/print.dataframe.html) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r223197875 --- Diff: R/pkg/R/DataFrame.R --- @@ -246,30 +248,38 @@ setMethod("showDF", #' @note show(SparkDataFrame) since 1.4.0 setMethod("show", "SparkDataFrame", function(object) { -allConf <- sparkR.conf() -if (!is.null(allConf[["spark.sql.repl.eagerEval.enabled"]]) && -identical(allConf[["spark.sql.repl.eagerEval.enabled"]], "true")) { - argsList <- list() - argsList$x <- object - if (!is.null(allConf[["spark.sql.repl.eagerEval.maxNumRows"]])) { -numRows <- as.numeric(allConf[["spark.sql.repl.eagerEval.maxNumRows"]]) -if (numRows > 0) { - argsList$numRows <- numRows +showFunc <- getOption("sparkr.SparkDataFrame.base_show_func") --- End diff -- could we consider leaving print/show option out? I'd like to get eager compute to work even in basic sparkR / R shell --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
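In that spirit, a minimal sketch (assuming only `sparkR.conf()` and `showDF`, no print hooks) of a `show()` that keys off the SQL confs alone, so eager printing also works in a plain R or sparkR shell:
```
setMethod("show", "SparkDataFrame",
          function(object) {
            conf <- sparkR.conf()
            if (identical(conf[["spark.sql.repl.eagerEval.enabled"]], "true")) {
              argsList <- list(x = object)
              numRows <- conf[["spark.sql.repl.eagerEval.maxNumRows"]]
              if (!is.null(numRows) && as.numeric(numRows) > 0) {
                argsList$numRows <- as.numeric(numRows)
              }
              do.call(showDF, argsList)
            } else {
              # fall back to the existing one-line summary,
              # e.g. "SparkDataFrame[eruptions:double, waiting:double]"
              cols <- lapply(dtypes(object), function(l) paste(l, collapse = ":"))
              cat(paste0(class(object), "[", paste(cols, collapse = ", "), "]\n"))
            }
          })
```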
[GitHub] spark issue #22145: [SPARK-25152][K8S] Enable SparkR Integration Tests for K...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22145 @shaneknapp could we do this soon? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22645: [SPARK-25566][SPARK-25567][WEBUI][SQL]Support pag...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22645#discussion_r223174258 --- Diff: core/src/main/scala/org/apache/spark/ui/PagedTable.scala --- @@ -31,7 +31,7 @@ import org.apache.spark.util.Utils * * @param pageSize the number of rows in a page */ -private[ui] abstract class PagedDataSource[T](val pageSize: Int) { +private[spark] abstract class PagedDataSource[T](val pageSize: Int) { --- End diff -- is there a way to do this without opening this all to the spark level? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22639: [SPARK-25647][k8s] Add spark streaming compatibility sui...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22639 @liyinan926 @mccheah --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22625: [SPARK-25637][CORE] SparkException: Could not fin...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22625#discussion_r222550195 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala --- @@ -98,6 +98,8 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp private val reviveThread = ThreadUtils.newDaemonSingleThreadScheduledExecutor("driver-revive-thread") + private var driverEndpointStopped = false --- End diff -- shouldn't this be inside class DriverEndpoint? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22615: [SPARK-25016][BUILD][CORE] Remove support for Had...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22615#discussion_r222549104 --- Diff: dev/appveyor-install-dependencies.ps1 --- @@ -95,7 +95,7 @@ $env:MAVEN_OPTS = "-Xmx2g -XX:ReservedCodeCacheSize=512m" Pop-Location # == Hadoop bin package -$hadoopVer = "2.6.4" +$hadoopVer = "2.7.3" --- End diff -- this could be easy to forget - the next time we need to update this, we might be searching and replacing "2.7.3" if this is 2.7.1 then it might be missed... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21710 actually, as of 2 hrs ago it seems like we are skipping 2.5 and going straight to 3.0. we could wait a bit until that is merged --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] zeppelin issue #3200: ZEPPELIN-3804. Separate log file for each interpreter ...
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3200 how do we find out which log file belongs to which interpreter? ---
[GitHub] zeppelin issue #3132: [ZEPPELIN-3701].Missing first several '0' and losing d...
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3132 this should be in 0.9. as for release date plan, please ask on dev@zeppelin.apache.org ---
[GitHub] spark issue #20405: [SPARK-23229][SQL] Dataset.hint should use planWithBarri...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20405 probably. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22527: [SPARK-17952][SQL] Nested Java beans support in createDa...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22527 Jenkins, ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] zeppelin issue #3197: docs: fix punctuation
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3197 can you also check the other .md files? it's probably best to have one PR for all of them ---
[GitHub] spark issue #21669: [SPARK-23257][K8S] Kerberos Support for Spark on K8S
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21669 I like it, but we could also support cluster mode first and add client mode after. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-25572][SPARKR] test only if not cran
Repository: spark Updated Branches: refs/heads/master f246813af -> f4b138082 [SPARK-25572][SPARKR] test only if not cran ## What changes were proposed in this pull request? CRAN doesn't seem to respect the system requirements as running tests - we have seen cases where SparkR is run on Java 10, which unfortunately Spark does not start on. For 2.4, lets attempt skipping all tests ## How was this patch tested? manual, jenkins, appveyor Author: Felix Cheung Closes #22589 from felixcheung/ralltests. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f4b13808 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f4b13808 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f4b13808 Branch: refs/heads/master Commit: f4b138082ff91be74b0f5bbe19cdb90dd9e5f131 Parents: f246813 Author: Felix Cheung Authored: Sat Sep 29 14:48:32 2018 -0700 Committer: Felix Cheung Committed: Sat Sep 29 14:48:32 2018 -0700 -- R/pkg/tests/run-all.R | 83 -- 1 file changed, 44 insertions(+), 39 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f4b13808/R/pkg/tests/run-all.R -- diff --git a/R/pkg/tests/run-all.R b/R/pkg/tests/run-all.R index 94d7518..1e96418 100644 --- a/R/pkg/tests/run-all.R +++ b/R/pkg/tests/run-all.R @@ -18,50 +18,55 @@ library(testthat) library(SparkR) -# Turn all warnings into errors -options("warn" = 2) +# SPARK-25572 +if (identical(Sys.getenv("NOT_CRAN"), "true")) { -if (.Platform$OS.type == "windows") { - Sys.setenv(TZ = "GMT") -} + # Turn all warnings into errors + options("warn" = 2) -# Setup global test environment -# Install Spark first to set SPARK_HOME + if (.Platform$OS.type == "windows") { +Sys.setenv(TZ = "GMT") + } -# NOTE(shivaram): We set overwrite to handle any old tar.gz files or directories left behind on -# CRAN machines. For Jenkins we should already have SPARK_HOME set. -install.spark(overwrite = TRUE) + # Setup global test environment + # Install Spark first to set SPARK_HOME -sparkRDir <- file.path(Sys.getenv("SPARK_HOME"), "R") -sparkRWhitelistSQLDirs <- c("spark-warehouse", "metastore_db") -invisible(lapply(sparkRWhitelistSQLDirs, - function(x) { unlink(file.path(sparkRDir, x), recursive = TRUE, force = TRUE)})) -sparkRFilesBefore <- list.files(path = sparkRDir, all.files = TRUE) + # NOTE(shivaram): We set overwrite to handle any old tar.gz files or directories left behind on + # CRAN machines. For Jenkins we should already have SPARK_HOME set. 
+ install.spark(overwrite = TRUE) -sparkRTestMaster <- "local[1]" -sparkRTestConfig <- list() -if (identical(Sys.getenv("NOT_CRAN"), "true")) { - sparkRTestMaster <- "" -} else { - # Disable hsperfdata on CRAN - old_java_opt <- Sys.getenv("_JAVA_OPTIONS") - Sys.setenv("_JAVA_OPTIONS" = paste("-XX:-UsePerfData", old_java_opt)) - tmpDir <- tempdir() - tmpArg <- paste0("-Djava.io.tmpdir=", tmpDir) - sparkRTestConfig <- list(spark.driver.extraJavaOptions = tmpArg, - spark.executor.extraJavaOptions = tmpArg) -} + sparkRDir <- file.path(Sys.getenv("SPARK_HOME"), "R") + sparkRWhitelistSQLDirs <- c("spark-warehouse", "metastore_db") + invisible(lapply(sparkRWhitelistSQLDirs, + function(x) { unlink(file.path(sparkRDir, x), recursive = TRUE, force = TRUE)})) + sparkRFilesBefore <- list.files(path = sparkRDir, all.files = TRUE) -test_package("SparkR") + sparkRTestMaster <- "local[1]" + sparkRTestConfig <- list() + if (identical(Sys.getenv("NOT_CRAN"), "true")) { +sparkRTestMaster <- "" + } else { +# Disable hsperfdata on CRAN +old_java_opt <- Sys.getenv("_JAVA_OPTIONS") +Sys.setenv("_JAVA_OPTIONS" = paste("-XX:-UsePerfData", old_java_opt)) +tmpDir <- tempdir() +tmpArg <- paste0("-Djava.io.tmpdir=", tmpDir) +sparkRTestConfig <- list(spark.driver.extraJavaOptions = tmpArg, + spark.executor.extraJavaOptions = tmpArg) + } -if (identical(Sys.getenv("NOT_CRAN"), "true")) { - # set random seed for predictable results. mostly for base's sample() in tree and classification - set.seed(42) - # for testthat 1.0.2 later, change reporter from "summary" to default_reporter() - testthat:::run_tests("SparkR", -
[GitHub] spark issue #22589: [SPARK-25572][SPARKR] test only if not cran
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22589 merged to master/2.4 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-25572][SPARKR] test only if not cran
Repository: spark Updated Branches: refs/heads/branch-2.4 a14306b1d -> fef30271b [SPARK-25572][SPARKR] test only if not cran ## What changes were proposed in this pull request? CRAN doesn't seem to respect the system requirements as running tests - we have seen cases where SparkR is run on Java 10, which unfortunately Spark does not start on. For 2.4, lets attempt skipping all tests ## How was this patch tested? manual, jenkins, appveyor Author: Felix Cheung Closes #22589 from felixcheung/ralltests. (cherry picked from commit f4b138082ff91be74b0f5bbe19cdb90dd9e5f131) Signed-off-by: Felix Cheung Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fef30271 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fef30271 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fef30271 Branch: refs/heads/branch-2.4 Commit: fef30271bc4a77e7f94b240cd8d7fef71f47935e Parents: a14306b Author: Felix Cheung Authored: Sat Sep 29 14:48:32 2018 -0700 Committer: Felix Cheung Committed: Sat Sep 29 14:48:51 2018 -0700 -- R/pkg/tests/run-all.R | 83 -- 1 file changed, 44 insertions(+), 39 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fef30271/R/pkg/tests/run-all.R -- diff --git a/R/pkg/tests/run-all.R b/R/pkg/tests/run-all.R index 94d7518..1e96418 100644 --- a/R/pkg/tests/run-all.R +++ b/R/pkg/tests/run-all.R @@ -18,50 +18,55 @@ library(testthat) library(SparkR) -# Turn all warnings into errors -options("warn" = 2) +# SPARK-25572 +if (identical(Sys.getenv("NOT_CRAN"), "true")) { -if (.Platform$OS.type == "windows") { - Sys.setenv(TZ = "GMT") -} + # Turn all warnings into errors + options("warn" = 2) -# Setup global test environment -# Install Spark first to set SPARK_HOME + if (.Platform$OS.type == "windows") { +Sys.setenv(TZ = "GMT") + } -# NOTE(shivaram): We set overwrite to handle any old tar.gz files or directories left behind on -# CRAN machines. For Jenkins we should already have SPARK_HOME set. -install.spark(overwrite = TRUE) + # Setup global test environment + # Install Spark first to set SPARK_HOME -sparkRDir <- file.path(Sys.getenv("SPARK_HOME"), "R") -sparkRWhitelistSQLDirs <- c("spark-warehouse", "metastore_db") -invisible(lapply(sparkRWhitelistSQLDirs, - function(x) { unlink(file.path(sparkRDir, x), recursive = TRUE, force = TRUE)})) -sparkRFilesBefore <- list.files(path = sparkRDir, all.files = TRUE) + # NOTE(shivaram): We set overwrite to handle any old tar.gz files or directories left behind on + # CRAN machines. For Jenkins we should already have SPARK_HOME set. 
+ install.spark(overwrite = TRUE) -sparkRTestMaster <- "local[1]" -sparkRTestConfig <- list() -if (identical(Sys.getenv("NOT_CRAN"), "true")) { - sparkRTestMaster <- "" -} else { - # Disable hsperfdata on CRAN - old_java_opt <- Sys.getenv("_JAVA_OPTIONS") - Sys.setenv("_JAVA_OPTIONS" = paste("-XX:-UsePerfData", old_java_opt)) - tmpDir <- tempdir() - tmpArg <- paste0("-Djava.io.tmpdir=", tmpDir) - sparkRTestConfig <- list(spark.driver.extraJavaOptions = tmpArg, - spark.executor.extraJavaOptions = tmpArg) -} + sparkRDir <- file.path(Sys.getenv("SPARK_HOME"), "R") + sparkRWhitelistSQLDirs <- c("spark-warehouse", "metastore_db") + invisible(lapply(sparkRWhitelistSQLDirs, + function(x) { unlink(file.path(sparkRDir, x), recursive = TRUE, force = TRUE)})) + sparkRFilesBefore <- list.files(path = sparkRDir, all.files = TRUE) -test_package("SparkR") + sparkRTestMaster <- "local[1]" + sparkRTestConfig <- list() + if (identical(Sys.getenv("NOT_CRAN"), "true")) { +sparkRTestMaster <- "" + } else { +# Disable hsperfdata on CRAN +old_java_opt <- Sys.getenv("_JAVA_OPTIONS") +Sys.setenv("_JAVA_OPTIONS" = paste("-XX:-UsePerfData", old_java_opt)) +tmpDir <- tempdir() +tmpArg <- paste0("-Djava.io.tmpdir=", tmpDir) +sparkRTestConfig <- list(spark.driver.extraJavaOptions = tmpArg, + spark.executor.extraJavaOptions = tmpArg) + } -if (identical(Sys.getenv("NOT_CRAN"), "true")) { - # set random seed for predictable results. mostly for base's sample() in tree and classification - set.seed(42) - # for testthat 1.0.2 later, change reporter f
[GitHub] spark pull request #22589: [SPARK-25572][SPARKR] test only if not cran
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22589#discussion_r221440760 --- Diff: R/pkg/tests/run-all.R --- @@ -18,50 +18,55 @@ library(testthat) library(SparkR) -# Turn all warnings into errors -options("warn" = 2) +# SPARK-25572 +if (identical(Sys.getenv("NOT_CRAN"), "true")) { -if (.Platform$OS.type == "windows") { - Sys.setenv(TZ = "GMT") -} + # Turn all warnings into errors + options("warn" = 2) -# Setup global test environment -# Install Spark first to set SPARK_HOME + if (.Platform$OS.type == "windows") { +Sys.setenv(TZ = "GMT") + } -# NOTE(shivaram): We set overwrite to handle any old tar.gz files or directories left behind on -# CRAN machines. For Jenkins we should already have SPARK_HOME set. -install.spark(overwrite = TRUE) + # Setup global test environment + # Install Spark first to set SPARK_HOME -sparkRDir <- file.path(Sys.getenv("SPARK_HOME"), "R") -sparkRWhitelistSQLDirs <- c("spark-warehouse", "metastore_db") -invisible(lapply(sparkRWhitelistSQLDirs, - function(x) { unlink(file.path(sparkRDir, x), recursive = TRUE, force = TRUE)})) -sparkRFilesBefore <- list.files(path = sparkRDir, all.files = TRUE) + # NOTE(shivaram): We set overwrite to handle any old tar.gz files or directories left behind on + # CRAN machines. For Jenkins we should already have SPARK_HOME set. + install.spark(overwrite = TRUE) -sparkRTestMaster <- "local[1]" -sparkRTestConfig <- list() -if (identical(Sys.getenv("NOT_CRAN"), "true")) { - sparkRTestMaster <- "" -} else { - # Disable hsperfdata on CRAN - old_java_opt <- Sys.getenv("_JAVA_OPTIONS") - Sys.setenv("_JAVA_OPTIONS" = paste("-XX:-UsePerfData", old_java_opt)) - tmpDir <- tempdir() - tmpArg <- paste0("-Djava.io.tmpdir=", tmpDir) - sparkRTestConfig <- list(spark.driver.extraJavaOptions = tmpArg, - spark.executor.extraJavaOptions = tmpArg) -} + sparkRDir <- file.path(Sys.getenv("SPARK_HOME"), "R") + sparkRWhitelistSQLDirs <- c("spark-warehouse", "metastore_db") + invisible(lapply(sparkRWhitelistSQLDirs, + function(x) { unlink(file.path(sparkRDir, x), recursive = TRUE, force = TRUE)})) + sparkRFilesBefore <- list.files(path = sparkRDir, all.files = TRUE) -test_package("SparkR") + sparkRTestMaster <- "local[1]" + sparkRTestConfig <- list() + if (identical(Sys.getenv("NOT_CRAN"), "true")) { +sparkRTestMaster <- "" + } else { +# Disable hsperfdata on CRAN +old_java_opt <- Sys.getenv("_JAVA_OPTIONS") +Sys.setenv("_JAVA_OPTIONS" = paste("-XX:-UsePerfData", old_java_opt)) +tmpDir <- tempdir() +tmpArg <- paste0("-Djava.io.tmpdir=", tmpDir) +sparkRTestConfig <- list(spark.driver.extraJavaOptions = tmpArg, + spark.executor.extraJavaOptions = tmpArg) + } -if (identical(Sys.getenv("NOT_CRAN"), "true")) { - # set random seed for predictable results. mostly for base's sample() in tree and classification - set.seed(42) - # for testthat 1.0.2 later, change reporter from "summary" to default_reporter() - testthat:::run_tests("SparkR", - file.path(sparkRDir, "pkg", "tests", "fulltests"), - NULL, - "summary") -} + test_package("SparkR") + + if (identical(Sys.getenv("NOT_CRAN"), "true")) { --- End diff -- We are trying this now - we can clean it up if this works --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22589: [SPARK-25572][SPARKR] test only if not cran
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22589 We are trying this now - we can clean it up if this works --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21710 @huaxingao do you want to bring this up to 2.5.0? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22589: [SPARK-25572][SPARKR] test only if not cran
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22589 Jenkins, retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22588: [SPARK-25262][DOC][FOLLOWUP] Fix link tags in html table
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22588 @ifilonenko @liyinan926 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22589: [SPARK-25572][SPARKR] test only if not cran
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22589 Jenkins, retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22589: [SPARK-25572][SPARKR] test only if not cran
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22589 really odd error, not sure if it's related to an earlier change? @HyukjinKwon what do you think?
```
. Error: Test correct concurrency of RRDD.compute() (@test_rdd.R#800) -
org.apache.spark.SparkException: Job aborted due to stage failure: Task 87 in stage 201.0 failed 1 times, most recent failure:
Lost task 87.0 in stage 201.0 (TID 439, localhost, executor driver): java.lang.AssertionError: assertion failed
  at scala.Predef$.assert(Predef.scala:156)
  at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:120)
  at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:127)
  at org.apache.spark.api.r.RAuthHelper.readUtf8(RAuthHelper.scala:29)
  at org.apache.spark.security.SocketAuthHelper.authClient(SocketAuthHelper.scala:57)
  at org.apache.spark.api.r.RRunner.compute(RRunner.scala:79)
  at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:54)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:325)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:289)
  at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:325)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:289)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:121)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:419)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:425)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22589: [SPARK-25572][SPARKR] test only if not cran
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22589 Jenkins, retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22589: [SPARK-25572][SPARKR] test only if not cran
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22589 diff w/o whitespace for actual change --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22589: [SPARK-25572][SPARKR] test only if not cran
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/22589 [SPARK-25572][SPARKR] test only if not cran ## What changes were proposed in this pull request? CRAN doesn't seem to respect the system requirements as running tests - we have seen cases where SparkR is run on Java 10, which unfortunately Spark does not start on. For 2.4, lets attempt skipping all tests ## How was this patch tested? manual, jenkins, appveyor You can merge this pull request into a Git repository by running: $ git pull https://github.com/felixcheung/spark ralltests Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22589.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22589 commit 3b7414d9adf55ef74ce2d81403e570e5d6951a05 Author: Felix Cheung Date: 2018-09-29T05:10:23Z test only if not cran --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22379#discussion_r221415711 --- Diff: R/pkg/R/functions.R --- @@ -2203,6 +2209,23 @@ setMethod("from_json", signature(x = "Column", schema = "characterOrstructType") column(jc) }) +#' @details +#' \code{from_csv}: Parses a column containing a CSV string into a Column of \code{structType} +#' with the specified \code{schema}. +#' If the string is unparseable, the Column will contain the value NA. +#' +#' @rdname column_collection_functions +#' @aliases from_csv from_csv,Column,character-method +#' @note from_csv since 2.5.0 --- End diff -- consider adding example as in https://github.com/apache/spark/pull/22379/files#diff-d97f9adc2dcac0703568c799ff106987R2180? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
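For reference, a rough sketch of the kind of roxygen example being suggested for `from_csv` (the column name, sample data, and schema below are made up for illustration, and the exact `from_csv` signature is whatever this PR finally settles on):
```r
# Hypothetical example for the column_collection_functions docs:
# parse a string column containing CSV text into a struct column.
csvDF <- sql("SELECT '1,Felix' AS raw")
schema <- structType(structField("id", "integer"), structField("name", "string"))
head(select(csvDF, from_csv(csvDF$raw, schema)))
```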
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r221416198 --- Diff: R/pkg/tests/fulltests/test_eager_execution.R --- @@ -21,11 +21,7 @@ context("Show SparkDataFrame when eager execution is enabled.") test_that("eager execution is not enabled", { # Start Spark session without eager execution enabled - sparkSession <- if (windows_with_hadoop()) { -sparkR.session(master = sparkRTestMaster) - } else { -sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE) - } + sparkR.session(master = sparkRTestMaster) --- End diff -- as mentioned here https://github.com/apache/spark/pull/22455#discussion_r220030686 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r221416164 --- Diff: R/pkg/tests/fulltests/test_eager_execution.R --- @@ -21,11 +21,7 @@ context("Show SparkDataFrame when eager execution is enabled.") test_that("eager execution is not enabled", { # Start Spark session without eager execution enabled - sparkSession <- if (windows_with_hadoop()) { -sparkR.session(master = sparkRTestMaster) - } else { -sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE) - } + sparkR.session(master = sparkRTestMaster) --- End diff -- you should definitely set `enableHiveSupport = FALSE` - historically this hasn't worked well in other R tests when the Hive catalog is enabled --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
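In other words, the setup in this test file could be simplified to something like the following (a sketch; `sparkRTestMaster` comes from the shared test harness):
```r
# Start the test session without the Hive catalog on every platform;
# only the SQL test suite needs Hive support.
sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE)
```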
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r221415826 --- Diff: R/pkg/R/DataFrame.R --- @@ -246,30 +248,38 @@ setMethod("showDF", #' @note show(SparkDataFrame) since 1.4.0 setMethod("show", "SparkDataFrame", function(object) { -allConf <- sparkR.conf() -if (!is.null(allConf[["spark.sql.repl.eagerEval.enabled"]]) && -identical(allConf[["spark.sql.repl.eagerEval.enabled"]], "true")) { - argsList <- list() - argsList$x <- object - if (!is.null(allConf[["spark.sql.repl.eagerEval.maxNumRows"]])) { -numRows <- as.numeric(allConf[["spark.sql.repl.eagerEval.maxNumRows"]]) -if (numRows > 0) { - argsList$numRows <- numRows +showFunc <- getOption("sparkr.SparkDataFrame.base_show_func") --- End diff -- hmm, this naming convention? typically we don't mix `.` and `_` and I don't think we have anything with `SparkDataFrame` seems like we have `spark.sparkr.something` before https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L343 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] zeppelin issue #3185: Copy-editing overview.md
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3185 this was merged? is github not syncing? ---
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r220034035 --- Diff: R/pkg/R/DataFrame.R --- @@ -244,11 +246,25 @@ setMethod("showDF", #' @note show(SparkDataFrame) since 1.4.0 setMethod("show", "SparkDataFrame", function(object) { -cols <- lapply(dtypes(object), function(l) { - paste(l, collapse = ":") -}) -s <- paste(cols, collapse = ", ") -cat(paste(class(object), "[", s, "]\n", sep = "")) +if (identical(sparkR.conf("spark.sql.repl.eagerEval.enabled", "false")[[1]], "true")) { + argsList <- list() + argsList$x <- object + numRows <- as.numeric(sparkR.conf("spark.sql.repl.eagerEval.maxNumRows", "0")[[1]]) + if (numRows > 0) { +argsList$numRows <- numRows + } + truncate <- as.numeric(sparkR.conf("spark.sql.repl.eagerEval.truncate", "0")[[1]]) --- End diff -- just asking since this variation was not in unit test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r220033577 --- Diff: R/pkg/R/DataFrame.R --- @@ -244,11 +246,31 @@ setMethod("showDF", #' @note show(SparkDataFrame) since 1.4.0 setMethod("show", "SparkDataFrame", function(object) { -cols <- lapply(dtypes(object), function(l) { - paste(l, collapse = ":") -}) -s <- paste(cols, collapse = ", ") -cat(paste(class(object), "[", s, "]\n", sep = "")) +allConf <- sparkR.conf() +if (!is.null(allConf[["spark.sql.repl.eagerEval.enabled"]]) && --- End diff -- you can also save off `foo <- allConf[["spark.sql.repl.eagerEval.enabled"]]` and reuse it, ditto for other cases --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r220033775 --- Diff: docs/sparkr.md --- @@ -450,6 +450,48 @@ print(model.summaries) {% endhighlight %} +### Eager execution + +If eager execution is enabled, the data will be returned to R client immediately when the `SparkDataFrame` is created. By default, eager execution is not enabled and can be enabled by setting the configuration property `spark.sql.repl.eagerEval.enabled` to `true` when the `SparkSession` is started up. + +Maximum number of rows and maximum number of characters per column of data to display can be controlled by `spark.sql.repl.eagerEval.maxNumRows` and `spark.sql.repl.eagerEval.truncate` configuration properties, respectively. These properties are only effective when eager execution is enabled. If these properties are not set explicitly, by default, data up to 20 rows and up to 20 characters per column will be showed. + + +{% highlight r %} + +# Start up spark session with eager execution enabled +sparkR.session(master = "local[*]", + sparkConfig = list(spark.sql.repl.eagerEval.enabled = "true", + spark.sql.repl.eagerEval.maxNumRows = as.integer(10))) + +# Create a grouped and sorted SparkDataFrame +df <- createDataFrame(faithful) +df2 <- arrange(summarize(groupBy(df, df$waiting), count = n(df$waiting)), "waiting") + +# Similar to R data.frame, displays the data returned, instead of SparkDataFrame class string +df2 + +##+---+-+ +##|waiting|count| +##+---+-+ +##| 43.0|1| +##| 45.0|3| +##| 46.0|5| +##| 47.0|4| +##| 48.0|3| +##| 49.0|5| +##| 50.0|5| +##| 51.0|6| +##| 52.0|5| +##| 53.0|7| +##+---+-+ +##only showing top 10 rows + +{% endhighlight %} + + +Note that to enable eager execution through `sparkR` command, add `spark.sql.repl.eagerEval.enabled=true` configuration property to the `--conf` option. --- End diff -- `sparkR command` - I think should be `sparkR shell`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r220033684 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -1491,9 +1491,10 @@ object SQLConf { val REPL_EAGER_EVAL_ENABLED = buildConf("spark.sql.repl.eagerEval.enabled") .doc("Enables eager evaluation or not. When true, the top K rows of Dataset will be " + "displayed if and only if the REPL supports the eager evaluation. Currently, the " + - "eager evaluation is only supported in PySpark. For the notebooks like Jupyter, " + - "the HTML table (generated by _repr_html_) will be returned. For plain Python REPL, " + - "the returned outputs are formatted like dataframe.show().") + "eager evaluation is supported in PySpark and SparkR. In PySpark, for the notebooks like " + + "Jupyter, the HTML table (generated by _repr_html_) will be returned. For plain Python " + + "REPL, the returned outputs are formatted like dataframe.show(). In SparkR, the returned " + + "outputs are showed as R data.frame.") --- End diff -- ` are showed as R data.frame` - it's not actually the exact format, so how about we say ` are showed similar to R data.frame would` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r220030686 --- Diff: R/pkg/tests/fulltests/test_eager_execution.R --- @@ -0,0 +1,60 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +library(testthat) + +context("Show SparkDataFrame when eager execution is enabled.") + +test_that("eager execution is not enabled", { + # Start Spark session without eager execution enabled + sparkSession <- if (windows_with_hadoop()) { +sparkR.session(master = sparkRTestMaster) + } else { +sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE) --- End diff -- actually you can use `, enableHiveSupport = FALSE` for this file in all cases - we only need the opposite for SQL tests --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22537: [SPARK-21291][R] add R partitionBy API in DataFrame
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22537 could you kick off appveyor again by closing/re-opening this PR? also we might need to split the JIRA - since it's referring to both partition and bucket --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
zeppelin git commit: Copy-editing overview.md
Repository: zeppelin Updated Branches: refs/heads/master 86ee813e4 -> 6c623d323 Copy-editing overview.md Extensive grammar revisions for zeppelin/docs/usage/interpreter/overview.md ### What is this PR for? The documentation at zeppelin/docs/usage/interpreter/overview.md contained numerous instances of bad grammar. This PR improves it. ### What type of PR is it? Documentation Author: Gerard de Melo Author: User Closes #3185 from gdemelo/patch-1 and squashes the following commits: b16ddf2 [User] Minor changes to wording 8b7773b [Gerard de Melo] Copy-editing overview.md Project: http://git-wip-us.apache.org/repos/asf/zeppelin/repo Commit: http://git-wip-us.apache.org/repos/asf/zeppelin/commit/6c623d32 Tree: http://git-wip-us.apache.org/repos/asf/zeppelin/tree/6c623d32 Diff: http://git-wip-us.apache.org/repos/asf/zeppelin/diff/6c623d32 Branch: refs/heads/master Commit: 6c623d323fa04350d89b076cd2b3a760433370fc Parents: 86ee813 Author: Gerard de Melo Authored: Sun Sep 23 20:37:51 2018 -0400 Committer: Felix Cheung Committed: Mon Sep 24 17:59:52 2018 -0700 -- docs/usage/interpreter/overview.md | 80 - 1 file changed, 39 insertions(+), 41 deletions(-) -- http://git-wip-us.apache.org/repos/asf/zeppelin/blob/6c623d32/docs/usage/interpreter/overview.md -- diff --git a/docs/usage/interpreter/overview.md b/docs/usage/interpreter/overview.md index 5b567c7..6580a59 100644 --- a/docs/usage/interpreter/overview.md +++ b/docs/usage/interpreter/overview.md @@ -1,7 +1,7 @@ --- layout: page title: "Interpreter in Apache Zeppelin" -description: "This document explains about the role of interpreters, interpreters group and interpreter settings in Apache Zeppelin. The concept of Zeppelin interpreter allows any language/data-processing-backend to be plugged into Zeppelin." +description: "This document explains the role of interpreters, interpreter groups and interpreter settings in Apache Zeppelin. The concept of Zeppelin interpreters allows any language or data-processing backend to be plugged into Zeppelin." group: usage/interpreter ---
[GitHub] spark issue #22455: [SPARK-24572][SPARKR] "eager execution" for R shell, IDE
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22455 > Looks like the test was terminated in the middle, not likely related to the any code change. Could someone please ask for retest? Thanks. appveyor one? you need to close and re-open this PR to trigger it --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] zeppelin issue #3185: Copy-editing overview.md
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3185 LGTM, merging if there are no more comments ---
[GitHub] zeppelin issue #3185: Copy-editing overview.md
Github user felixcheung commented on the issue: https://github.com/apache/zeppelin/pull/3185 great, thx! ---
[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/7#discussion_r219714851 --- Diff: R/pkg/R/functions.R --- @@ -3404,19 +3404,27 @@ setMethod("collect_set", #' Equivalent to \code{split} SQL function. #' #' @rdname column_string_functions +#' @param limit determines the length of the returned array. +#' \itemize{ +#' \item \code{limit > 0}: length of the array will be at most \code{limit} +#' \item \code{limit <= 0}: the returned array can have any length +#' } +#' #' @aliases split_string split_string,Column-method #' @examples #' #' \dontrun{ #' head(select(df, split_string(df$Sex, "a"))) #' head(select(df, split_string(df$Class, "\\d"))) +#' head(select(df, split_string(df$Class, "\\d", 2))) #' # This is equivalent to the following SQL expression #' head(selectExpr(df, "split(Class, 'd')"))} --- End diff -- good point - also the example should run in the order documented. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
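For illustration, the documented examples could be kept runnable in order along these lines (a sketch assuming the shared example data frame built from the Titanic dataset, as used elsewhere in the column_string_functions examples):
```r
# Shared example data frame with string columns Sex and Class.
df <- createDataFrame(as.data.frame(Titanic))
head(select(df, split_string(df$Sex, "a")))
head(select(df, split_string(df$Class, "\\d")))
# With limit = 2 the returned array has at most two elements.
head(select(df, split_string(df$Class, "\\d", 2)))
```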
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219689581 --- Diff: R/pkg/tests/fulltests/test_eager_execution.R --- @@ -0,0 +1,61 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +library(testthat) + +context("Show SparkDataFrame when eager execution is enabled.") + +test_that("eager execution is not enabled", { + # Start Spark session without eager execution enabled + sparkSession <- if (windows_with_hadoop()) { +sparkR.session(master = sparkRTestMaster) + } else { +sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE) + } + + df <- createDataFrame(faithful) + expect_is(df, "SparkDataFrame") + expected <- "eruptions:double, waiting:double" + expect_output(show(df), expected) + + # Stop Spark session + sparkR.session.stop() +}) + +test_that("eager execution is enabled", { + # Start Spark session with eager execution enabled + sparkSession <- if (windows_with_hadoop()) { +sparkR.session(master = sparkRTestMaster, + sparkConfig = list(spark.sql.repl.eagerEval.enabled = "true", + spark.sql.repl.eagerEval.maxNumRows = as.integer(10))) + } else { +sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, + sparkConfig = list(spark.sql.repl.eagerEval.enabled = "true", + spark.sql.repl.eagerEval.maxNumRows = as.integer(10))) --- End diff -- I'd set sparkConfig to a list before this and simply reuse it for both cases instead of having them duplicated --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
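That is, roughly (a sketch of the suggested de-duplication):
```r
# Define the eager-execution config once and reuse it in both branches.
sparkConfig <- list(spark.sql.repl.eagerEval.enabled = "true",
                    spark.sql.repl.eagerEval.maxNumRows = as.integer(10))
if (windows_with_hadoop()) {
  sparkR.session(master = sparkRTestMaster, sparkConfig = sparkConfig)
} else {
  sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE,
                 sparkConfig = sparkConfig)
}
```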
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219689697 --- Diff: R/pkg/R/DataFrame.R --- @@ -244,11 +246,25 @@ setMethod("showDF", #' @note show(SparkDataFrame) since 1.4.0 setMethod("show", "SparkDataFrame", function(object) { -cols <- lapply(dtypes(object), function(l) { - paste(l, collapse = ":") -}) -s <- paste(cols, collapse = ", ") -cat(paste(class(object), "[", s, "]\n", sep = "")) +if (identical(sparkR.conf("spark.sql.repl.eagerEval.enabled", "false")[[1]], "true")) { + argsList <- list() + argsList$x <- object + numRows <- as.numeric(sparkR.conf("spark.sql.repl.eagerEval.maxNumRows", "0")[[1]]) + if (numRows > 0) { +argsList$numRows <- numRows + } + truncate <- as.numeric(sparkR.conf("spark.sql.repl.eagerEval.truncate", "0")[[1]]) --- End diff -- have you tested with the case when only `spark.sql.repl.eagerEval.truncate` is set, and `spark.sql.repl.eagerEval.maxNumRows` is not? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
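A sketch of the extra case being asked about — only `spark.sql.repl.eagerEval.truncate` set, with `maxNumRows` left unset (the assertion is illustrative; column headers may be truncated in the output):
```r
# Hypothetical test: eager execution with truncate configured but maxNumRows unset,
# so show() should fall back to a default row count without erroring.
sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE,
               sparkConfig = list(spark.sql.repl.eagerEval.enabled = "true",
                                  spark.sql.repl.eagerEval.truncate = as.integer(3)))
df <- createDataFrame(faithful)
expect_output(show(df), "eru")
sparkR.session.stop()
```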
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219689694 --- Diff: R/pkg/R/DataFrame.R --- @@ -244,11 +246,25 @@ setMethod("showDF", #' @note show(SparkDataFrame) since 1.4.0 setMethod("show", "SparkDataFrame", function(object) { -cols <- lapply(dtypes(object), function(l) { - paste(l, collapse = ":") -}) -s <- paste(cols, collapse = ", ") -cat(paste(class(object), "[", s, "]\n", sep = "")) +if (identical(sparkR.conf("spark.sql.repl.eagerEval.enabled", "false")[[1]], "true")) { --- End diff -- instead of calling sparkR.conf three times, I'm wondering if it cleaner/safer to get a copy like in here: https://github.com/apache/spark/blob/95b177c8f0862c6965a7c3cd76b3935c975adee9/R/pkg/tests/fulltests/test_sparkSQL.R#L3538 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
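A rough sketch of the single-copy pattern, folding in the save-off suggestion from the comment above (`object` is the SparkDataFrame inside the `show()` method):
```r
# Grab the whole conf once instead of calling sparkR.conf() per key.
allConf <- sparkR.conf()
eagerEnabled <- allConf[["spark.sql.repl.eagerEval.enabled"]]
if (!is.null(eagerEnabled) && identical(eagerEnabled, "true")) {
  argsList <- list(x = object)
  maxNumRows <- allConf[["spark.sql.repl.eagerEval.maxNumRows"]]
  if (!is.null(maxNumRows)) {
    argsList$numRows <- as.numeric(maxNumRows)
  }
  # ... likewise for truncate, then do.call(showDF, argsList) ...
}
```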
[GitHub] spark issue #22305: [SPARK-24561][SQL][Python] User-defined window aggregati...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22305 @gatorsmile @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22275: [SPARK-25274][PYTHON][SQL] In toPandas with Arrow...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22275#discussion_r219404072 --- Diff: python/pyspark/sql/tests.py --- @@ -4434,6 +4434,12 @@ def test_timestamp_dst(self): self.assertPandasEqual(pdf, df_from_python.toPandas()) self.assertPandasEqual(pdf, df_from_pandas.toPandas()) +def test_toPandas_batch_order(self): +df = self.spark.range(64, numPartitions=8).toDF("a") +with self.sql_conf({"spark.sql.execution.arrow.maxRecordsPerBatch": 4}): +pdf, pdf_arrow = self._toPandas_arrow_toggle(df) +self.assertPandasEqual(pdf, pdf_arrow) --- End diff -- hm, is this test case "enough" to trigger any possible problem just by random? would increasing the number of batch or num record per batch increase the chance of streaming order or concurrency issue perhaps? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22275: [SPARK-25274][PYTHON][SQL] In toPandas with Arrow send o...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22275 got it. so the size of each batch could grow. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219402870 --- Diff: R/pkg/R/DataFrame.R --- @@ -244,11 +245,15 @@ setMethod("showDF", #' @note show(SparkDataFrame) since 1.4.0 setMethod("show", "SparkDataFrame", function(object) { -cols <- lapply(dtypes(object), function(l) { - paste(l, collapse = ":") -}) -s <- paste(cols, collapse = ", ") -cat(paste(class(object), "[", s, "]\n", sep = "")) +if (identical(sparkR.conf("spark.sql.repl.eagerEval.enabled", "false")[[1]], "true")) { --- End diff -- respecting `spark.sql.repl.eagerEval.maxNumRows` somewhat makes sense. but instead of changing `showDF` which has other cases beyond eagerEval, we could change where `showDF` is called by `show`, and pass the max rows. here https://github.com/apache/spark/pull/22455/files#diff-508641a8bd6c6b59f3e77c80cdcfa6a9R249 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
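Concretely, something like this inside `show()` (a sketch that leaves `showDF` itself untouched):
```r
# show() decides the row cap under eager evaluation and forwards it to showDF(),
# so showDF()'s behavior for its other callers is unchanged.
numRows <- as.numeric(sparkR.conf("spark.sql.repl.eagerEval.maxNumRows", "0")[[1]])
if (numRows > 0) {
  showDF(object, numRows = numRows)
} else {
  showDF(object)
}
```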
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219402281 --- Diff: R/pkg/R/DataFrame.R --- @@ -244,11 +245,15 @@ setMethod("showDF", #' @note show(SparkDataFrame) since 1.4.0 setMethod("show", "SparkDataFrame", function(object) { -cols <- lapply(dtypes(object), function(l) { - paste(l, collapse = ":") -}) -s <- paste(cols, collapse = ", ") -cat(paste(class(object), "[", s, "]\n", sep = "")) +if (identical(sparkR.conf("spark.sql.repl.eagerEval.enabled", "false")[[1]], "true")) { --- End diff -- ^ that - it's two different "overloads" for two different "class" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/7 long thread, are we all good with this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21649: [SPARK-23648][R][SQL]Adds more types for hint in SparkR
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21649 merged to master, thx --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-23648][R][SQL] Adds more types for hint in SparkR
Repository: spark Updated Branches: refs/heads/master 76399d75e -> 95b177c8f [SPARK-23648][R][SQL] Adds more types for hint in SparkR ## What changes were proposed in this pull request? Addition of numeric and list hints for SparkR. ## How was this patch tested? Add test in test_sparkSQL.R Author: Huaxin Gao Closes #21649 from huaxingao/spark-23648. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/95b177c8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/95b177c8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/95b177c8 Branch: refs/heads/master Commit: 95b177c8f0862c6965a7c3cd76b3935c975adee9 Parents: 76399d7 Author: Huaxin Gao Authored: Wed Sep 19 21:27:30 2018 -0700 Committer: Felix Cheung Committed: Wed Sep 19 21:27:30 2018 -0700 -- R/pkg/R/DataFrame.R | 12 +++- R/pkg/tests/fulltests/test_sparkSQL.R | 9 + 2 files changed, 20 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/95b177c8/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 458deca..a1cb478 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -3985,7 +3985,17 @@ setMethod("hint", signature(x = "SparkDataFrame", name = "character"), function(x, name, ...) { parameters <- list(...) -stopifnot(all(sapply(parameters, is.character))) +if (!all(sapply(parameters, function(y) { + if (is.character(y) || is.numeric(y)) { +TRUE + } else if (is.list(y)) { +all(sapply(y, function(z) { is.character(z) || is.numeric(z) })) + } else { +FALSE + } +}))) { + stop("sql hint should be character, numeric, or list with character or numeric.") +} jdf <- callJMethod(x@sdf, "hint", name, parameters) dataFrame(jdf) }) http://git-wip-us.apache.org/repos/asf/spark/blob/95b177c8/R/pkg/tests/fulltests/test_sparkSQL.R -- diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R b/R/pkg/tests/fulltests/test_sparkSQL.R index 0c4bdb3..40d8f80 100644 --- a/R/pkg/tests/fulltests/test_sparkSQL.R +++ b/R/pkg/tests/fulltests/test_sparkSQL.R @@ -2419,6 +2419,15 @@ test_that("join(), crossJoin() and merge() on a DataFrame", { expect_true(any(grepl("BroadcastHashJoin", execution_plan_broadcast))) }) +test_that("test hint", { + df <- sql("SELECT * FROM range(10e10)") + hintList <- list("hint2", "hint3", "hint4") + execution_plan_hint <- capture.output( +explain(hint(df, "hint1", 1.23456, "aa", hintList), TRUE) + ) + expect_true(any(grepl("1.23456, aa", execution_plan_hint))) +}) + test_that("toJSON() on DataFrame", { df <- as.DataFrame(cars) df_json <- toJSON(df) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
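For context, the relaxed parameter check in this commit allows hint calls along these lines (mirroring the new test; the hint names and values are illustrative):
```r
df <- sql("SELECT * FROM range(10e10)")
# character, numeric, and list parameters are now all accepted as hint arguments
explain(hint(df, "hint1", 1.23456, "aa", list("hint2", "hint3", "hint4")), TRUE)
```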
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219030775 --- Diff: R/pkg/R/DataFrame.R --- @@ -244,11 +244,15 @@ setMethod("showDF", #' @note show(SparkDataFrame) since 1.4.0 setMethod("show", "SparkDataFrame", function(object) { -cols <- lapply(dtypes(object), function(l) { - paste(l, collapse = ":") -}) -s <- paste(cols, collapse = ", ") -cat(paste(class(object), "[", s, "]\n", sep = "")) +if (identical(sparkR.conf("spark.sql.repl.eagerEval.enabled", "false")[[1]], "true")) { --- End diff -- also not sure if it's done for python, consider adding to the doc above (L229) how it behaves with eagerEval --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org