[spark] branch master updated: [SPARK-26853][SQL] Add example and version for commonly used aggregate function descriptions
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 5a74036  [SPARK-26853][SQL] Add example and version for commonly used aggregate function descriptions
5a74036 is described below

commit 5a7403623d0525c23ab8ae575e9d1383e3e10635
Author: Dilip Biswal
AuthorDate: Mon Feb 11 23:24:54 2019 -0800

    [SPARK-26853][SQL] Add example and version for commonly used aggregate function descriptions

    ## What changes were proposed in this pull request?

    This improves the expression description for commonly used aggregate functions such as Max, Min, Count, etc.

    ## How was this patch tested?

    Verified the function description manually from the shell.

    Closes #23756 from dilipbiswal/dkb_expr_description_aggregate.

    Authored-by: Dilip Biswal
    Signed-off-by: Dongjoon Hyun
---
 .../aggregate/ApproximatePercentile.scala        |  3 +-
 .../catalyst/expressions/aggregate/Average.scala | 10 -
 .../expressions/aggregate/CentralMomentAgg.scala | 52 +++---
 .../sql/catalyst/expressions/aggregate/Corr.scala  | 8 +++-
 .../sql/catalyst/expressions/aggregate/Count.scala | 12 -
 .../expressions/aggregate/CountMinSketchAgg.scala  | 3 +-
 .../expressions/aggregate/Covariance.scala         | 16 ++-
 .../sql/catalyst/expressions/aggregate/First.scala | 13 +-
 .../aggregate/HyperLogLogPlusPlus.scala            | 9 +++-
 .../sql/catalyst/expressions/aggregate/Last.scala  | 13 +-
 .../sql/catalyst/expressions/aggregate/Max.scala   | 8 +++-
 .../sql/catalyst/expressions/aggregate/Min.scala   | 8 +++-
 .../expressions/aggregate/Percentile.scala         | 10 -
 .../sql/catalyst/expressions/aggregate/Sum.scala   | 12 -
 .../expressions/aggregate/UnevaluableAggs.scala    | 27 +++
 .../catalyst/expressions/aggregate/collect.scala   | 16 ++-
 16 files changed, 195 insertions(+), 25 deletions(-)

diff --git
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala
index c790d87..ea0ed2e 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala
@@ -64,7 +64,8 @@ import org.apache.spark.sql.types._
       [10.0,10.0,10.0]
      > SELECT _FUNC_(10.0, 0.5, 100);
       10.0
-  """)
+  """,
+  since = "2.1.0")
 case class ApproximatePercentile(
     child: Expression,
     percentageExpression: Expression,

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala
index 8dd80dc..66ac730 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala
@@ -24,7 +24,15 @@ import org.apache.spark.sql.catalyst.util.TypeUtils
 import org.apache.spark.sql.types._

 @ExpressionDescription(
-  usage = "_FUNC_(expr) - Returns the mean calculated from values of a group.")
+  usage = "_FUNC_(expr) - Returns the mean calculated from values of a group.",
+  examples = """
+    Examples:
+      > SELECT _FUNC_(col) FROM VALUES (1), (2), (3) AS tab(col);
+       2.0
+      > SELECT _FUNC_(col) FROM VALUES (1), (2), (NULL) AS tab(col);
+       1.5
+  """,
+  since = "1.0.0")
 case class Average(child: Expression) extends DeclarativeAggregate with ImplicitCastInputTypes {

   override def prettyName: String = "avg"

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala
index e2ff0ef..1870c58 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala
@@ -135,7 +135,13 @@ abstract class CentralMomentAgg(child: Expression)
 // Compute the population standard deviation of a column
 // scalastyle:off line.size.limit
 @ExpressionDescription(
-  usage = "_FUNC_(expr) - Returns the population standard deviation calculated from values of a group.")
+  usage = "_FUNC_(expr) - Returns the population standard deviation calculated from values of a group.",
+  examples = """
+    Examples:
+      > SELECT _FUNC_(col) FROM VALUES (1), (2), (3) AS tab(col);
+       0.816496580927726
+  """,
+  since = "1.6.0")
//
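The examples being added are more than documentation sugar: the Average pair encodes the SQL rule that NULL inputs are skipped by aggregates, which is why avg over (1, 2, NULL) is 1.5 rather than 1.0. A quick standalone illustration of that rule (plain Java; this is not Spark's DeclarativeAggregate implementation):

```java
import java.util.Arrays;
import java.util.List;

public class NullIgnoringAvg {
    // SQL-style AVG: nulls contribute to neither the sum nor the count.
    static Double avg(List<Integer> values) {
        long count = 0;
        double sum = 0.0;
        for (Integer v : values) {
            if (v == null) continue; // the rule the doc examples demonstrate
            sum += v;
            count++;
        }
        return count == 0 ? null : sum / count; // AVG of all-NULL input is NULL
    }

    public static void main(String[] args) {
        System.out.println(avg(Arrays.asList(1, 2, 3)));    // 2.0
        System.out.println(avg(Arrays.asList(1, 2, null))); // 1.5
    }
}
```

An AVG over only NULLs returns NULL rather than zero, matching SQL semantics.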
[GitHub] felixcheung merged pull request #179: Update Thomas Graves Information
felixcheung merged pull request #179: Update Thomas Graves Information
URL: https://github.com/apache/spark-website/pull/179

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark-website] branch asf-site updated: Update Thomas Graves Information (#179)
This is an automated email from the ASF dual-hosted git repository.

felixcheung pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new 6205847  Update Thomas Graves Information (#179)
6205847 is described below

commit 6205847f851303dd8eee80dd25236a875ca3256a
Author: Thomas Graves
AuthorDate: Tue Feb 12 01:29:42 2019 -0600

    Update Thomas Graves Information (#179)

    * Update Thomas Graves Information

    Update my company info. I ran jekyll build and serve and it looked fine.

    * capitalize
---
 committers.md        | 2 +-
 site/committers.html | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/committers.md b/committers.md
index c5c4104..97b21a4 100644
--- a/committers.md
+++ b/committers.md
@@ -26,7 +26,7 @@ navigation:
 |Robert Evans|Oath|
 |Wenchen Fan|Databricks|
 |Joseph Gonzalez|UC Berkeley|
-|Thomas Graves|Oath|
+|Thomas Graves|NVIDIA|
 |Stephen Haberman|LinkedIn|
 |Mark Hamstra|ClearStory Data|
 |Seth Hendrickson|Cloudera|

diff --git a/site/committers.html b/site/committers.html
index 42c9171..a690ca7 100644
--- a/site/committers.html
+++ b/site/committers.html
@@ -276,7 +276,7 @@
 Thomas Graves
- Oath
+ NVIDIA
 Stephen Haberman
[spark] branch master updated: [SPARK-25158][SQL] Executor accidentally exit because ScriptTransformationWriterThread throw Exception.
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 5864e8e  [SPARK-25158][SQL] Executor accidentally exit because ScriptTransformationWriterThread throw Exception.
5864e8e is described below

commit 5864e8e47496c3a841b97632e5137de87a91efea
Author: yangjie01
AuthorDate: Tue Feb 12 12:16:33 2019 +0800

    [SPARK-25158][SQL] Executor accidentally exit because ScriptTransformationWriterThread throw Exception.

    ## What changes were proposed in this pull request?

    When a Spark SQL job uses the transform feature (`ScriptTransformationExec`) with `spark.speculation = true`, the job sometimes fails and many dead executors show up in the Executor tab. Analysis of the logs and code shows why: `ScriptTransformationExec` starts a new thread (`ScriptTransformationWriterThread`), and with speculation on, that thread is very likely to throw `TaskKilledException` (from the `iter.map.foreach` part). The exception is then captured by the `SparkUncaughtExceptionHandler` registered during executor start, which calls `System.exit(SparkExitCode.UNCAUGHT_EXCEPTION)` and shuts the executor down. This is unexpected: the executor should not be killed just because `ScriptTransformationWriterThread` fails. Logging the error (not only `TaskKilledException`) instead of rethrowing it is enough, since the exception is already passed to `ScriptTransformationExec` and handled by `TaskRunner`.

    ## How was this patch tested?

    A `TestUncaughtExceptionHandler` is registered in the test cases in `ScriptTransformationSuite`, which then assert that no uncaught exception was handled. Before this patch, the tests "script transformation should not swallow errors from upstream operators (no serde)" and "script transformation should not swallow errors from upstream operators (with serde)" threw an `IllegalArgumentException` that was handled by `TestUncaughtExceptionHandler`.
    Closes #22149 from LuciferYang/fix-transformation-task-kill.

    Authored-by: yangjie01
    Signed-off-by: Wenchen Fan
---
 .../hive/execution/ScriptTransformationExec.scala  |  5 +++-
 .../spark/sql/hive/execution/SQLQuerySuite.scala   | 33 +-
 .../hive/execution/ScriptTransformationSuite.scala | 32 -
 .../execution/TestUncaughtExceptionHandler.scala   | 31 
 4 files changed, 98 insertions(+), 3 deletions(-)

diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala
index 7b35a5f..905cb52 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala
@@ -308,12 +308,15 @@ private class ScriptTransformationWriterThread(
       }
       threwException = false
     } catch {
+      // SPARK-25158 Exception should not be thrown again, otherwise it will be captured by
+      // SparkUncaughtExceptionHandler, then Executor will exit because of this Uncaught Exception,
+      // so pass the exception to `ScriptTransformationExec` is enough.
       case t: Throwable =>
         // An error occurred while writing input, so kill the child process. According to the
         // Javadoc this call will not throw an exception:
         _exception = t
         proc.destroy()
-        throw t
+        logError("Thread-ScriptTransformation-Feed exit cause by: ", t)
     } finally {
       try {
         Utils.tryLogNonFatalError(outputStream.close())

diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
index d506edc..ce7661a 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
@@ -26,7 +26,7 @@ import java.util.{Locale, Set}
 import com.google.common.io.Files
 import org.apache.hadoop.fs.{FileSystem, Path}

-import org.apache.spark.TestUtils
+import org.apache.spark.{SparkException, TestUtils}
 import org.apache.spark.sql._
 import org.apache.spark.sql.catalyst.TableIdentifier
 import org.apache.spark.sql.catalyst.analysis.{EliminateSubqueryAliases, FunctionRegistry}
@@ -2348,4 +2348,35 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }

+  test("SPARK-25158: " +
+    "Executor accidentally exit because ScriptTransformationWriterThread throw Exception") {
+    withTempView("test") {
+      val defaultUncaughtExceptionHandler = Thread.getDefaultUncaughtExceptionHandler
+      try {
+        val
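The core of the fix generalizes beyond Spark: any exception that escapes a helper thread's run() reaches the JVM's default uncaught-exception handler, and Spark registers a handler that exits the process. A minimal sketch of the repaired behavior in plain Java (hypothetical names; Spark's real handler calls System.exit(SparkExitCode.UNCAUGHT_EXCEPTION) rather than recording the error):

```java
public class WriterThreadSketch {
    // Returns {errorSeenByDefaultHandler, errorRecordedForParent}.
    static Throwable[] demo() {
        Throwable[] result = new Throwable[2];
        Thread.UncaughtExceptionHandler old = Thread.getDefaultUncaughtExceptionHandler();
        // Stand-in for SparkUncaughtExceptionHandler: the real one kills the executor process.
        Thread.setDefaultUncaughtExceptionHandler((t, e) -> result[0] = e);

        Thread writer = new Thread(() -> {
            try {
                throw new RuntimeException("task killed while feeding child process");
            } catch (Throwable t) {
                // The SPARK-25158 pattern: record (and log) the failure for the parent
                // to rethrow; crucially, do NOT rethrow here, so the default handler
                // never fires and the process survives.
                result[1] = t;
            }
        });
        writer.start();
        try {
            writer.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
        Thread.setDefaultUncaughtExceptionHandler(old);
        return result;
    }

    public static void main(String[] args) {
        Throwable[] r = demo();
        System.out.println("uncaught handler fired: " + (r[0] != null));
        System.out.println("error recorded for parent: " + r[1].getMessage());
    }
}
```

Had the catch block rethrown, result[0] would be non-null and a real executor would have exited; the parent thread still sees the failure through the recorded field, just as `_exception` is consumed by ScriptTransformationExec.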
[GitHub] tgravescs commented on a change in pull request #179: Update Thomas Graves Information
tgravescs commented on a change in pull request #179: Update Thomas Graves Information
URL: https://github.com/apache/spark-website/pull/179#discussion_r255792402

## File path: committers.md

@@ -26,7 +26,7 @@ navigation:
 |Robert Evans|Oath|
 |Wenchen Fan|Databricks|
 |Joseph Gonzalez|UC Berkeley|
-|Thomas Graves|Oath|
+|Thomas Graves|Nvidia|

Review comment:
   Yes it should be.. thanks :)
[GitHub] srowen commented on a change in pull request #179: Update Thomas Graves Information
srowen commented on a change in pull request #179: Update Thomas Graves Information
URL: https://github.com/apache/spark-website/pull/179#discussion_r255787307

## File path: committers.md

@@ -26,7 +26,7 @@ navigation:
 |Robert Evans|Oath|
 |Wenchen Fan|Databricks|
 |Joseph Gonzalez|UC Berkeley|
-|Thomas Graves|Oath|
+|Thomas Graves|Nvidia|

Review comment:
   Super nit but isn't the company name NVIDIA :)
[spark] branch master updated: [SPARK-26696][SQL] Makes Dataset encoder public
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new b34b4c5  [SPARK-26696][SQL] Makes Dataset encoder public
b34b4c5 is described below

commit b34b4c59b4bcfa2c6007aa36e705d55af961d13f
Author: Simeon Simeonov
AuthorDate: Tue Feb 12 11:04:26 2019 +0800

    [SPARK-26696][SQL] Makes Dataset encoder public

    ## What changes were proposed in this pull request?

    Implements the solution proposed in [SPARK-26696](https://issues.apache.org/jira/browse/SPARK-26696), a minor refactoring that allows frameworks to perform advanced type-preserving dataset transformations without carrying `Encoder` implicits from user code. The change allows

    ```scala
    def foo[A](ds: Dataset[A]): Dataset[A] =
      ds.toDF().as[A](ds.encoder)
    ```

    instead of

    ```scala
    def foo[A: Encoder](ds: Dataset[A]): Dataset[A] =
      ds.toDF().as[A](implicitly[Encoder[A]])
    ```

    ## How was this patch tested?

    This patch was tested with an automated test that was later removed as it was deemed unnecessary per the discussion in this PR.

    Closes #23620 from ssimeonov/ss_SPARK-26696.

    Authored-by: Simeon Simeonov
    Signed-off-by: Wenchen Fan
---
 sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index 32f6234..8a26152 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -178,7 +178,7 @@ private[sql] object Dataset {
 class Dataset[T] private[sql](
     @transient val sparkSession: SparkSession,
     @DeveloperApi @Unstable @transient val queryExecution: QueryExecution,
-    encoder: Encoder[T])
+    @DeveloperApi @Unstable @transient val encoder: Encoder[T])
   extends Serializable {

   queryExecution.assertAnalyzed()
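The refactoring follows a common pattern: rather than forcing every caller to re-supply type evidence (the implicit Encoder[A]), the container exposes the evidence it was constructed with, so generic framework code can reuse it. A language-neutral sketch with hypothetical stand-in types (not Spark's actual Dataset or Encoder API):

```java
import java.util.Arrays;
import java.util.List;

public class EncoderSketch {
    // Stand-in for Spark's Encoder[T]: knows how to turn a raw row into a T.
    interface Encoder<T> { T fromRow(String row); }

    static final class Dataset<T> {
        final List<String> rows;
        final Encoder<T> encoder; // made public by SPARK-26696, so frameworks can reuse it

        Dataset(List<String> rows, Encoder<T> encoder) {
            this.rows = rows;
            this.encoder = encoder;
        }

        <U> Dataset<U> as(Encoder<U> enc) {
            return new Dataset<>(rows, enc);
        }
    }

    // No extra Encoder parameter needed: the dataset already carries its own,
    // mirroring ds.toDF().as[A](ds.encoder) from the commit message.
    static <A> Dataset<A> foo(Dataset<A> ds) {
        return ds.as(ds.encoder);
    }

    public static void main(String[] args) {
        Dataset<Integer> ds = new Dataset<>(Arrays.asList("42"), Integer::parseInt);
        System.out.println(foo(ds).encoder.fromRow("42")); // 42
    }
}
```

Before the change, foo would have needed its own Encoder parameter even though the incoming dataset was already built with one.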
[spark] branch master updated: [SPARK-26740][SPARK-26654][SQL] Make statistics of timestamp/date columns independent from system time zones
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 9c6efd0  [SPARK-26740][SPARK-26654][SQL] Make statistics of timestamp/date columns independent from system time zones
9c6efd0 is described below

commit 9c6efd0427b9268dafc901fbdecf8a6dec738654
Author: Maxim Gekk
AuthorDate: Tue Feb 12 10:58:00 2019 +0800

    [SPARK-26740][SPARK-26654][SQL] Make statistics of timestamp/date columns independent from system time zones

    ## What changes were proposed in this pull request?

    In the PR, I propose to convert the underlying types of timestamp/date columns to strings, and store the converted values as column statistics. This makes statistics for timestamp/date columns independent from the system time zone while saving and retrieving such statistics. I bumped the version of stored statistics from 1 to 2 since the PR changes the format.

    ## How was this patch tested?

    The changes were tested by `StatisticsCollectionSuite` and by `StatisticsSuite`.

    Closes #23662 from MaxGekk/column-stats-time-date.
    Lead-authored-by: Maxim Gekk
    Co-authored-by: Maxim Gekk
    Co-authored-by: Hyukjin Kwon
    Signed-off-by: Wenchen Fan
---
 .../spark/sql/catalyst/catalog/interface.scala     | 36 -
 .../sql/catalyst/plans/logical/Statistics.scala    |  7 ++-
 .../spark/sql/StatisticsCollectionSuite.scala      | 43 +++
 .../spark/sql/StatisticsCollectionTestBase.scala   | 62 --
 4 files changed, 105 insertions(+), 43 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
index 817abeb..69b5cb4 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
@@ -30,7 +30,7 @@ import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
 import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference, Cast, ExprId, Literal}
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils
-import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, DateTimeUtils}
+import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, DateFormatter, DateTimeUtils, TimestampFormatter}
 import org.apache.spark.sql.catalyst.util.quoteIdentifier
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
@@ -415,7 +415,8 @@ case class CatalogColumnStat(
     nullCount: Option[BigInt] = None,
     avgLen: Option[Long] = None,
     maxLen: Option[Long] = None,
-    histogram: Option[Histogram] = None) {
+    histogram: Option[Histogram] = None,
+    version: Int = CatalogColumnStat.VERSION) {

   /**
    * Returns a map from string to string that can be used to serialize the column stats.
@@ -429,7 +430,7 @@ case class CatalogColumnStat(
    */
   def toMap(colName: String): Map[String, String] = {
     val map = new scala.collection.mutable.HashMap[String, String]
-    map.put(s"${colName}.${CatalogColumnStat.KEY_VERSION}", "1")
+    map.put(s"${colName}.${CatalogColumnStat.KEY_VERSION}", CatalogColumnStat.VERSION.toString)
     distinctCount.foreach { v =>
       map.put(s"${colName}.${CatalogColumnStat.KEY_DISTINCT_COUNT}", v.toString)
     }
@@ -452,12 +453,13 @@ case class CatalogColumnStat(
       dataType: DataType): ColumnStat =
     ColumnStat(
       distinctCount = distinctCount,
-      min = min.map(CatalogColumnStat.fromExternalString(_, colName, dataType)),
-      max = max.map(CatalogColumnStat.fromExternalString(_, colName, dataType)),
+      min = min.map(CatalogColumnStat.fromExternalString(_, colName, dataType, version)),
+      max = max.map(CatalogColumnStat.fromExternalString(_, colName, dataType, version)),
       nullCount = nullCount,
       avgLen = avgLen,
       maxLen = maxLen,
-      histogram = histogram)
+      histogram = histogram,
+      version = version)
 }

 object CatalogColumnStat extends Logging {
@@ -472,14 +474,23 @@ object CatalogColumnStat extends Logging {
   private val KEY_MAX_LEN = "maxLen"
   private val KEY_HISTOGRAM = "histogram"

+  val VERSION = 2
+
+  private def getTimestampFormatter(): TimestampFormatter = {
+    TimestampFormatter(format = "yyyy-MM-dd HH:mm:ss.SSSSSS", timeZone = DateTimeUtils.TimeZoneUTC)
+  }
+
   /**
    * Converts from string representation of data type to the corresponding Catalyst data type.
    */
-  def fromExternalString(s: String, name: String, dataType: DataType): Any = {
+  def fromExternalString(s: String, name: String, dataType: DataType, version: Int): Any = {
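The key idea of the patch is that version-2 statistics render timestamp/date values through a formatter pinned to UTC, so the serialized string no longer depends on the JVM's default time zone. A plain-java.time sketch of that property (the pattern string mirrors the one in getTimestampFormatter(), but this is a stand-in, not Spark's TimestampFormatter):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.TimeZone;

public class StatsSerializationSketch {
    // Pinning the formatter to UTC makes the rendered string identical
    // regardless of what the JVM default zone happens to be.
    private static final DateTimeFormatter UTC_FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS").withZone(ZoneOffset.UTC);

    static String toExternalString(Instant ts) {
        return UTC_FMT.format(ts);
    }

    public static void main(String[] args) {
        Instant ts = Instant.parse("2019-02-12T10:58:00Z");

        TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"));
        String inLa = toExternalString(ts);

        TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai"));
        String inShanghai = toExternalString(ts);

        System.out.println(inLa);                      // 2019-02-12 10:58:00.000000
        System.out.println(inLa.equals(inShanghai));   // true: independent of system zone
    }
}
```

Serializing through the raw internal value and the default zone, by contrast, would produce different min/max strings on writers and readers in different zones, which is exactly the bug the version bump fixes.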
[GitHub] HyukjinKwon commented on issue #179: Update Thomas Graves Information
HyukjinKwon commented on issue #179: Update Thomas Graves Information
URL: https://github.com/apache/spark-website/pull/179#issuecomment-462592293

   I'm pretty sure it can be just pushed .. :D..
[GitHub] tgravescs commented on issue #179: Update Thomas Graves Information
tgravescs commented on issue #179: Update Thomas Graves Information
URL: https://github.com/apache/spark-website/pull/179#issuecomment-462585798

   @srowen @squito have time for a quick review?
[GitHub] maziyarpanahi opened a new pull request #180: Add Natural Language Processing for Apache Spark to third party list
maziyarpanahi opened a new pull request #180: Add Natural Language Processing for Apache Spark to third party list
URL: https://github.com/apache/spark-website/pull/180