I'd also add [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shut down and
[SPARK-21040][CORE] Speculate tasks which are running on decommissioned executors, two of the PRs merged after the decommissioning SPIP.

On Tue, Jul 21, 2020 at 10:53 AM Xingbo Jiang <jiangxb1...@gmail.com> wrote:

Hi all,

This is the bi-weekly Apache Spark digest from the Databricks OSS team.
For each API/configuration/behavior change, an [API] tag is added in the title.

CORE

[3.0][SPARK-31923][CORE] Ignore internal accumulators that use unrecognized types rather than crashing (+63, -5)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#70spark-31923core-ignore-internal-accumulators-that-use-unrecognized-types-rather-than-crashing-63--5>
<https://github.com/apache/spark/commit/b333ed0c4a5733a9c36ad79de1d4c13c6cf3c5d4>

A user may name their accumulators with the internal.metrics. prefix, in which case Spark treats them as internal accumulators and hides them from the UI. This change makes JsonProtocol.accumValueToJson more robust, so that it ignores internal accumulators that use unrecognized types rather than crashing.

[API][3.1][SPARK-31486][CORE] spark.submit.waitAppCompletion flag to control spark-submit exit in Standalone Cluster Mode (+88, -26)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#api80spark-31486core-sparksubmitwaitappcompletion-flag-to-control-spark-submit-exit-in-standalone-cluster-mode-88--26>
<https://github.com/apache/spark/commit/6befb2d8bdc5743d0333f4839cf301af165582ce>

This PR implements an application wait mechanism that allows spark-submit to wait until the application finishes in Standalone cluster mode. It delays the exit of the spark-submit JVM until the job completes, monitoring the application until it is finished, failed, or killed.
This is controlled via the following conf:

- spark.standalone.submit.waitAppCompletion (Default: false)
  In standalone cluster mode, controls whether the client waits to exit until the application completes. If set to true, the client process stays alive, polling the driver's status. Otherwise, the client process exits after submission.

SQL
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#sql>

[3.0][SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled (+27, -12)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#71spark-31220sql-repartition-obeys-initialpartitionnum-when-adaptiveexecutionenabled-27--12>
<https://github.com/apache/spark/commit/1d1eacde9d1b6fb75a20e4b909d221e70ad737db>

AQE and non-AQE use different configs to set the initial shuffle partition number. This PR fixes repartition/DISTRIBUTE BY so that, when AQE is enabled, it also uses the AQE config spark.sql.adaptive.coalescePartitions.initialPartitionNum to set the initial shuffle partition number.

[3.0][SPARK-31867][SQL][FOLLOWUP] Check result differences for datetime formatting (+51, -8)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#70spark-31867sqlfollowup-check-result-differences-for-datetime-formatting-51--8>
<https://github.com/apache/spark/commit/fc6af9d900ec6f6a1cbe8f987857a69e6ef600d1>

Spark should throw SparkUpgradeException when it gets a DateTimeException during datetime formatting under the EXCEPTION legacy time parser policy.
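The SPARK-31867 behavior amounts to translating a low-level datetime error into a migration hint for the user. A minimal pure-Python sketch of that error-translation pattern (the helper name `with_policy` and the message text are illustrative, not Spark's actual code; Python's ValueError plays the role of Java's DateTimeException):

```python
import datetime

class SparkUpgradeException(Exception):
    """Stand-in for Spark's exception hinting at a 2.4 -> 3.0 behavior change."""

def with_policy(fn, value, policy="EXCEPTION"):
    # Hypothetical helper: run a datetime operation and, under the
    # EXCEPTION policy, re-raise low-level failures as an upgrade hint.
    try:
        return fn(value)
    except ValueError as e:  # plays the role of Java's DateTimeException
        if policy == "EXCEPTION":
            raise SparkUpgradeException(
                "result may differ after upgrading to Spark 3.0; "
                "set spark.sql.legacy.timeParserPolicy=LEGACY to restore "
                "the old behavior") from e
        raise

# A value that parses fine passes through untouched.
print(with_policy(lambda v: datetime.datetime.strptime(v, "%Y").year, "2020"))  # 2020

# A value that fails surfaces as the upgrade exception instead.
try:
    with_policy(lambda v: datetime.datetime.strptime(v, "%Y"), "not-a-year")
except SparkUpgradeException as e:
    print("SparkUpgradeException:", e)
```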
[API][3.0][SPARK-31879][SPARK-31892][SQL] Disable week-based pattern letters in datetime parsing/formatting (+1421, -171) (+102, -48)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#api70spark-31879spark-31892sql-disable-week-based-pattern-letters-in-datetime-parsingformatting-1421--171-102--48>
<https://github.com/apache/spark/commit/9d5b5d0a5849ac329bbae26d9884d8843d8a8571>
<https://github.com/apache/spark/commit/afe95bd9ad7a07c49deecf05f0a1000bb8f80caa>

Week-based pattern letters have very weird behaviors during datetime parsing in Spark 2.4, and it is very hard to simulate the legacy behaviors with the new API. For formatting, the new API makes the start-of-week localized, so it is not possible to keep the legacy behaviors. Since week-based fields are rarely used, week-based pattern letters are disabled in both parsing and formatting.

[3.0][SPARK-31896][SQL] Handle am-pm timestamp parsing when hour is missing (+39, -3)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#70spark-31896sql-handle-am-pm-timestamp-parsing-when-hour-is-missing-39--3>
<https://github.com/apache/spark/commit/afcc14c6d27f9e0bd113e0d86b64dc6fa4eed551>

This PR sets the hour field to 0 or 12 when the AMPM_OF_DAY field is AM or PM during datetime parsing, keeping the behavior the same as Spark 2.4.

[API][3.1][SPARK-31830][SQL] Consistent error handling for datetime formatting and parsing functions (+126, -580)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#api80spark-31830sql-consistent-error-handling-for-datetime-formatting-and-parsing-functions-126--580>
<https://github.com/apache/spark/commit/6a424b93e5bdb79b1f1310cf48bd034397779e14>

When parsing/formatting datetime values, it is better to fail fast if the pattern string is invalid, instead of returning null for each input record.
The formatting functions such as date_format already do this; this PR applies the fail-fast behavior to the parsing functions: from_unixtime, unix_timestamp, to_unix_timestamp, to_timestamp and to_date.

[3.1][SPARK-31910][SQL] Enable Java 8 time API in Thrift server (+23, -0)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#80spark-31910sql-enable-java-8-time-api-in-thrift-server-23--0>
<https://github.com/apache/spark/commit/2c9988eaf31b7ebd97f2c2904ed7ee531eff0d20>

This PR enables the Java 8 time API in the Thrift server, so that the session timezone is used more consistently.

[2.4][SPARK-31935][SQL] Hadoop file system config should be effective in data source options (+52, -7)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#55spark-31935sql-hadoop-file-system-config-should-be-effective-in-data-source-options-52--7>
<https://github.com/apache/spark/commit/f3771c6b47d0b3aef10b86586289a1f675c7cfe2>

This PR fixes a bug where the Hadoop configs in read/write options are not respected in data source V1.

[API][2.4][SPARK-31968][SQL] Duplicate partition columns check when writing data (+12, -1)
<https://github.com/apache/spark/commit/a4ea599b1b9b8ebaae0100b54e6ac1d7576c6d8c>

Adds a check for duplicate partition columns when writing to the built-in file sources. After the change, users get an AnalysisException when writing a DataFrame with duplicate partition columns. Previously, the write would succeed, but reading the files with duplicate columns back would fail.

[API][3.0][SPARK-26905][SQL] Add TYPE in the ANSI non-reserved list (+2, -0)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#api71spark-26905sql-add-type-in-the-ansi-non-reserved-list-2--0>
<https://github.com/apache/spark/commit/e14029b18df10db5094f8abf8b9874dbc9186b4e>

Add TYPE to the ANSI non-reserved keyword list to follow the ANSI/SQL standard.
The change impacts the behavior only when ANSI mode is on (spark.sql.ansi.enabled=true).

[API][3.0][SPARK-26905][SQL] Follow the SQL:2016 reserved keywords (+429, -5)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#api71spark-26905sql-follow-the-sql2016-reserved-keywords-429--5>
<https://github.com/apache/spark/commit/3698a14204dd861ea3ee3c14aa923123b52caba1>

Move the keywords ANTI, SEMI, and MINUS from reserved to non-reserved to comply with the ANSI/SQL standard. The change impacts the behavior only when ANSI mode is on (spark.sql.ansi.enabled=true).

[API][3.0][SPARK-31939][SQL][TEST-JAVA11] Fix parsing day of year when year field pattern is missing (+465, -3)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#api70spark-31939sqltest-java11-fix-parsing-day-of-year-when-year-field-pattern-is-missing-465--3>
<https://github.com/apache/spark/commit/22dda6e18e91c6db6fa8ff9fafaafe09a79db4ea>

When a datetime pattern does not contain a year field (i.e. 'yyyy') but contains the day-of-year field (i.e. 'DD'), Spark should still respect the datetime pattern and parse the constants.
Before the change:

spark-sql> select to_timestamp('31', 'DD');
1970-01-01 00:00:00
spark-sql> select to_timestamp('31 30', 'DD dd');
1970-01-30 00:00:00

After the change:

spark-sql> select to_timestamp('31', 'DD');
1970-01-31 00:00:00
spark-sql> select to_timestamp('31 30', 'DD dd');
NULL

[3.0][SPARK-31956][SQL] Do not fail if there is no ambiguous self join (+7, -2)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#70spark-31956sql-do-not-fail-if-there-is-no-ambiguous-self-join-7--2>
<https://github.com/apache/spark/commit/c40051932290db3a63f80324900a116019b1e589>

df("col").as("name") is not a column reference anymore, and should not carry the special column metadata that is used to identify the root attribute (e.g., Dataset ID and column position). This PR fixes the corresponding regression, which could cause a DataFrame operation to fail even when there is no ambiguous self-join. An example:

val joined = df.join(spark.range(1)).select($"a")
joined.select(joined("a").alias("x"), sum(joined("a")).over(w))

[3.0][SPARK-31958][SQL] Normalize special floating numbers in subquery (+18, -4)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#70spark-31958sql-normalize-special-floating-numbers-in-subquery-18--4>
<https://github.com/apache/spark/commit/6fb9c80da129d0b43f9ff5b8be6ce8bad992a4ed>

The PR fixes a bug where special floating-point numbers in non-correlated subquery expressions were not handled; such subquery expressions are now handled by OptimizeSubqueries.
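The "special floating numbers" above are values such as NaN and -0.0, which break grouping and equality semantics unless normalized first. A minimal pure-Python sketch of why such normalization matters (an illustration of the problem, not Spark's implementation):

```python
import math
import struct

def normalize(x: float) -> float:
    """Map every NaN to one canonical NaN and -0.0 to 0.0, so that
    normalized values can safely serve as grouping/join keys."""
    if math.isnan(x):
        return float("nan")  # one canonical NaN bit pattern
    if x == 0.0:
        return 0.0           # collapses -0.0 into +0.0
    return x

def bits(x: float) -> bytes:
    # Grouping keys end up compared by binary representation,
    # where -0.0 and +0.0 are different byte strings.
    return struct.pack("<d", x)

print(bits(-0.0) == bits(0.0))                        # False: distinct keys
print(bits(normalize(-0.0)) == bits(normalize(0.0)))  # True after normalization
```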
[API][3.1][SPARK-21117][SQL] Built-in SQL Function Support - WIDTH_BUCKET (+431, -30)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#api80spark-21117sql-built-in-sql-function-support---width_bucket-431--30>
<https://github.com/apache/spark/commit/b1adc3deee00058cba669534aee156dc7af243dc>

Add a built-in SQL function WIDTH_BUCKET, which returns the bucket number to which a value would be assigned in an equiwidth histogram with num_bucket buckets, in the range min_value to max_value. Examples:

SELECT WIDTH_BUCKET(5.3, 0.2, 10.6, 5);
3
SELECT WIDTH_BUCKET(-2.1, 1.3, 3.4, 3);
0
SELECT WIDTH_BUCKET(8.1, 0.0, 5.7, 4);
5
SELECT WIDTH_BUCKET(-0.9, 5.2, 0.5, 2);
3

[3.1][SPARK-27217][SQL] Nested column aliasing for more operators which can prune nested column (+190, -10)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-27217sql-nested-column-aliasing-for-more-operators-which-can-prune-nested-column-190--10>
<https://github.com/apache/spark/commit/43063e2db2bf7469f985f1954d8615b95cf5c578>

Support nested column pruning from an Aggregate or Expand operator.

[3.1][SPARK-27633][SQL] Remove redundant aliases in NestedColumnAliasing (+43, -1)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-27633sql-remove-redundant-aliases-in-nestedcolumnaliasing-43--1>
<https://github.com/apache/spark/commit/8282bbf12d4e174986a649023ce3984aae7d7755>

Avoid generating redundant aliases when the parent nested field is aliased in the NestedColumnAliasing rule. This slightly improves performance.
[3.1][SPARK-31736][SQL] Nested column aliasing for RepartitionByExpression/Join (+197, -16)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-31736sql-nested-column-aliasing-for-repartitionbyexpressionjoin-197--16>
<https://github.com/apache/spark/commit/ff89b1114319e783eb4f4187bf2583e5e21c64e4>

Support nested column pruning from a RepartitionByExpression or Join operator.

ML

[3.1][SPARK-31925][ML] Summary.totalIterations greater than maxIters (+43, -12)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-31925ml-summarytotaliterations-greater-than-maxiters-43--12>
<https://github.com/apache/spark/commit/f83cb3cbb3ce3f22fd122bce620917bfd0699ce7>

The PR fixes a correctness issue in LogisticRegression and LinearRegression: the actual number of training iterations was larger by 1 than the specified maxIter.

[3.1][SPARK-31944] Add instance weight support in LinearRegressionSummary (+56, -24)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-31944-add-instance-weight-support-in-linearregressionsummary-56--24>
<https://github.com/apache/spark/commit/89c98a4c7068734e322d335cb7c9f22379ff00e8>

The PR adds instance weight support in LinearRegressionSummary; instance weight is already supported by LinearRegression and RegressionMetrics.

SS
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#ss>

[3.0][SPARK-31593][SS] Remove unnecessary streaming query progress update (+58, -7)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#71spark-31593ss-remove-unnecessary-streaming-query-progress-update-58--7>
<https://github.com/apache/spark/commit/1e40bccf447dccad9d31bccc75d21b8fca77ba52>

The PR fixes a bug that sets incorrect metrics in Structured Streaming. A progress update should be made every 10 seconds when a stream doesn't have any new data upstream.
Without the fix, the input information is zeroed out but the output information is not when making such a progress update.

[3.0][SPARK-31990][SS] Use toSet.toSeq in Dataset.dropDuplicates (+3, -1)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#70spark-31990ss-use-tosettoseq-in-datasetdropduplicates-3--1>
<https://github.com/apache/spark/commit/7f7b4dd5199e7c185aedf51fccc400c7072bed05>

The PR proposes to preserve the input order of colNames for groupCols in Dataset.dropDuplicates, because Streaming's state store depends on the groupCols order.

[3.1][SPARK-24634][SS] Add a new metric regarding number of inputs later than watermark plus allowed delay (+94, -29)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-24634ss-add-a-new-metric-regarding-number-of-inputs-later-than-watermark-plus-allowed-delay-94--29>
<https://github.com/apache/spark/commit/84815d05503460d58b85be52421d5923474aa08b>

Add a new metric numLateInputs to count the number of inputs which are later than the watermark plus the allowed delay ('inputs' are relative to operators). The new metric is provided both on the Spark UI (SQL tab, query execution details page) and in the Streaming Query Listener.

PYTHON
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#python>

[API][3.0][SPARK-31895][PYTHON][SQL] Support DataFrame.explain(extended: str) case to be consistent with Scala side (+24, -11)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#api70spark-31895pythonsql-support-dataframeexplainextended-str-case-to-be-consistent-with-scala-side-24--11>
<https://github.com/apache/spark/commit/e1d52011401c1989f26b230eb8c82adc63e147e7>

Improves DataFrame.explain in PySpark so that it also accepts an explain-mode string, consistent with the Scala API.
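The order-sensitivity issue behind SPARK-31990 above can be illustrated in plain Python: deduplicating the column list must preserve the caller's input order, because downstream state (here, the streaming state store) keys off that order. A sketch of the two behaviors (an analogy, not Spark's Scala code):

```python
def dedup_unordered(cols):
    # A plain set discards the caller's column order (analogous to the bug);
    # sorted() here just makes the reordering visible and deterministic.
    return sorted(set(cols))

def dedup_preserving(cols):
    # dict preserves insertion order (Python 3.7+), so the first occurrence
    # of each column keeps its original position.
    return list(dict.fromkeys(cols))

cols = ["value", "key", "value"]
print(dedup_unordered(cols))   # ['key', 'value'] -- order changed
print(dedup_preserving(cols))  # ['value', 'key'] -- input order kept
```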
[3.0][SPARK-31915][SQL][PYTHON] Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs (+37, -8)
<https://github.com/apache/spark/commit/00d06cad564d5e3e5f78a687776d02fe0695a861>

The PR proposes to resolve the grouping attributes separately first, so they can be properly referred to when FlatMapGroupsInPandas and FlatMapCoGroupsInPandas are resolved without ambiguity. Example:

from pyspark.sql.functions import *

df = spark.createDataFrame([[1, 1]], ["column", "Score"])

@pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP)
def my_pandas_udf(pdf):
    return pdf.assign(Score=0.5)

df.groupby('COLUMN').apply(my_pandas_udf).show()

df1 = spark.createDataFrame([(1, 1)], ("column", "value"))
df2 = spark.createDataFrame([(1, 1)], ("column", "value"))

df1.groupby("COLUMN").cogroup(
    df2.groupby("COLUMN")
).applyInPandas(lambda r, l: r + l, df1.schema).show()

Before:

pyspark.sql.utils.AnalysisException: Reference 'COLUMN' is ambiguous, could be: COLUMN, COLUMN.;

pyspark.sql.utils.AnalysisException: cannot resolve '`COLUMN`' given input columns: [COLUMN, COLUMN, value, value];;
'FlatMapCoGroupsInPandas ['COLUMN], ['COLUMN], <lambda>(column#9L, value#10L, column#13L, value#14L), [column#22L, value#23L]
:- Project [COLUMN#9L, column#9L, value#10L]
:  +- LogicalRDD [column#9L, value#10L], false
+- Project [COLUMN#13L, column#13L, value#14L]
   +- LogicalRDD [column#13L, value#14L], false

After:

+------+-----+
|column|Score|
+------+-----+
|     1|  0.5|
+------+-----+

+------+-----+
|column|value|
+------+-----+
|     2|    2|
+------+-----+

[3.1][SPARK-31945][SQL][PYSPARK] Enable cache for the same Python function (+25, -4)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-31945sqlpyspark-enable-cache-for-the-same-python-function-25--4>
<https://github.com/apache/spark/commit/032d17933b4009ed8a9d70585434ccdbf4d1d7df>

This PR proposes to make PythonFunction hold Seq[Byte] instead of Array[Byte], so that the serialized function bytes can be compared by value. With this change, the cache manager detects that two functions are the same and reuses the existing cache entry if one exists.

[3.1][SPARK-31964][PYTHON] Use Pandas is_categorical on Arrow category type conversion (+2, -5)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-31964python-use-pandas-is_categorical-on-arrow-category-type-conversion-2--5>
<https://github.com/apache/spark/commit/b7ef5294f17d54e7d90e36a4be02e8bd67200144>

When using PyArrow to convert a Pandas categorical column, use is_categorical instead of trying to import CategoricalDtype, because the former is a more stable API.

UI
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#ui>

[3.0][SPARK-31903][SQL][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI (+4, -4)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#70spark-31903sqlpysparkr-fix-topandas-with-arrow-enabled-to-show-metrics-in-query-ui-4--4>
<https://github.com/apache/spark/commit/632b5bce23c94d25712b43be83252b34ebfd3e72>

In Dataset.collectAsArrowToR and Dataset.collectAsArrowToPython, the code block for serveToStream runs in a separate thread, so withAction finishes as soon as it starts the thread. As a result, it doesn't collect the metrics of the actual action, and the Query UI shows the plan graph without metrics. This PR fixes the issue.
The affected functions are:

- collect() in SparkR
- DataFrame.toPandas() in PySpark

[3.0][SPARK-31886][WEBUI] Fix the wrong coloring of nodes in DAG-viz (+33, -3)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#70spark-31886webui-fix-the-wrong-coloring-of-nodes-in-dag-viz-33--3>
<https://github.com/apache/spark/commit/8ed93c9355bc2af6fe456d88aa693c8db69d0bbf>

In the Job Page and Stage Page, nodes associated with "barrier mode" in the DAG-viz are colored pale green. But for some types of jobs, nodes that are not associated with the mode were also colored. This PR fixes that.

[3.1][SPARK-29431][WEBUI] Improve Web UI / SQL tab visualization with cached dataframes (+46, -0)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#80spark-29431webui-improve-web-ui--sql-tab-visualization-with-cached-dataframes-46--0>
<https://github.com/apache/spark/commit/e4db3b5b1742b4bdfa32937273e5d07a76cde79b>

Display the query plan of cached DataFrames as well in the web UI.

[2.4][SPARK-31967][UI] Downgrade to vis.js 4.21.0 to fix Jobs UI loading time regression (+49, -86)
<https://github.com/apache/spark/commit/f535004e14b197ceb1f2108a67b033c052d65bcb>

Fix a serious performance issue in the web UI by rolling vis-timeline-graph2d back to 4.21.0.

[3.1][SPARK-30119][WEBUI] Support pagination for streaming tab (+259, -178)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-30119webui-support-pagination-for-streaming-tab-259--178>
<https://github.com/apache/spark/commit/9b098f1eb91a5e9f488d573bfeea3f6bfd9b95b3>

The PR adds pagination support for the streaming tab.
[3.1][SPARK-31642][FOLLOWUP] Fix Sorting for duration column and make Status column sortable (+7, -6)
<https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-31642followup-fix-sorting-for-duration-column-and-make-status-column-sortable-7--6>
<https://github.com/apache/spark/commit/f5f6eee3045e90e02fc7e999f616b5a021d7c724>

The PR improves the pagination support in the streaming tab by fixing a wrong sort order in the Duration column and making the Status column sortable.

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau