Got it, I missed the date in the reading :)

On Tue, Jul 21, 2020 at 11:23 AM Xingbo Jiang <jiangxb1...@gmail.com> wrote:

> Hi Holden,
>
> This is the digest for commits merged between *June 3 and June 16.* The
> commits you mentioned will be included in future digests.
>
> Cheers,
>
> Xingbo
>
> On Tue, Jul 21, 2020 at 11:13 AM Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
>> I'd also add [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are
>> being shutdown &
>>
>> [SPARK-21040][CORE] Speculate tasks which are running on decommission
>> executors, two of the PRs merged after the decommissioning SPIP.
>>
>> On Tue, Jul 21, 2020 at 10:53 AM Xingbo Jiang <jiangxb1...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> This is the bi-weekly Apache Spark digest from the Databricks OSS team.
>>> For each API/configuration/behavior change, an *[API]* tag is added in
>>> the title.
>>>
>>> CORE
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#70spark-31923core-ignore-internal-accumulators-that-use-unrecognized-types-rather-than-crashing-63--5>[3.0][SPARK-31923][CORE]
>>> Ignore internal accumulators that use unrecognized types rather than
>>> crashing (+63, -5)>
>>> <https://github.com/apache/spark/commit/b333ed0c4a5733a9c36ad79de1d4c13c6cf3c5d4>
>>>
>>> A user may name their accumulators with the internal.metrics. prefix, so
>>> that Spark treats them as internal accumulators and hides them from the UI.
>>> We should make JsonProtocol.accumValueToJson more robust and let it ignore
>>> internal accumulators that use unrecognized value types.
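>>>
>>> As a minimal sketch (the accumulator name and values are hypothetical), a
>>> user accumulator named with the internal.metrics. prefix whose value type
>>> is not one Spark recognizes is now skipped during event-log serialization
>>> instead of crashing it:
>>>
>>> // Scala, assuming a spark-shell session with `sc` in scope
>>> // A collection accumulator holds a java.util.List, which JsonProtocol does
>>> // not recognize as an internal accumulator value type.
>>> val acc = sc.collectionAccumulator[String]("internal.metrics.myDebugValues")
>>> sc.parallelize(1 to 10).foreach(i => acc.add("value-" + i))
>>> // With this change, JsonProtocol.accumValueToJson ignores the value rather
>>> // than throwing while writing event logs.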
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#api80spark-31486core-sparksubmitwaitappcompletion-flag-to-control-spark-submit-exit-in-standalone-cluster-mode-88--26>[API][3.1][SPARK-31486][CORE]
>>> spark.submit.waitAppCompletion flag to control spark-submit exit in
>>> Standalone Cluster Mode (+88, -26)>
>>> <https://github.com/apache/spark/commit/6befb2d8bdc5743d0333f4839cf301af165582ce>
>>>
>>> This PR implements an application wait mechanism that allows
>>> spark-submit to wait until the application finishes in Standalone cluster
>>> mode, delaying the exit of the spark-submit JVM until the application is
>>> completed. The client keeps monitoring the application until it is either
>>> finished, failed, or killed. The behavior is controlled via the following
>>> conf:
>>>
>>>    - spark.standalone.submit.waitAppCompletion (Default: false)
>>>
>>>      In standalone cluster mode, controls whether the client waits to
>>>      exit until the application completes. If set to true, the client
>>>      process will stay alive, polling the driver's status. Otherwise, the
>>>      client process will exit after submission.
>>>
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#sql>
>>> SQL
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#71spark-31220sql-repartition-obeys-initialpartitionnum-when-adaptiveexecutionenabled-27--12>[3.0][SPARK-31220][SQL]
>>> repartition obeys initialPartitionNum when adaptiveExecutionEnabled (+27,
>>> -12)>
>>> <https://github.com/apache/spark/commit/1d1eacde9d1b6fb75a20e4b909d221e70ad737db>
>>>
>>> AQE and non-AQE use different configs to set the initial shuffle
>>> partition number. This PR fixes repartition/DISTRIBUTE BY so that they
>>> also use the AQE config
>>> spark.sql.adaptive.coalescePartitions.initialPartitionNum to set the
>>> initial shuffle partition number when AQE is enabled.
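>>>
>>> As a minimal sketch (config values and the output path are hypothetical),
>>> the configs involved look like this; with AQE on, the shuffle introduced by
>>> repartition now starts from the AQE initial partition number instead of
>>> spark.sql.shuffle.partitions:
>>>
>>> // Scala, assuming a spark-shell session
>>> import spark.implicits._
>>> spark.conf.set("spark.sql.adaptive.enabled", "true")
>>> spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "200")
>>> val df = spark.range(0, 1000000).withColumn("key", $"id" % 10)
>>> // The repartition shuffle starts with 200 partitions before AQE coalesces them.
>>> df.repartition($"key").write.mode("overwrite").parquet("/tmp/repartition_demo")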
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#70spark-31867sqlfollowup-check-result-differences-for-datetime-formatting-51--8>[3.0][SPARK-31867][SQL][FOLLOWUP]
>>> Check result differences for datetime formatting (+51, -8)>
>>> <https://github.com/apache/spark/commit/fc6af9d900ec6f6a1cbe8f987857a69e6ef600d1>
>>>
>>> Spark should throw SparkUpgradeException when it hits a DateTimeException
>>> during datetime formatting and the legacy time parser policy is EXCEPTION.
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#api70spark-31879spark-31892sql-disable-week-based-pattern-letters-in-datetime-parsingformatting-1421--171-102--48>[API][3.0][SPARK-31879][SPARK-31892][SQL]
>>> Disable week-based pattern letters in datetime parsing/formatting (+1421,
>>> -171)>
>>> <https://github.com/apache/spark/commit/9d5b5d0a5849ac329bbae26d9884d8843d8a8571>
>>>  (+102,
>>> -48)>
>>> <https://github.com/apache/spark/commit/afe95bd9ad7a07c49deecf05f0a1000bb8f80caa>
>>>
>>> Week-based pattern letters have very weird behaviors during datetime
>>> parsing in Spark 2.4, and it's very hard to simulate the legacy behaviors
>>> with the new API. For formatting, the new API makes the start-of-week
>>> localized, and it's not possible to keep the legacy behaviors. Since the
>>> week-based fields are rarely used, we disable week-based pattern letters in
>>> both parsing and formatting.
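>>>
>>> As a minimal sketch (the pattern below is hypothetical), a pattern that uses
>>> week-based letters such as 'Y' or 'w' is now expected to be rejected rather
>>> than silently interpreted:
>>>
>>> // Scala
>>> spark.sql("SELECT date_format(current_timestamp(), 'YYYY-ww')").show()
>>> // expected: an error pointing at the unsupported week-based pattern letters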
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#70spark-31896sql-handle-am-pm-timestamp-parsing-when-hour-is-missing-39--3>[3.0][SPARK-31896][SQL]
>>> Handle am-pm timestamp parsing when hour is missing (+39, -3)>
>>> <https://github.com/apache/spark/commit/afcc14c6d27f9e0bd113e0d86b64dc6fa4eed551>
>>>
>>> This PR sets the hour field to 0 or 12 when the AMPM_OF_DAY field is AM
>>> or PM during datetime parsing, to keep the behavior the same as Spark 2.4.
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#api80spark-31830sql-consistent-error-handling-for-datetime-formatting-and-parsing-functions-126--580>[API][3.1][SPARK-31830][SQL]
>>> Consistent error handling for datetime formatting and parsing functions
>>> (+126, -580)>
>>> <https://github.com/apache/spark/commit/6a424b93e5bdb79b1f1310cf48bd034397779e14>
>>>
>>> When parsing/formatting datetime values, it's better to fail fast if the
>>> pattern string is invalid, instead of returning null for each input record.
>>> Formatting functions such as date_format already do this; this PR applies
>>> the fail-fast behavior to the parsing functions: from_unixtime,
>>> unix_timestamp, to_unix_timestamp, to_timestamp, and to_date.
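>>>
>>> As a minimal sketch (the input and pattern are hypothetical), an invalid
>>> pattern now fails fast rather than turning every row into NULL:
>>>
>>> // Scala
>>> spark.sql("SELECT to_timestamp('2020-06-16', 'foo')").show()
>>> // expected: an error about the invalid pattern 'foo', matching the existing
>>> // fail-fast behavior of formatting functions such as date_format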
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#80spark-31910sql-enable-java-8-time-api-in-thrift-server-23--0>[3.1][SPARK-31910][SQL]
>>> Enable Java 8 time API in Thrift server (+23, -0)>
>>> <https://github.com/apache/spark/commit/2c9988eaf31b7ebd97f2c2904ed7ee531eff0d20>
>>>
>>> This PR enables the Java 8 time API in the Thrift server, so that the
>>> session time zone is used more consistently.
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#55spark-31935sql-hadoop-file-system-config-should-be-effective-in-data-source-options-52--7>[2.4][SPARK-31935][SQL]
>>> Hadoop file system config should be effective in data source options (+52,
>>> -7)>
>>> <https://github.com/apache/spark/commit/f3771c6b47d0b3aef10b86586289a1f675c7cfe2>
>>>
>>> This PR fixes a bug where Hadoop configs passed in read/write options
>>> were not respected in data source V1.
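>>>
>>> As a minimal sketch (the bucket, path, and credential values are
>>> hypothetical), per-query Hadoop file-system settings passed as data source
>>> options are now honored by the v1 code path:
>>>
>>> // Scala
>>> val events = spark.read
>>>   .option("fs.s3a.access.key", "<access-key>")
>>>   .option("fs.s3a.secret.key", "<secret-key>")
>>>   .json("s3a://my-bucket/events/")
>>>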
>>> [API][2.4][SPARK-31968][SQL] Duplicate partition columns check when
>>> writing data (+12, -1)>
>>> <https://github.com/apache/spark/commit/a4ea599b1b9b8ebaae0100b54e6ac1d7576c6d8c>
>>>
>>> Add a check for duplicate partition columns when writing with built-in file
>>> sources. After the change, when the DataFrame has duplicate partition
>>> columns, users get an AnalysisException when writing it. Previously, the
>>> write would succeed, but reading the files with duplicate columns would
>>> fail.
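>>>
>>> As a minimal sketch (column names and the output path are hypothetical), the
>>> case that now fails at write time looks like this:
>>>
>>> // Scala, assuming a spark-shell session
>>> import spark.implicits._
>>> val df = Seq((1, "a"), (2, "b")).toDF("id", "part")
>>> // expected: AnalysisException about the duplicate partition column "part",
>>> // instead of a write that succeeds but produces unreadable output
>>> df.write.partitionBy("part", "part").parquet("/tmp/dup_partition_demo")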
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#api71spark-26905sql-add-type-in-the-ansi-non-reserved-list-2--0>[API][3.0][SPARK-26905][SQL]
>>> Add TYPE in the ANSI non-reserved list (+2, -0)>
>>> <https://github.com/apache/spark/commit/e14029b18df10db5094f8abf8b9874dbc9186b4e>
>>>
>>> Add TYPE to the ANSI non-reserved list to follow the ANSI/SQL standard.
>>> The change impacts the behavior only when ANSI mode is on
>>> (spark.sql.ansi.enabled=true).
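>>>
>>> As a minimal sketch (table and column names are hypothetical), with ANSI
>>> mode on, TYPE is expected to be usable as an identifier, e.g. a column name:
>>>
>>> // Scala
>>> spark.conf.set("spark.sql.ansi.enabled", "true")
>>> spark.sql("CREATE TABLE events (type STRING, ts TIMESTAMP) USING parquet")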
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#api71spark-26905sql-follow-the-sql2016-reserved-keywords-429--5>[API][3.0][SPARK-26905][SQL]
>>> Follow the SQL:2016 reserved keywords (+429, -5)>
>>> <https://github.com/apache/spark/commit/3698a14204dd861ea3ee3c14aa923123b52caba1>
>>>
>>> Move the keywords ANTI, SEMI, and MINUS from reserved to non-reserved to
>>> comply with the ANSI/SQL standard. The change impacts the behavior only
>>> when ANSI mode is on (spark.sql.ansi.enabled=true).
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#api70spark-31939sqltest-java11-fix-parsing-day-of-year-when-year-field-pattern-is-missing-465--3>[API][3.0][SPARK-31939][SQL][TEST-JAVA11]
>>> Fix Parsing day of year when year field pattern is missing (+465, -3)>
>>> <https://github.com/apache/spark/commit/22dda6e18e91c6db6fa8ff9fafaafe09a79db4ea>
>>>
>>> When a datetime pattern does not contain a year field (e.g., 'yyyy') but
>>> contains a day-of-year field (e.g., 'DD'), Spark should still be able to
>>> respect the datetime pattern and parse the values correctly.
>>>
>>> Before the change:
>>>
>>> spark-sql> select to_timestamp('31', 'DD');
>>> 1970-01-01 00:00:00
>>> spark-sql> select to_timestamp('31 30', 'DD dd');
>>> 1970-01-30 00:00:00
>>>
>>> After the change:
>>>
>>> spark-sql> select to_timestamp('31', 'DD');
>>> 1970-01-31 00:00:00
>>> spark-sql> select to_timestamp('31 30', 'DD dd');
>>> NULL
>>>
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#70spark-31956sql-do-not-fail-if-there-is-no-ambiguous-self-join-7--2>[3.0][SPARK-31956][SQL]
>>> Do not fail if there is no ambiguous self join (+7, -2)>
>>> <https://github.com/apache/spark/commit/c40051932290db3a63f80324900a116019b1e589>
>>>
>>> df("col").as("name") is not a column reference anymore, and should not
>>> have the special column metadata that is used to identify the root
>>> attribute (e.g., Dataset ID and col position). This PR fixes the
>>> corresponding regression that could cause a DataFrame could fail even when
>>> there is no ambiguous self-join. Below is an example,
>>>
>>> // df is assumed to be a DataFrame with a column "a"; w is a window spec,
>>> // e.g. Window.orderBy($"a") (both defined elsewhere)
>>> val joined = df.join(spark.range(1)).select($"a")
>>> joined.select(joined("a").alias("x"), sum(joined("a")).over(w))
>>>
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#70spark-31958sql-normalize-special-floating-numbers-in-subquery-18--4>[3.0][SPARK-31958][SQL]
>>> normalize special floating numbers in subquery (+18, -4)>
>>> <https://github.com/apache/spark/commit/6fb9c80da129d0b43f9ff5b8be6ce8bad992a4ed>
>>>
>>> The PR fixes a bug where special floating-point values (such as NaN and
>>> -0.0) in non-correlated subquery expressions were not normalized; such
>>> subquery expressions are now handled by OptimizeSubqueries.
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#api80spark-21117sql-built-in-sql-function-support---width_bucket-431--30>[API][3.1][SPARK-21117][SQL]
>>> Built-in SQL Function Support - WIDTH_BUCKET (+431, -30)>
>>> <https://github.com/apache/spark/commit/b1adc3deee00058cba669534aee156dc7af243dc>
>>>
>>> Add a built-in SQL function WIDTH_BUCKET, which returns the bucket
>>> number to which value would be assigned in an equi-width histogram with
>>> num_bucket buckets, in the range min_value to max_value. Examples:
>>>
>>> > SELECT WIDTH_BUCKET(5.3, 0.2, 10.6, 5);
>>> 3
>>> > SELECT WIDTH_BUCKET(-2.1, 1.3, 3.4, 3);
>>> 0
>>> > SELECT WIDTH_BUCKET(8.1, 0.0, 5.7, 4);
>>> 5
>>> > SELECT WIDTH_BUCKET(-0.9, 5.2, 0.5, 2);
>>> 3
>>>
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-27217sql-nested-column-aliasing-for-more-operators-which-can-prune-nested-column-190--10>[3.1][SPARK-27217][SQL]
>>> Nested column aliasing for more operators which can prune nested column
>>> (+190, -10)>
>>> <https://github.com/apache/spark/commit/43063e2db2bf7469f985f1954d8615b95cf5c578>
>>>
>>> Support nested column pruning from an Aggregate or Expand operator.
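>>>
>>> As a minimal sketch (schema and path are hypothetical), grouping on a nested
>>> field should now read only that field from the files instead of the whole
>>> struct:
>>>
>>> // Scala, assuming data with schema: name STRUCT<first: STRING, last: STRING>, age INT
>>> import spark.implicits._
>>> val people = spark.read.parquet("/tmp/people_nested")
>>> people.groupBy($"name.first").count().explain()
>>> // the scan's ReadSchema should now contain only name.first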
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-27633sql-remove-redundant-aliases-in-nestedcolumnaliasing-43--1>[3.1][SPARK-27633][SQL]
>>> Remove redundant aliases in NestedColumnAliasing (+43, -1)>
>>> <https://github.com/apache/spark/commit/8282bbf12d4e174986a649023ce3984aae7d7755>
>>>
>>> In the NestedColumnAliasing rule, avoid generating redundant aliases when
>>> the parent nested field is already aliased. This slightly improves
>>> performance.
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-31736sql-nested-column-aliasing-for-repartitionbyexpressionjoin-197--16>[3.1][SPARK-31736][SQL]
>>> Nested column aliasing for RepartitionByExpression/Join (+197, -16)>
>>> <https://github.com/apache/spark/commit/ff89b1114319e783eb4f4187bf2583e5e21c64e4>
>>>
>>> Support nested column pruning from a RepartitionByExpression or Join
>>> operator.
>>> ML
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-31925ml-summarytotaliterations-greater-than-maxiters-43--12>[3.1][SPARK-31925][ML]
>>> Summary.totalIterations greater than maxIters (+43, -12)>
>>> <https://github.com/apache/spark/commit/f83cb3cbb3ce3f22fd122bce620917bfd0699ce7>
>>>
>>> The PR fixes a correctness issue in LogisticRegression and
>>> LinearRegression, where the actual number of training iterations was
>>> larger by 1 than the specified maxIter.
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-31944-add-instance-weight-support-in-linearregressionsummary-56--24>[3.1][SPARK-31944]
>>> Add instance weight support in LinearRegressionSummary (+56, -24)>
>>> <https://github.com/apache/spark/commit/89c98a4c7068734e322d335cb7c9f22379ff00e8>
>>>
>>> The PR adds instance weight support to LinearRegressionSummary; instance
>>> weight is already supported by LinearRegression and RegressionMetrics.
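>>>
>>> As a minimal sketch (the data path and weight values are hypothetical),
>>> fitting with a weight column now flows through to the summary metrics:
>>>
>>> // Scala
>>> import org.apache.spark.ml.regression.LinearRegression
>>> import org.apache.spark.sql.functions.lit
>>> val training = spark.read.format("libsvm")
>>>   .load("data/mllib/sample_linear_regression_data.txt")
>>>   .withColumn("weight", lit(2.0))
>>> val model = new LinearRegression().setWeightCol("weight").setMaxIter(10).fit(training)
>>> println(model.summary.rootMeanSquaredError)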
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#ss>
>>> SS
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#71spark-31593ss-remove-unnecessary-streaming-query-progress-update-58--7>[3.0][SPARK-31593][SS]
>>> Remove unnecessary streaming query progress update (+58, -7)>
>>> <https://github.com/apache/spark/commit/1e40bccf447dccad9d31bccc75d21b8fca77ba52>
>>>
>>> The PR fixes a bug that sets incorrect metrics in Structured Streaming.
>>> We should make a progress update every 10 seconds when a stream doesn't
>>> have any new data upstream. Without the fix, we zero out the input
>>> information but not the output information when making the progress update.
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#70spark-31990ss-use-tosettoseq-in-datasetdropduplicates-3--1>[3.0][SPARK-31990][SS]
>>> Use toSet.toSeq in Dataset.dropDuplicates (+3, -1)>
>>> <https://github.com/apache/spark/commit/7f7b4dd5199e7c185aedf51fccc400c7072bed05>
>>>
>>> The PR proposes to preserve the input order of colNames for groupCols
>>> in Dataset.dropDuplicates, because Structured Streaming's state store
>>> depends on the groupCols order.
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-24634ss-add-a-new-metric-regarding-number-of-inputs-later-than-watermark-plus-allowed-delay-94--29>[3.1][SPARK-24634][SS]
>>> Add a new metric regarding number of inputs later than watermark plus
>>> allowed delay (+94, -29)>
>>> <https://github.com/apache/spark/commit/84815d05503460d58b85be52421d5923474aa08b>
>>>
>>> Add a new metric numLateInputs to count the number of inputs which are
>>> later than the watermark ('inputs' are relative to operators). The new
>>> metric will be provided both on the Spark UI (SQL tab, query execution
>>> details page) and in the Streaming Query Listener.
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#python>
>>> PYTHON
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#api70spark-31895pythonsql-support-dataframeexplainextended-str-case-to-be-consistent-with-scala-side-24--11>[API][3.0][SPARK-31895][PYTHON][SQL]
>>> Support DataFrame.explain(extended: str) case to be consistent with Scala
>>> side (+24, -11)>
>>> <https://github.com/apache/spark/commit/e1d52011401c1989f26b230eb8c82adc63e147e7>
>>>
>>> This PR improves DataFrame.explain in PySpark so that it also accepts the
>>> explain mode as a string, consistent with the Scala API.
>>>
>>> [3.0][SPARK-31915][SQL][PYTHON] Resolve the grouping column properly per
>>> the case sensitivity in grouped and cogrouped pandas UDFs (+37, -8)>
>>> <https://github.com/apache/spark/commit/00d06cad564d5e3e5f78a687776d02fe0695a861>
>>>
>>> The PR proposes to resolve the grouping attributes separately first, so
>>> that they can be referred to without ambiguity when FlatMapGroupsInPandas
>>> and FlatMapCoGroupsInPandas are resolved. Example:
>>>
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>>
>>> df = spark.createDataFrame([[1, 1]], ["column", "Score"])
>>>
>>> @pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP)
>>> def my_pandas_udf(pdf):
>>>     return pdf.assign(Score=0.5)
>>>
>>> df.groupby('COLUMN').apply(my_pandas_udf).show()
>>>
>>> df1 = spark.createDataFrame([(1, 1)], ("column", "value"))
>>> df2 = spark.createDataFrame([(1, 1)], ("column", "value"))
>>> df1.groupby("COLUMN").cogroup(
>>>     df2.groupby("COLUMN")
>>> ).applyInPandas(lambda r, l: r + l, df1.schema).show()
>>>
>>> Before:
>>>
>>> pyspark.sql.utils.AnalysisException: Reference 'COLUMN' is ambiguous, could 
>>> be: COLUMN, COLUMN.;
>>>
>>> pyspark.sql.utils.AnalysisException: cannot resolve '`COLUMN`' given input 
>>> columns: [COLUMN, COLUMN, value, value];;
>>> 'FlatMapCoGroupsInPandas ['COLUMN], ['COLUMN], <lambda>(column#9L, 
>>> value#10L, column#13L, value#14L), [column#22L, value#23L]
>>> :- Project [COLUMN#9L, column#9L, value#10L]
>>> :  +- LogicalRDD [column#9L, value#10L], false
>>> +- Project [COLUMN#13L, column#13L, value#14L]
>>>    +- LogicalRDD [column#13L, value#14L], false
>>>
>>> After:
>>>
>>> +------+-----+
>>> |column|Score|
>>> +------+-----+
>>> |     1|  0.5|
>>> +------+-----+
>>>
>>> +------+-----+
>>> |column|value|
>>> +------+-----+
>>> |     2|    2|
>>> +------+-----+
>>>
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-31945sqlpyspark-enable-cache-for-the-same-python-function-25--4>[3.1][SPARK-31945][SQL][PYSPARK]
>>> Enable cache for the same Python function (+25, -4)>
>>> <https://github.com/apache/spark/commit/032d17933b4009ed8a9d70585434ccdbf4d1d7df>
>>>
>>> This PR proposes to make PythonFunction hold Seq[Byte] instead of
>>> Array[Byte], so that two functions can be compared by the values of their
>>> serialized bytes. With this change, the cache manager detects when the
>>> same Python function is used again and reuses the existing cache for it.
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-31964python-use-pandas-is_categorical-on-arrow-category-type-conversion-2--5>[3.1][SPARK-31964][PYTHON]
>>> Use Pandas is_categorical on Arrow category type conversion (+2, -5)>
>>> <https://github.com/apache/spark/commit/b7ef5294f17d54e7d90e36a4be02e8bd67200144>
>>>
>>> When using PyArrow to convert a Pandas categorical column, use
>>> is_categorical instead of trying to import CategoricalDtype, because
>>> the former is a more stable API.
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#ui>
>>> UI
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#70spark-31903sqlpysparkr-fix-topandas-with-arrow-enabled-to-show-metrics-in-query-ui-4--4>[3.0][SPARK-31903][SQL][PYSPARK][R]
>>> Fix toPandas with Arrow enabled to show metrics in Query UI (+4, -4)>
>>> <https://github.com/apache/spark/commit/632b5bce23c94d25712b43be83252b34ebfd3e72>
>>>
>>> In Dataset.collectAsArrowToR and Dataset.collectAsArrowToPython, since
>>> the code block for serveToStream is run in a separate thread, withAction
>>> finishes as soon as it starts the thread. As a result, it doesn't collect
>>> the metrics of the actual action, and the Query UI shows the plan graph
>>> without metrics. This PR fixes the issue.
>>>
>>> The affected functions are:
>>>
>>>    - collect() in SparkR
>>>    - DataFrame.toPandas() in PySpark
>>>
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#70spark-31886webui-fix-the-wrong-coloring-of-nodes-in-dag-viz-33--3>[3.0][SPARK-31886][WEBUI]
>>> Fix the wrong coloring of nodes in DAG-viz (+33, -3)>
>>> <https://github.com/apache/spark/commit/8ed93c9355bc2af6fe456d88aa693c8db69d0bbf>
>>>
>>> In the Job page and Stage page, nodes associated with "barrier mode" in
>>> the DAG-viz are colored pale green. However, for some types of jobs, nodes
>>> not associated with barrier mode were also colored. This PR fixes that.
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#80spark-29431webui-improve-web-ui--sql-tab-visualization-with-cached-dataframes-46--0>[3.1][SPARK-29431][WEBUI]
>>> Improve Web UI / Sql tab visualization with cached dataframes (+46, -0)>
>>> <https://github.com/apache/spark/commit/e4db3b5b1742b4bdfa32937273e5d07a76cde79b>
>>>
>>> Display the query plan of cached DataFrames in the web UI as well.
>>>
>>> [2.4][SPARK-31967][UI] Downgrade to vis.js 4.21.0 to fix Jobs UI loading
>>> time regression (+49, -86)>
>>> <https://github.com/apache/spark/commit/f535004e14b197ceb1f2108a67b033c052d65bcb>
>>>
>>> Fix a serious performance regression in the web UI by downgrading
>>> vis-timeline-graph2d to 4.21.0.
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-30119webui-support-pagination-for-streaming-tab-259--178>[3.1][SPARK-30119][WEBUI]
>>> Support pagination for streaming tab (+259, -178)>
>>> <https://github.com/apache/spark/commit/9b098f1eb91a5e9f488d573bfeea3f6bfd9b95b3>
>>>
>>> The PR adds pagination support for the streaming tab.
>>>
>>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-10-~-June-16,-2020#80spark-31642followup-fix-sorting-for-duration-column-and-make-status-column-sortable-7--6>[3.1][SPARK-31642][FOLLOWUP]
>>> Fix Sorting for duration column and make Status column sortable (+7, -6)
>>> >
>>> <https://github.com/apache/spark/commit/f5f6eee3045e90e02fc7e999f616b5a021d7c724>
>>>
>>> The PR improves the pagination support in the streaming tab by fixing
>>> the wrong sorting of the Duration column and making the Status column
>>> sortable.
>>>
>>>
>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
