Re: [OSS DIGEST] The major changes of Apache Spark from June 3 to June 16

2020-07-21 Thread Holden Karau
Got it, I missed the date in the reading :)


Re: [OSS DIGEST] The major changes of Apache Spark from June 3 to June 16

2020-07-21 Thread Xingbo Jiang
Hi Holden,

This is the digest for commits merged between *June 3 and June 16.* The
commits you mentioned will be included in future digests.

Cheers,

Xingbo


Re: [OSS DIGEST] The major changes of Apache Spark from June 3 to June 16

2020-07-21 Thread Holden Karau
I'd also add [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are
being shutdown &

[SPARK-21040][CORE] Speculate tasks which are running on decommission
executors, two of the PRs merged after the decommissioning SPIP.


[OSS DIGEST] The major changes of Apache Spark from June 3 to June 16

2020-07-21 Thread Xingbo Jiang
Hi all,

This is the bi-weekly Apache Spark digest from the Databricks OSS team.
For each API/configuration/behavior change, an *[API]* tag is added to the
title.

CORE
[3.0][SPARK-31923][CORE]
Ignore internal accumulators that use unrecognized types rather than
crashing (+63, -5)

A user may name their accumulators with the internal.metrics. prefix, so
that Spark treats them as internal accumulators and hides them from the
UI. We should make JsonProtocol.accumValueToJson more robust and let it
ignore internal accumulators that use unrecognized types.
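
For example, an accumulator registered under that prefix (a contrived
sketch; the accumulator name, data, and local master are made up) takes
the internal-accumulator path in JsonProtocol:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")  // assumed local master for the sketch
      .appName("internal-prefix-demo")
      .getOrCreate()
    val sc = spark.sparkContext

    // A user accumulator whose name collides with Spark's internal
    // "internal.metrics." prefix. Its value type (a java.util.List of
    // strings) is not one JsonProtocol.accumValueToJson recognizes for
    // internal accumulators, which previously crashed event-log
    // serialization; after this change it is ignored instead.
    val names = sc.collectionAccumulator[String]("internal.metrics.myNames")
    sc.parallelize(Seq("a", "b", "c")).foreach(v => names.add(v))
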
[API][3.1][SPARK-31486][CORE]
spark.submit.waitAppCompletion flag to control spark-submit exit in
Standalone Cluster Mode (+88, -26)

This PR implements an application wait mechanism that lets spark-submit
block until the application finishes in Standalone cluster mode: the
spark-submit JVM keeps monitoring the application until it is finished,
failed, or killed. The behavior is controlled via the following conf
(illustrated in the sketch after the list):

   - spark.standalone.submit.waitAppCompletion (Default: false)

     In standalone cluster mode, controls whether the client waits to exit
     until the application completes. If set to true, the client process
     will stay alive, polling the driver's status. Otherwise, the client
     process will exit after submission.
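
As a sketch of passing the flag programmatically (the master URL, jar
path, and main class below are placeholders, not from the PR), the
launcher API forwards it like any other submit-time conf:

    import org.apache.spark.launcher.SparkLauncher

    // Placeholder master/jar/class; the conf line is the point. With the
    // flag set to true, the spawned spark-submit process keeps polling
    // the driver's status instead of exiting right after submission.
    val submit = new SparkLauncher()
      .setMaster("spark://master:7077")
      .setDeployMode("cluster")
      .setAppResource("/path/to/app.jar")
      .setMainClass("com.example.MyApp")
      .setConf("spark.standalone.submit.waitAppCompletion", "true")
      .launch()
    submit.waitFor()  // returns once the app finishes, fails, or is killed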


SQL
[3.0][SPARK-31220][SQL]
repartition obeys initialPartitionNum when adaptiveExecutionEnabled (+27,
-12)

AQE and non-AQE use different configs to set the initial shuffle partition
number. This PR fixes repartition/DISTRIBUTE BY so that it also uses the
AQE config spark.sql.adaptive.coalescePartitions.initialPartitionNum to set
the initial shuffle partition number if AQE is enabled.
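
Concretely (a minimal sketch; the partition count, data, and local master
are arbitrary):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")  // assumed local master for the sketch
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "200")
      .getOrCreate()
    import spark.implicits._

    // With AQE on, this shuffle now starts from the AQE initial
    // partition number (200) rather than spark.sql.shuffle.partitions,
    // and AQE may coalesce the partitions at runtime.
    val df = spark.range(0, 10000).repartition($"id")
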
[3.0][SPARK-31867][SQL][FOLLOWUP]
Check result differences for datetime formatting (+51, -8)

Spark should throw SparkUpgradeException when it hits a DateTimeException
during datetime formatting under the EXCEPTION legacy time parser policy.
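
In session terms (a sketch assuming a spark-shell style session named
spark; only the conf key and the exception class come from the change):

    // EXCEPTION is the default policy in Spark 3.0. Under it, when the
    // new formatter's result would differ from Spark 2.4's for a given
    // pattern and value, Spark raises SparkUpgradeException (pointing at
    // the LEGACY and CORRECTED settings) rather than letting a raw
    // DateTimeException escape.
    spark.conf.set("spark.sql.legacy.timeParserPolicy", "EXCEPTION")
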
[API][3.0][SPARK-31879][SPARK-31892][SQL]
Disable week-based pattern letters in datetime parsing/formatting (+1421,
-171) (+102, -48)

Week-based pattern letters behave very oddly during datetime parsing in
Spark 2.4, and it is very hard to simulate the legacy behaviors with the
new API. For formatting, the new API localizes the start of the week, so
the legacy behaviors cannot be preserved. Since week-based fields are
rarely used, week-based pattern letters are now disabled in both parsing
and formatting.
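
So a pattern like the one below is now rejected up front (a hedged
sketch assuming a spark-shell session; the exact error message may
differ):

    // 'Y' (week-based year) and 'w' (week of week-based year) are no
    // longer accepted; pattern compilation fails fast instead of
    // producing locale-dependent results.
    spark.sql("SELECT date_format(date'2020-06-16', 'YYYY-ww')").show()
    // => throws, reporting that week-based patterns are unsupported
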
[3.0][SPARK-31896][SQL]
Handle am-pm timestamp parsing when hour is missing (+39, -3)

This PR sets the hour field to 0 when the parsed AMPM_OF_DAY field is AM
and to 12 when it is PM, whenever the hour itself is missing from the
pattern, keeping the behavior the same as Spark 2.4.
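
For example (a sketch assuming a spark-shell session; the literal values
are illustrative):

    // The pattern carries an AM/PM marker ('a') but no hour letter, so
    // the hour is filled in as 12 because the marker is PM (it would be
    // 0 for AM), matching Spark 2.4.
    spark.sql("SELECT to_timestamp('2020-06-16 PM', 'yyyy-MM-dd a')").show()
    // => 2020-06-16 12:00:00
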
[API][3.1][SPARK-31830][SQL]
Consistent error handling for datetime formatting and parsing functions
(+126, -580)

When parsing/formatting datetime values, it's better to fail fast if the
pattern string is invalid, instead of returning null for each input record.
The formatting f