Re: [VOTE] Spark 3.1.3 RC3

2022-02-02 Thread Mridul Muralidharan
Hi,

  Minor nit: the tag mentioned under [1] looks like a typo. I used
"v3.1.3-rc3" for my vote (3.2.1 is mentioned in a couple of places; treat
them as 3.1.3 instead).

+1
Signatures, digests, etc. check out fine.
Checked out the tag and built/tested with -Pyarn -Pmesos -Pkubernetes
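As a minimal sketch of the digest half of that check (plain Python; the artifact file name in the comment is illustrative), one could compute a SHA-512 locally and compare it against the published .sha512 file from the -bin/ directory:

```python
import hashlib

def sha512_of(path: str) -> str:
    """Compute the hex SHA-512 of a file, reading in 1 MiB chunks."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare the result against the matching line in the published digest file,
# e.g. spark-3.1.3-bin-hadoop3.2.tgz.sha512 (file name illustrative):
# print(sha512_of("spark-3.1.3-bin-hadoop3.2.tgz"))
```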

Regards,
Mridul

[1] "The tag to be voted on is v3.2.1-rc1" - the commit hash and git URL
are correct.


On Wed, Feb 2, 2022 at 9:30 AM Mridul Muralidharan  wrote:

>
> Thanks Tom !
> I missed [1] (or probably forgot) the 3.1 part of the discussion given it
> centered around 3.2 ...
>
>
> Regards,
> Mridul
>
> [1] https://www.mail-archive.com/dev@spark.apache.org/msg28484.html
>
> On Wed, Feb 2, 2022 at 8:55 AM Thomas Graves  wrote:
>
>> It was discussed doing all the maintenance lines back at beginning of
>> December (Dec 6) when we were talking about release 3.2.1.
>>
>> Tom
>>
>> On Wed, Feb 2, 2022 at 2:07 AM Mridul Muralidharan 
>> wrote:
>> >
>> > Hi Holden,
>> >
>> >   Not that I am against releasing 3.1.3 (given the fixes that have
>> already gone in), but did we discuss releasing it ? I might have missed the
>> thread ...
>> >
>> > Regards,
>> > Mridul
>> >
>> > On Tue, Feb 1, 2022 at 7:12 PM Holden Karau 
>> wrote:
>> >>
>> >> Please vote on releasing the following candidate as Apache Spark
>> version 3.1.3.
>> >>
>> >> The vote is open until Feb. 4th at 5 PM PST (1 AM UTC + 1 day) and
>> >> passes if a majority of +1 PMC votes are cast, with a minimum of 3 +1
>> >> votes.
>> >>
>> >> [ ] +1 Release this package as Apache Spark 3.1.3
>> >> [ ] -1 Do not release this package because ...
>> >>
>> >> To learn more about Apache Spark, please see http://spark.apache.org/
>> >>
>> >> There are currently no open issues targeting 3.1.3 in Spark's JIRA
>> https://issues.apache.org/jira/browse
>> >> (try project = SPARK AND "Target Version/s" = "3.1.3" AND status in
>> (Open, Reopened, "In Progress"))
>> >> at https://s.apache.org/n79dw
>> >>
>> >>
>> >>
>> >> The tag to be voted on is v3.2.1-rc1 (commit
>> >> b8c0799a8cef22c56132d94033759c9f82b0cc86):
>> >> https://github.com/apache/spark/tree/v3.1.3-rc3
>> >>
>> >> The release files, including signatures, digests, etc. can be found at:
>> >> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc3-bin/
>> >>
>> >> Signatures used for Spark RCs can be found in this file:
>> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >>
>> >> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1400/
>> >>
>> >> The documentation corresponding to this release can be found at:
>> >> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc3-docs/
>> >>
>> >> The list of bug fixes going into 3.1.3 can be found at the following
>> URL:
>> >> https://s.apache.org/x0q9b
>> >>
>> >> This release is using the release script in master as of
>> ddc77fb906cb3ce1567d277c2d0850104c89ac25
>> >> The release docker container was rebuilt since the previous version
>> didn't have the necessary components to build the R documentation.
>> >>
>> >> FAQ
>> >>
>> >>
>> >> =
>> >> How can I help test this release?
>> >> =
>> >>
>> >> If you are a Spark user, you can help us test this release by taking
>> >> an existing Spark workload and running on this release candidate, then
>> >> reporting any regressions.
>> >>
>> >> If you're working in PySpark you can set up a virtual env and install
>> >> the current RC and see if anything important breaks; in Java/Scala
>> >> you can add the staging repository to your project's resolvers and test
>> >> with the RC (make sure to clean up the artifact cache before/after so
>> >> you don't end up building with an out-of-date RC going forward).
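As a small (hypothetical) helper for that testing loop, the RC download locations can be derived from the tag so one smoke-test script works for any candidate; the URLs below match the ones given in this vote e-mail, while the exact artifact file names under -bin/ should be taken from the directory listing:

```python
DIST = "https://dist.apache.org/repos/dist/dev/spark"

def rc_locations(version: str, rc: int) -> dict:
    """Build the dist.apache.org URLs for a given Spark RC tag."""
    tag = f"v{version}-rc{rc}"
    return {
        "binaries": f"{DIST}/{tag}-bin/",
        "docs": f"{DIST}/{tag}-docs/",
        "keys": f"{DIST}/KEYS",
    }

locs = rc_locations("3.1.3", 3)
# e.g. pip install <locs["binaries"]>pyspark-3.1.3.tar.gz inside a fresh
# virtual env (the pyspark file name here is illustrative).
```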
>> >>
>> >> ===
>> >> What should happen to JIRA tickets still targeting 3.1.3?
>> >> ===
>> >>
>> >> The current list of open tickets targeted at 3.2.1 can be found at:
>> >> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> >> Version/s" = 3.1.3
>> >>
>> >> Committers should look at those and triage. Extremely important bug
>> >> fixes, documentation, and API tweaks that impact compatibility should
>> >> be worked on immediately. Everything else please retarget to an
>> >> appropriate release.
>> >>
>> >> ==
>> >> But my bug isn't fixed?
>> >> ==
>> >>
>> >> In order to make timely releases, we will typically not hold the
>> >> release unless the bug in question is a regression from the previous
>> >> release. That being said, if there is something that is a regression
>> >> that has not been correctly targeted please ping me or a committer to
>> >> help target the issue.
>> >>
>> >> ==
>> >> What happened to RC1 & RC2?
>> >> ==
>> >>
>> >> When I first went to build RC1 the build process failed due to the
>> >> lack of the R markdown package in my local rm container. By the time
>> >> I had time to debug and rebuild there was already another bug fix commit in
>> >> branch-3.1 so I decided to skip ahead to RC2 and pick it up directly.
>> >> When I went to go send the RC2 vote e-mail I noticed a correctness issue had
>> >> been fixed in branch-3.1 so I rolled RC3 to contain the correctness fix.

Re: MetadataFetchFailedException due to decommission block migrations

2022-02-02 Thread Dongjoon Hyun
Thank you for sharing, Emil.

> I am willing to help develop a fix, but might need some guidance on
> how this case could be handled better.

Could you file an official Apache JIRA for your finding and
propose a PR for that too with the test case? We can continue
our discussion on your PR.

Dongjoon.





Re: [VOTE] Spark 3.1.3 RC3

2022-02-02 Thread Mridul Muralidharan
Thanks Tom !
I missed [1] (or probably forgot) the 3.1 part of the discussion given it
centered around 3.2 ...


Regards,
Mridul

[1] https://www.mail-archive.com/dev@spark.apache.org/msg28484.html



Re: [VOTE] Spark 3.1.3 RC3

2022-02-02 Thread Thomas Graves
It was discussed doing all the maintenance lines back at the beginning of
December (Dec 6) when we were talking about the 3.2.1 release.

Tom


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Spark 3.1.3 RC3

2022-02-02 Thread Sean Owen
+1 from me, same result as the last release on my end.
I think releasing 3.1.3 is fine; it's been 7 months since 3.1.2.




MetadataFetchFailedException due to decommission block migrations

2022-02-02 Thread Emil Ejbyfeldt
As noted in SPARK-34939 there is a race when using broadcast for map
output status. Explanation from SPARK-34939:


> After map statuses are broadcasted and the executors obtain 
serialized broadcasted map statuses. If any fetch failure happens after, 
Spark scheduler invalidates cached map statuses and destroy broadcasted 
value of the map statuses. Then any executor trying to deserialize 
serialized broadcasted map statuses and access broadcasted value, 
IOException will be thrown. Currently we don't catch it in 
MapOutputTrackerWorker and above exception will fail the application.


But if running with `spark.decommission.enabled=true` and 
`spark.storage.decommission.shuffleBlocks.enabled=true` there is another 
way to hit this race, when a node is decommissioning and the shuffle 
blocks are migrated. After a block has been migrated an update will be 
sent to the driver for each block and the map output caches will be 
invalidated.
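For completeness, here are the two settings in question as one might pass them when building the session or submitting the job (a sketch only; any other decommission-related settings are omitted):

```python
# The configuration pair from this report, e.g. for --conf flags on
# spark-submit or SparkSession.builder.config() (sketch, not a full
# submit command).
decommission_conf = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
}
```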


Here is the driver log from when we hit the race condition, running Spark 3.2.0:

2022-01-28 03:20:12,409 INFO memory.MemoryStore: Block broadcast_27 
stored as values in memory (estimated size 5.5 MiB, free 11.0 GiB)
2022-01-28 03:20:12,410 INFO spark.ShuffleStatus: Updating map output 
for 192108 to BlockManagerId(760, ip-10-231-63-204.ec2.internal, 34707, 
None)
2022-01-28 03:20:12,410 INFO spark.ShuffleStatus: Updating map output 
for 179529 to BlockManagerId(743, ip-10-231-34-160.ec2.internal, 44225, 
None)
2022-01-28 03:20:12,414 INFO spark.ShuffleStatus: Updating map output 
for 187194 to BlockManagerId(761, ip-10-231-43-219.ec2.internal, 39943, 
None)
2022-01-28 03:20:12,415 INFO spark.ShuffleStatus: Updating map output 
for 190303 to BlockManagerId(270, ip-10-231-33-206.ec2.internal, 38965, 
None)
2022-01-28 03:20:12,416 INFO spark.ShuffleStatus: Updating map output 
for 192220 to BlockManagerId(270, ip-10-231-33-206.ec2.internal, 38965, 
None)
2022-01-28 03:20:12,416 INFO spark.ShuffleStatus: Updating map output 
for 182306 to BlockManagerId(688, ip-10-231-43-41.ec2.internal, 35967, None)
2022-01-28 03:20:12,417 INFO spark.ShuffleStatus: Updating map output 
for 190387 to BlockManagerId(772, ip-10-231-55-173.ec2.internal, 35523, 
None)
2022-01-28 03:20:12,417 INFO memory.MemoryStore: Block 
broadcast_27_piece0 stored as bytes in memory (estimated size 4.0 MiB, 
free 10.9 GiB)
2022-01-28 03:20:12,417 INFO storage.BlockManagerInfo: Added 
broadcast_27_piece0 in memory on ip-10-231-63-1.ec2.internal:34761 
(size: 4.0 MiB, free: 11.0 GiB)
2022-01-28 03:20:12,418 INFO memory.MemoryStore: Block 
broadcast_27_piece1 stored as bytes in memory (estimated size 1520.4 
KiB, free 10.9 GiB)
2022-01-28 03:20:12,418 INFO storage.BlockManagerInfo: Added 
broadcast_27_piece1 in memory on ip-10-231-63-1.ec2.internal:34761 
(size: 1520.4 KiB, free: 11.0 GiB)
2022-01-28 03:20:12,418 INFO spark.MapOutputTracker: Broadcast 
outputstatuses size = 416, actual size = 5747443
2022-01-28 03:20:12,419 INFO spark.ShuffleStatus: Updating map output 
for 153389 to BlockManagerId(154, ip-10-231-42-104.ec2.internal, 44717, 
None)
2022-01-28 03:20:12,419 INFO broadcast.TorrentBroadcast: Destroying 
Broadcast(27) (from updateMapOutput at BlockManagerMasterEndpoint.scala:594)
2022-01-28 03:20:12,427 INFO storage.BlockManagerInfo: Added 
rdd_65_20310 on disk on ip-10-231-32-25.ec2.internal:40657 (size: 77.6 MiB)
2022-01-28 03:20:12,427 INFO storage.BlockManagerInfo: Removed 
broadcast_27_piece0 on ip-10-231-63-1.ec2.internal:34761 in memory 
(size: 4.0 MiB, free: 11.0 GiB)


While the Broadcast is being constructed we have updates coming in, and 
the broadcast is destroyed almost immediately. On this particular job we 
ended up hitting the race condition many times; it caused ~18 
task failures and stage retries within 20 seconds, causing us to hit our 
stage retry limit and the job to fail.


As far as I understand, this is the expected behavior for handling this 
case after SPARK-34939. But it seems that, when combined with 
decommissioning, hitting the race is a bit too common.
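To make the failure mode concrete, here is a minimal sketch in plain Python (no Spark; the class and function names are illustrative, not Spark's real API) of the pattern: an executor still holds a reference to a broadcast that the driver destroys after a migration update, and one conceivable softening is to fall back to re-fetching fresh statuses instead of failing the task:

```python
class Broadcast:
    """Toy stand-in for a broadcast variable (not Spark's API)."""
    def __init__(self, value):
        self._value = value
        self.destroyed = False

    def value(self):
        if self.destroyed:
            # Spark surfaces an IOException when a destroyed broadcast is
            # accessed; we model that with IOError here.
            raise IOError("Attempted to use a destroyed broadcast")
        return self._value


def map_statuses(bc, refetch):
    """Read map statuses from the broadcast, falling back to a fresh
    fetch from the driver if the broadcast was destroyed under us."""
    try:
        return bc.value()
    except IOError:
        return refetch()


bc = Broadcast({"map_0": "executor-1"})
bc.destroyed = True  # driver invalidates the cache after a block migration
statuses = map_statuses(bc, refetch=lambda: {"map_0": "executor-2"})
```

A real fix would presumably live in the worker-side retry path; this only illustrates why catching the destroyed-broadcast error and re-fetching, rather than failing the task, would avoid burning stage retries.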


Anyone else running into something similar?

I am willing to help develop a fix, but might need some guidance on 
how this case could be handled better.





Re: [VOTE] Spark 3.1.3 RC3

2022-02-02 Thread Mridul Muralidharan
Hi Holden,

  Not that I am against releasing 3.1.3 (given the fixes that have already
gone in), but did we discuss releasing it? I might have missed the thread...

Regards,
Mridul
