[ https://issues.apache.org/jira/browse/SPARK-35546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ye Zhou updated SPARK-35546:
----------------------------
Summary: Properly handle race conditions in RemoteBlockPushResolver to
support push based shuffle with multiple app attempts enabled (was: Properly
handle race conditions in RemoteBlockPushResolver for access to the internal
ConcurrentHashMaps with multiple app attempts enabled)
> Properly handle race conditions in RemoteBlockPushResolver to support push
> based shuffle with multiple app attempts enabled
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-35546
> URL: https://issues.apache.org/jira/browse/SPARK-35546
> Project: Spark
> Issue Type: Sub-task
> Components: Shuffle
> Affects Versions: 3.1.0
> Reporter: Ye Zhou
> Priority: Major
>
> In the current implementation of RemoteBlockPushResolver, two
> ConcurrentHashMaps are used to store #1 applicationId ->
> mergedShuffleLocalDirPath and #2 applicationId+attemptId+shuffleID ->
> mergedShufflePartitionInfo. Four types of messages (ExecutorRegister,
> PushBlocks, FinalizeShuffleMerge and ApplicationRemove) trigger different
> types of operations on these two hashmaps, so the information stored in
> them must be kept strongly consistent. Otherwise, there will be either data
> corruption/correctness issues or memory leaks in the shuffle server.
> We should come up with a systematic way to resolve this, rather than
> spot-fixing the potential issues.
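
One possible shape for such a systematic approach, offered only as a sketch and not as the actual RemoteBlockPushResolver code: keep all per-application state in a single value object inside one ConcurrentHashMap, and route every message type through the map's atomic compute-style methods, so the attempt id, the local dir path and the partition info can never be observed out of sync. All names below (AppAttemptRegistry, AppState, the method names, and the String standing in for partition metadata) are hypothetical and chosen for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names come from the Spark code base.
final class AppAttemptRegistry {
  // One value object per application: the attempt id, the merge directory and
  // the partition info always change together.
  static final class AppState {
    final int attemptId;
    final String mergedShuffleLocalDirPath;
    // shuffleId -> partition info (a String stands in for the real metadata).
    final Map<Integer, String> partitions = new ConcurrentHashMap<>();

    AppState(int attemptId, String mergedShuffleLocalDirPath) {
      this.attemptId = attemptId;
      this.mergedShuffleLocalDirPath = mergedShuffleLocalDirPath;
    }
  }

  private final ConcurrentHashMap<String, AppState> apps = new ConcurrentHashMap<>();

  // ExecutorRegister: a newer attempt atomically replaces the older attempt's
  // state; a stale attempt is ignored. Returns true if attemptId is current.
  boolean registerExecutor(String appId, int attemptId, String localDir) {
    AppState state = apps.compute(appId, (id, current) ->
        (current == null || attemptId > current.attemptId)
            ? new AppState(attemptId, localDir)
            : current);
    return state.attemptId == attemptId;
  }

  // PushBlocks: accepted only for the currently registered attempt.
  boolean pushBlock(String appId, int attemptId, int shuffleId, String block) {
    boolean[] accepted = {false};
    apps.computeIfPresent(appId, (id, current) -> {
      if (current.attemptId == attemptId) {
        current.partitions.merge(shuffleId, block, (a, b) -> a + "," + b);
        accepted[0] = true;
      }
      return current;
    });
    return accepted[0];
  }

  // FinalizeShuffleMerge: removes and returns the merged info for one shuffle,
  // or null if the attempt is stale or nothing was pushed.
  String finalizeShuffleMerge(String appId, int attemptId, int shuffleId) {
    String[] merged = {null};
    apps.computeIfPresent(appId, (id, current) -> {
      if (current.attemptId == attemptId) {
        merged[0] = current.partitions.remove(shuffleId);
      }
      return current;
    });
    return merged[0];
  }

  // ApplicationRemove: drops all state for the application in one step.
  boolean removeApplication(String appId) {
    return apps.remove(appId) != null;
  }

  public static void main(String[] args) {
    AppAttemptRegistry registry = new AppAttemptRegistry();
    registry.registerExecutor("app1", 1, "/tmp/attempt1");
    registry.pushBlock("app1", 1, 0, "block0");
    registry.registerExecutor("app1", 2, "/tmp/attempt2");
    // A late push from attempt 1 is rejected once attempt 2 has registered.
    System.out.println(registry.pushBlock("app1", 1, 0, "block1"));
  }
}
```

Because a newer attempt replaces the whole AppState, the old attempt's partition info is discarded together with its directory path, which addresses both the stale-data and the leak concern in this simplified model. Real code would also need to clean up files on disk, which is not shown here.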
--
This message was sent by Atlassian Jira
(v8.3.4#803005)