xuanyuanking commented on a change in pull request #24892: [SPARK-25341][Core]
Support rolling back a shuffle map stage and re-generate the shuffle files
URL: https://github.com/apache/spark/pull/24892#discussion_r296998488
##########
File path: core/src/main/scala/org/apache/spark/network/BlockDataManager.scala
##########
@@ -32,6 +32,12 @@ trait BlockDataManager {
*/
def getBlockData(blockId: BlockId): ManagedBuffer
+ /**
+ * Interface to get shuffle block data. Throws an exception if the block cannot be found or
+ * cannot be read successfully.
+ */
+ def getShuffleBlockData(blockId: BlockId, shuffleGenerationId: Int): ManagedBuffer
Review comment:
```
nit: shuffleGenerationId should be put before blockId.
```
Copy that. I'll do this in the next commit and also change the parameter order
in `ShuffleBlockResolver.getBlockData`.
```
BTW, can we just define it as def getShuffleBlockData(shuffleId: Int,
shuffleGenerationId: Int, mapId: Int, reducerId: Int)?
```
Yep, that is more natural where it is used in NettyBlockRpcServer, but in
ShuffleBlockFetcherIterator the blockId would then have to be unapplied
(destructured into its fields) and applied (rebuilt) again. WDYT?
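To make the trade-off concrete, here is a minimal sketch of the two candidate signatures. It uses simplified stand-ins rather than the real Spark classes: the `BlockId` hierarchy below is reduced to two cases, the return type is a `String` instead of `ManagedBuffer`, and `SignatureSketch` and `fetchFlattened` are hypothetical names used only for illustration.

```scala
// Simplified stand-ins for illustration; these are NOT the real Spark classes.
sealed trait BlockId { def name: String }

final case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {
  override def name: String = s"shuffle_${shuffleId}_${mapId}_${reduceId}"
}

final case class RDDBlockId(rddId: Int, splitIndex: Int) extends BlockId {
  override def name: String = s"rdd_${rddId}_${splitIndex}"
}

object SignatureSketch {
  // Option A: keep the BlockId whole, with the generation id placed first.
  // A String stands in for ManagedBuffer so the sketch stays self-contained.
  def getShuffleBlockData(shuffleGenerationId: Int, blockId: BlockId): String =
    s"gen=$shuffleGenerationId block=${blockId.name}"

  // Option B: flattened parameters. Here it simply rebuilds ("applies") the
  // block id and delegates to option A.
  def getShuffleBlockData(
      shuffleId: Int,
      shuffleGenerationId: Int,
      mapId: Int,
      reduceId: Int): String =
    getShuffleBlockData(shuffleGenerationId, ShuffleBlockId(shuffleId, mapId, reduceId))

  // A caller that already holds a BlockId has to destructure ("unapply") it
  // before it can call the flattened variant.
  def fetchFlattened(blockId: BlockId, shuffleGenerationId: Int): String =
    blockId match {
      case ShuffleBlockId(shuffleId, mapId, reduceId) =>
        getShuffleBlockData(shuffleId, shuffleGenerationId, mapId, reduceId)
      case other =>
        throw new IllegalArgumentException(s"Not a shuffle block: ${other.name}")
    }
}
```

With option B, a caller that only has a `BlockId` pays for a pattern match and a rebuild on every call, while a caller that already has the individual ids can pass them directly.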
```
why doesn't the putBlockData method need a shuffleGenerationId parameter?
```
I also spent some time on this question during this work. [This
comment](https://github.com/apache/spark/pull/24892#discussion_r296475671)
contains part of the answer. In the original design of `BlockDataManager`,
getBlockData and putBlockData are a pair of operations responsible for getting
and saving local block data. Shuffle block data is managed by
ShuffleBlockResolver, ShuffleWriter, and MapOutputTracker, none of which use
putBlockData.
So, as we discussed in the comment above, once support for the old version is
removed in a later release, we'll get the cleanest semantics for these 3
operations: get/putBlockData are responsible only for local blocks, and
getShuffleBlockData handles fetching shuffle block data.
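The intended split between the three operations can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `BlockId` and `Buffer` are simplified stand-ins, `putBlockData` omits the storage-level and class-tag arguments, and `InMemoryBlockDataManager` and `writeShuffleOutput` are hypothetical names standing in for the real storage and shuffle write paths.

```scala
import scala.collection.mutable

object BlockDataManagerSketch {
  // Hypothetical, simplified stand-ins; not the real Spark types.
  final case class BlockId(name: String)
  type Buffer = Array[Byte]

  trait BlockDataManager {
    // Paired operations for local (non-shuffle) block data.
    def getBlockData(blockId: BlockId): Buffer
    def putBlockData(blockId: BlockId, data: Buffer): Boolean

    // Read-only entry point for shuffle block data. There is no matching put,
    // because shuffle output is written through a separate path.
    def getShuffleBlockData(shuffleGenerationId: Int, blockId: BlockId): Buffer
  }

  class InMemoryBlockDataManager extends BlockDataManager {
    private val localBlocks = mutable.Map.empty[BlockId, Buffer]
    private val shuffleBlocks = mutable.Map.empty[(Int, BlockId), Buffer]

    // Hypothetical stand-in for the shuffle write path, which bypasses putBlockData.
    def writeShuffleOutput(shuffleGenerationId: Int, blockId: BlockId, data: Buffer): Unit =
      shuffleBlocks.update((shuffleGenerationId, blockId), data)

    override def getBlockData(blockId: BlockId): Buffer =
      localBlocks.getOrElse(
        blockId,
        throw new NoSuchElementException(s"Block not found: ${blockId.name}"))

    override def putBlockData(blockId: BlockId, data: Buffer): Boolean = {
      localBlocks.update(blockId, data)
      true
    }

    override def getShuffleBlockData(shuffleGenerationId: Int, blockId: BlockId): Buffer =
      shuffleBlocks.getOrElse(
        (shuffleGenerationId, blockId),
        throw new NoSuchElementException(
          s"Shuffle block not found: ${blockId.name} (generation $shuffleGenerationId)"))
  }
}
```

In this sketch a block written for one generation is not visible under another generation id, which is the kind of isolation the extra parameter is meant to allow.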