otterc commented on a change in pull request #30433:
URL: https://github.com/apache/spark/pull/30433#discussion_r528907429
##########
File path:
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##########
@@ -827,13 +833,16 @@ void resetChunkTracker() {
void updateChunkInfo(long chunkOffset, int mapIndex) throws IOException {
long idxStartPos = -1;
try {
- // update the chunk tracker to meta file before index file
- writeChunkTracker(mapIndex);
idxStartPos = indexFile.getFilePointer();
       logger.trace("{} shuffleId {} reduceId {} updated index current {} updated {}",
         appShuffleId.appId, appShuffleId.shuffleId, reduceId, this.lastChunkOffset, chunkOffset);
-      indexFile.writeLong(chunkOffset);
+      indexFile.write(Longs.toByteArray(chunkOffset));
+      // Chunk bitmap should be written to the meta file after the index file because if there are
+      // any exceptions during writing the offset to the index file, meta file should not be
+      // updated. If the update to the index file is successful but the update to meta file isn't
+      // then the index file position is reset in the catch clause.
+      writeChunkTracker(mapIndex);
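The reordering above (index write first, meta write second, with the index position rolled back on failure) can be sketched as a standalone demo. This is a minimal illustration, not Spark's actual code: `writeMeta` is a hypothetical stand-in for `writeChunkTracker`, and a plain `RandomAccessFile` stands in for the resolver's index file handle.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical demo of the write-ordering-with-rollback pattern; names and
// signatures are illustrative and do not match RemoteBlockPushResolver.
public class ChunkUpdateDemo {
  // Stand-in for writeChunkTracker: optionally simulates a meta file failure.
  static void writeMeta(boolean fail) throws IOException {
    if (fail) throw new IOException("simulated meta file failure");
  }

  // Attempts one chunk update and returns the index file length afterwards.
  static long updateChunkInfo(RandomAccessFile indexFile, long chunkOffset, boolean metaFails)
      throws IOException {
    long idxStartPos = -1;
    try {
      idxStartPos = indexFile.getFilePointer();
      indexFile.writeLong(chunkOffset);   // index first
      writeMeta(metaFails);               // meta second: never runs if the index write failed
    } catch (IOException ioe) {
      if (idxStartPos != -1) {
        // Roll back the partial index update so index and meta stay consistent.
        indexFile.seek(idxStartPos);
        indexFile.setLength(idxStartPos);
      }
      throw ioe;
    }
    return indexFile.length();
  }

  public static void main(String[] args) throws IOException {
    File f = File.createTempFile("index", ".tmp");
    f.deleteOnExit();
    try (RandomAccessFile indexFile = new RandomAccessFile(f, "rw")) {
      System.out.println(updateChunkInfo(indexFile, 1024L, false)); // 8
      try {
        updateChunkInfo(indexFile, 2048L, true);
      } catch (IOException expected) {
        // meta write failed; the 8 bytes written for this chunk were rolled back
      }
      System.out.println(indexFile.length()); // still 8
    }
  }
}
```

Either write failing leaves both files reflecting the same last successful chunk, which is the invariant the reordered code is after.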
Review comment:
@Victsm How often does the case you described happen? How does this cleanup script differentiate shuffle data from temporary user data, and how quickly does it remove this data? This assumes the cluster is set up in a certain way.
Would you agree that the more common cases are non-recoverable disk failures, permission issues, file corruption, etc.?
If yes, then there will be continuous IOExceptions while updating these index/meta files. So if we just let the chunk grow when the index/meta files couldn't be updated, in most cases the executors will fail to fetch these chunks.
In the non-push shuffle mechanism, if writing to the index file fails, does the executor retry writing to the same file locally? I understand that in the push mechanism we are updating the index/meta files continuously, but these are also just local files.
> I think that policy should require more than 2 consecutive IOExceptions
during chunk metadata write to trigger.
> Note that the same issue could affect writing the data file as well.
A single IOException while updating the index/meta files should be enough to signal the client to stop pushing. Yes, it affects writing the data file as well. The reason I am not bringing that up is that in this followup fix we are focusing on the exceptions while updating metadata. If others think that we should address failures during writes to the data file as well in this followup fix, we can address it.
> If we want to address this issue holistically, it would require some state
tracking.
If so, I'd rather track it on the client side instead of the server side. We are already tracking the state of a shuffle partition in `AppShufflePartitionInfo`; adding a few other flags to it shouldn't be an issue.
I would like to know the thoughts of other folks as well.
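To make the "single IOException is enough" idea concrete, here is a hypothetical sketch: a partition-state flag that is set on the first metadata write failure so that later pushes for that partition are rejected early. The field name `indexMetaUpdateFailed` and the `handlePush` helper are invented for illustration; they are not fields or methods of the real `AppShufflePartitionInfo`.

```java
// Hypothetical illustration of server-side partition state tracking;
// not Spark's actual AppShufflePartitionInfo.
public class PartitionStateDemo {
  static class AppShufflePartitionInfo {
    // Assumed flag: set on the first IOException while updating index/meta files.
    boolean indexMetaUpdateFailed = false;
  }

  // Simulates handling one pushed block for this partition.
  static String handlePush(AppShufflePartitionInfo partition, boolean ioFails) {
    if (partition.indexMetaUpdateFailed) {
      return "rejected";                     // tells the client to stop pushing
    }
    if (ioFails) {
      partition.indexMetaUpdateFailed = true; // a single IOException is enough
      return "failed";
    }
    return "accepted";
  }

  public static void main(String[] args) {
    AppShufflePartitionInfo p = new AppShufflePartitionInfo();
    System.out.println(handlePush(p, false)); // accepted
    System.out.println(handlePush(p, true));  // failed
    System.out.println(handlePush(p, false)); // rejected
  }
}
```

The same flag could instead live on the client side, as suggested above; the sketch only shows the state transition, not where it should be kept.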
@mridulm @tgravescs @attilapiros @Ngone51
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]