otterc commented on a change in pull request #30433:
URL: https://github.com/apache/spark/pull/30433#discussion_r528907429



##########
File path: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java
##########
@@ -827,13 +833,16 @@ void resetChunkTracker() {
     void updateChunkInfo(long chunkOffset, int mapIndex) throws IOException {
       long idxStartPos = -1;
       try {
-        // update the chunk tracker to meta file before index file
-        writeChunkTracker(mapIndex);
         idxStartPos = indexFile.getFilePointer();
        logger.trace("{} shuffleId {} reduceId {} updated index current {} updated {}",
          appShuffleId.appId, appShuffleId.shuffleId, reduceId, this.lastChunkOffset,
          chunkOffset);
-        indexFile.writeLong(chunkOffset);
+        indexFile.write(Longs.toByteArray(chunkOffset));
+        // Chunk bitmap should be written to the meta file after the index file because if there are
+        // any exceptions during writing the offset to the index file, meta file should not be
+        // updated. If the update to the index file is successful but the update to meta file isn't
+        // then the index file position is reset in the catch clause.
+        writeChunkTracker(mapIndex);
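
For context, the write ordering under discussion can be sketched roughly as follows. This is a hypothetical, simplified illustration of the pattern, not the actual `RemoteBlockPushResolver` code: the `metaFile.writeUTF(...)` call stands in for `writeChunkTracker(mapIndex)`, and plain `writeLong` is used instead of `Longs.toByteArray`.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: write the index entry first, then the meta entry.
// If the index write throws, the meta file is never touched; if the meta
// write throws, the index file is rolled back to its saved position so the
// two files stay consistent with each other.
public class IndexThenMetaWrite {
  public static void updateChunkInfo(RandomAccessFile indexFile,
                                     RandomAccessFile metaFile,
                                     long chunkOffset) throws IOException {
    long idxStartPos = -1;
    try {
      // Remember where the index write begins so it can be undone.
      idxStartPos = indexFile.getFilePointer();
      indexFile.writeLong(chunkOffset);
      // Meta is written second, only after the index write succeeded.
      metaFile.writeUTF("chunk@" + chunkOffset); // stand-in for writeChunkTracker(mapIndex)
    } catch (IOException e) {
      // Roll back the partial index write so index and meta stay in sync.
      if (idxStartPos != -1) {
        indexFile.setLength(idxStartPos);
        indexFile.seek(idxStartPos);
      }
      throw e;
    }
  }
}
```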

Review comment:
       @Victsm How often does the case you described happen? How does this cleanup script differentiate shuffle data from temporary user data, and how quickly does it remove this data? This assumes the cluster is set up in a certain way.
   Would you agree that the more common cases are non-recoverable disk failures, permission issues, file corruption, etc.?
   If yes, then there will be continuous IOExceptions while updating these index/meta files. So if we just let the chunk grow when the index/meta file couldn't be updated, in most cases the executors will fail to fetch these chunks.
   
   In the non-push shuffle mechanism, if writing to the index file fails, does the executor try multiple times to write to the same file locally? I understand that in the push mechanism we are updating the index/meta files continuously, but these are also just local files.
   
   > I think that policy should require more than 2 consecutive IOExceptions during chunk metadata write to trigger.

   Note that the same issue could affect writing the data file as well.
   
   Just a single IOException while updating the index/meta file should be enough to indicate to the client that it should stop pushing. Yes, it affects writing the data file as well. The reason I am not bringing that up is that in this followup fix we are focusing on the exceptions while updating metadata. If others think that we should address failures during writes to the data file as well in this followup fix, we can address it.
   
   > If we want to address this issue holistically, it would require some state tracking.

   If so, I'd rather track it on the client side instead of the server side.
   
   We are already tracking the state of each shuffle partition in `AppShufflePartitionInfo`. Adding a few other flags to it shouldn't be an issue.
   
   Would like to know the thoughts of other folks as well.
   @mridulm @tgravescs @attilapiros @Ngone51 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
