otterc commented on pull request #34934: URL: https://github.com/apache/spark/pull/34934#issuecomment-1016956929
> @otterc I reproduced this issue today and sent an email to you with the logs and Spark confs you requested [here](https://github.com/apache/spark/pull/35076#issuecomment-1010458523). I highly suspect that [SPARK-37675](https://issues.apache.org/jira/browse/SPARK-37675) and [SPARK-37793](https://issues.apache.org/jira/browse/SPARK-37793) share the same root cause. Please let me know if there is anything else I can do.

@pan3793 Would you be able to add these changes and rerun this test?

1. Log the `reduceId` in the iterator for which the assertion fails. Changing the assertion to this will work:

   ```scala
   assert(numChunks > 0, s"zero chunks for $blockId")
   ```

2. In `RemoteBlockPushResolver.finalizeShuffleMerge`, add this condition for the partition's `mapTracker` and `reduceId` to be added to the results:

   ```java
   try {
     // This can throw IOException, which will mark this shuffle partition as not merged.
     partition.finalizePartition();
     if (partition.mapTracker.getCardinality() > 0) { // needs to be added
       bitmaps.add(partition.mapTracker);
       reduceIds.add(partition.reduceId);
       sizes.add(partition.getLastChunkOffset());
     }
   } catch (IOException ioe) {
     logger.warn("Exception while finalizing shuffle partition {}_{} {} {}",
         msg.appId, msg.appAttemptId, msg.shuffleId, partition.reduceId, ioe);
   } finally {
     partition.closeAllFilesAndDeleteIfNeeded(false);
   }
   ```

Please let me know if you can rerun with these changes and share the logs with me. Thank you!
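To illustrate the intent of the `getCardinality() > 0` guard in step 2 — skipping partitions into which no map output was ever merged, so the driver never sees a merged block with zero chunks — here is a minimal, self-contained sketch. It is not the Spark implementation: the `Partition` class is hypothetical, and `java.util.BitSet` stands in for the `RoaringBitmap` map tracker used by `RemoteBlockPushResolver`.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class FinalizeFilterSketch {
    // Hypothetical stand-in for a merged shuffle partition's finalize-time state.
    static class Partition {
        final int reduceId;
        final BitSet mapTracker; // stand-in for the RoaringBitmap of merged map ids

        Partition(int reduceId, int... mergedMapIds) {
            this.reduceId = reduceId;
            this.mapTracker = new BitSet();
            for (int id : mergedMapIds) {
                mapTracker.set(id);
            }
        }
    }

    // Report only partitions whose tracker is non-empty, mirroring the
    // suggested `mapTracker.getCardinality() > 0` guard: a partition with
    // an empty tracker contributes no chunks and is left out of the results.
    static List<Integer> finalizedReduceIds(List<Partition> partitions) {
        List<Integer> reduceIds = new ArrayList<>();
        for (Partition p : partitions) {
            if (p.mapTracker.cardinality() > 0) {
                reduceIds.add(p.reduceId);
            }
        }
        return reduceIds;
    }

    public static void main(String[] args) {
        List<Partition> parts = List.of(
            new Partition(0, 1, 2), // two map outputs merged
            new Partition(1),       // nothing merged: should be skipped
            new Partition(2, 7));   // one map output merged
        System.out.println(finalizedReduceIds(parts)); // [0, 2]
    }
}
```

Without such a guard, the empty partition (`reduceId` 1 above) would still be reported to the driver, which matches the symptom of the `numChunks > 0` assertion failing on the reducer side.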
