viirya commented on pull request #35994: URL: https://github.com/apache/spark/pull/35994#issuecomment-1082465493
> Is it a unique issue of HDFS? If so, I’m surprised that HDFS client cannot survive from transient errors. Is Spark the right layer to fix this issue?

The error happened when opening files on S3. I'm not deeply familiar with the HDFS or S3 client internals, but I don't see any retrying happening there. We already have similar retry mechanisms in several places in Spark. If you ask me whether Spark is exactly the right layer to fix this, I'm not sure, but it seems like a consistent approach.

> In addition, will the same issue happen after `open`? For example, when reading the file content? Do we need to worry about other places as well?

This is not the only place that touches `FileSystem` in the driver, so I don't exclude that possibility. Currently the retry is constrained to `open` only, since that is where the issue occurs. It also seems like the safest place to retry, as I don't want to change behavior unexpectedly.

> Could you also add a unit test to verify the retry code? For example, you can use a fake file system to simulate the errors from `open`.

Okay, I'll try to add one.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
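The retry-on-`open` idea discussed above, and the suggested fake-file-system unit test, could be sketched roughly as follows. This is a minimal illustration, not the actual Spark patch: the names `FlakyOpener` and `openWithRetry` are hypothetical, and a stub class stands in for Hadoop's `FileSystem`.

```java
import java.io.IOException;

public class RetryOpenSketch {

    /**
     * A stand-in for FileSystem.open(): throws a transient IOException a fixed
     * number of times, then succeeds. A unit test can use this to simulate
     * errors from open(), as the reviewer suggested.
     */
    static class FlakyOpener {
        private int failuresLeft;

        FlakyOpener(int failures) {
            this.failuresLeft = failures;
        }

        String open(String path) throws IOException {
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new IOException("transient error opening " + path);
            }
            return "stream:" + path;
        }
    }

    /**
     * Retry open() up to maxAttempts times, rethrowing the last failure if
     * every attempt fails. The retry is deliberately limited to open(), in the
     * spirit of the change discussed: nothing after open() is touched.
     */
    static String openWithRetry(FlakyOpener fs, String path, int maxAttempts)
            throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fs.open(path);
            } catch (IOException e) {
                last = e; // remember the failure and retry until attempts run out
            }
        }
        throw last; // maxAttempts is assumed >= 1, so last is non-null here
    }

    public static void main(String[] args) throws IOException {
        // Two transient failures with three attempts: open should succeed.
        FlakyOpener fs = new FlakyOpener(2);
        System.out.println(openWithRetry(fs, "s3://bucket/key", 3));
    }
}
```

A real test in Spark would instead register a custom `FileSystem` implementation whose `open` throws, then assert the caller still gets a usable stream; the fake above just shows the shape of that test.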
