yashtc opened a new pull request, #54647: URL: https://github.com/apache/spark/pull/54647
### What changes were proposed in this pull request?

Schema inference via `mergeSchema` can fail when a file is deleted between the file listing step and the footer-reading step. This is a real race condition in cloud storage environments where file disappearance between listing and reading is common.

The `spark.sql.files.ignoreMissingFiles` option already suppresses `FileNotFoundException` during data reads (`FileScanRDD`) but was silently ignored during schema inference.

This PR propagates `ignoreMissingFiles` through the Parquet and ORC schema inference paths:

- `SchemaMergeUtils.mergeSchemasInParallel`: extracts `ignoreMissingFiles` from `parameters` and passes it as a fourth argument to the `schemaReader` function (type updated accordingly).
- `ParquetFileFormat.readParquetFootersInParallel`: catches exceptions with `FileNotFoundException` anywhere in the cause chain (using `ExceptionUtils.getThrowables`) and skips the file when `ignoreMissingFiles=true`. The cause-chain check is needed because Parquet wraps `IOException` in `RuntimeException`.
- `OrcUtils.readSchema` / `readOrcSchemasInParallel`: catches `FileNotFoundException` directly before the existing `FileFormatException` handler.
- `OrcFileOperator.getFileReader` / `readOrcSchemasInParallel`: same pattern for Hive ORC.

### Why are the changes needed?

Without this fix, any user that sets `mergeSchema=true` on a path with concurrent deletes gets an unrecoverable exception even when they have opted into tolerating missing files via `spark.sql.files.ignoreMissingFiles`.

### Does this PR introduce _any_ user-facing change?

Yes: when `spark.sql.files.ignoreMissingFiles=true`, files that disappear between listing and schema reading are now silently skipped (consistent with the existing behaviour during data reads) instead of causing an error.

### How was this patch tested?
- Unit tests in `ParquetFileFormatSuite`: direct calls to `readParquetFootersInParallel` with a deleted file (local FS), with a `RuntimeException`-wrapped `FileNotFoundException` (via `WrappingFNFLocalFileSystem`), and end-to-end through `mergeSchemasInParallel`.
- Unit tests in `OrcSourceSuite`: direct calls to `OrcUtils.readOrcSchemasInParallel` with a deleted file.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code v2.1.69
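
### Illustrative sketch of the cause-chain check

The skip-on-missing-file behaviour described for the Parquet path can be illustrated with a small, self-contained sketch. This is not the code in the PR: `MissingFileSketch`, `hasFileNotFoundCause`, and `readAllSkippingMissing` are hypothetical names, the hand-written cause-chain walk stands in for `ExceptionUtils.getThrowables`, and the `readFooter` function is a placeholder for whatever actually reads a file footer.

```scala
import java.io.FileNotFoundException
import scala.annotation.tailrec

// Hypothetical helper object; names are illustrative, not Spark APIs.
object MissingFileSketch {

  /** True if `t`, or any exception in its cause chain, is a FileNotFoundException. */
  def hasFileNotFoundCause(t: Throwable): Boolean = {
    // The depth cap is an assumption of this sketch, guarding against cyclic cause chains.
    @tailrec
    def loop(cur: Throwable, depth: Int): Boolean =
      if (cur == null || depth > 64) false
      else if (cur.isInstanceOf[FileNotFoundException]) true
      else loop(cur.getCause, depth + 1)
    loop(t, 0)
  }

  /**
   * Applies `readFooter` to every path. When `ignoreMissingFiles` is true,
   * paths whose read fails with a (possibly wrapped) FileNotFoundException
   * are skipped; every other failure is rethrown unchanged.
   */
  def readAllSkippingMissing[A](paths: Seq[String], ignoreMissingFiles: Boolean)(
      readFooter: String => A): Seq[A] =
    paths.flatMap { path =>
      try Some(readFooter(path))
      catch {
        // The guard keeps unrelated errors (and the flag-off case) fatal.
        case e: Exception if ignoreMissingFiles && hasFileNotFoundCause(e) => None
      }
    }
}
```

Walking the whole cause chain, rather than matching only the outermost exception, is what lets a `RuntimeException` that wraps a `FileNotFoundException` be treated as a missing file, while the flag still leaves all other errors fatal.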
