wangyum opened a new pull request, #3395: URL: https://github.com/apache/parquet-java/pull/3395

### Rationale for this change

When reading Parquet files from HDFS, `getFileStatus()` is called twice for each file:

1. During footer reading in `ParquetFileReader.readAllFootersInParallel()`
2. During split generation in `ParquetInputFormat.getSplits()`

The second call is a redundant NameNode RPC. For workloads processing thousands of files, this redundancy significantly increases NameNode load and job startup time.

This PR caches `FileStatus` in the `Footer` object so that split generation reuses the status already fetched during footer reading instead of issuing another RPC per file.

### What changes are included in this PR?

1. **`Footer.java`**: Added a `FileStatus` field with backward-compatible constructors
2. **`ParquetFileReader.java`**: Pass `FileStatus` when creating `Footer` objects
3. **`ParquetInputFormat.java`**: Reuse the cached `FileStatus` instead of calling `fs.getFileStatus()` again
4. **`TestFooterFileStatusCaching.java`**: New test suite with 5 tests demonstrating the RPC reduction

### Are these changes tested?

**Yes.** Added the test suite `TestFooterFileStatusCaching` with 5 test cases:

- ✅ Footer stores and returns FileStatus correctly
- ✅ ParquetFileReader passes FileStatus to Footer
- ✅ Cached FileStatus is reused (saves 3 RPCs in test)
- ✅ Complete workflow verification (saves 5 RPCs in test)
- ✅ Backward compatibility verified

### Are there any user-facing changes?

No.

Closes #3394
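
For readers who want to see the caching idea in isolation, here is a minimal, self-contained sketch. It does not use the real Hadoop or Parquet classes: `Status`, `CountingFs`, `Footer`, `readFooters`, and `totalSplitBytes` are hypothetical stand-ins invented for illustration, and the fallback to a fresh lookup when no status is cached is an assumption about how backward compatibility might be handled, not a description of the actual patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FooterStatusCachingSketch {

    /** Hypothetical stand-in for org.apache.hadoop.fs.FileStatus. */
    static final class Status {
        final String path;
        final long length;

        Status(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Hypothetical file system that counts lookups; each one models a NameNode RPC. */
    static final class CountingFs {
        private int rpcCount = 0;

        Status getFileStatus(String path) {
            rpcCount++;
            return new Status(path, 1024L); // fixed length, enough for the sketch
        }

        int rpcCount() {
            return rpcCount;
        }
    }

    /** Simplified footer that remembers the status fetched while it was read. */
    static final class Footer {
        final String file;
        final Status status; // null when built through the legacy constructor

        Footer(String file) {
            this(file, null); // legacy constructor: nothing cached
        }

        Footer(String file, Status status) {
            this.file = file;
            this.status = status;
        }
    }

    /** Footer-reading phase: one lookup per file, stored in the footer. */
    static List<Footer> readFooters(CountingFs fs, List<String> paths) {
        List<Footer> footers = new ArrayList<>();
        for (String path : paths) {
            footers.add(new Footer(path, fs.getFileStatus(path)));
        }
        return footers;
    }

    /** Split-generation phase: reuse the cached status, look it up only if absent. */
    static long totalSplitBytes(CountingFs fs, List<Footer> footers) {
        long total = 0;
        for (Footer footer : footers) {
            Status status = footer.status != null ? footer.status : fs.getFileStatus(footer.file);
            total += status.length;
        }
        return total;
    }

    public static void main(String[] args) {
        CountingFs fs = new CountingFs();
        List<Footer> footers = readFooters(fs, Arrays.asList("a.parquet", "b.parquet", "c.parquet"));
        long bytes = totalSplitBytes(fs, footers);
        // Three files, three lookups; without the cached status it would be six.
        System.out.println("bytes=" + bytes + ", lookups=" + fs.rpcCount());
    }
}
```

In this sketch, footers built through the one-argument constructor still work, at the cost of one extra lookup each during split generation.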
