wangyum opened a new pull request, #3395:
URL: https://github.com/apache/parquet-java/pull/3395

   ### Rationale for this change
   When reading Parquet files from HDFS, `getFileStatus()` is called twice for 
each file:
   1. During footer reading in `ParquetFileReader.readAllFootersInParallel()` 
   2. During split generation in `ParquetInputFormat.getSplits()`
   
   The second call is a redundant NameNode RPC for every file. For workloads 
that process thousands of files, these extra calls add NameNode load and 
lengthen job startup. 
   This PR caches the `FileStatus` in the `Footer` object so that split 
generation can reuse it instead of calling `getFileStatus()` a second time. 
   
   ### What changes are included in this PR?
   1. **`Footer.java`**: Added `FileStatus` field with backward-compatible 
constructors
   2. **`ParquetFileReader.java`**: Pass `FileStatus` when creating `Footer` 
objects
   3. **`ParquetInputFormat.java`**: Reuse cached `FileStatus` instead of 
calling `fs.getFileStatus()` again
   4. **`TestFooterFileStatusCaching.java`**: New test suite with 5 tests 
proving RPC reduction
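
   The caching pattern described in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: `Status` and `FooterSketch` are hypothetical stand-ins for Hadoop's `FileStatus` and Parquet's `Footer`, and `resolveStatus` is an assumed helper name for the split-generation side. The two constructors show one way the field could be added without breaking existing callers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class FooterStatusCacheSketch {

    /** Hypothetical stand-in for Hadoop's FileStatus: only a path and a length. */
    static final class Status {
        final String path;
        final long length;

        Status(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Hypothetical stand-in for Footer; the real class also carries the Parquet metadata. */
    static final class FooterSketch {
        private final String file;
        private final Status status; // null when built through the legacy-style constructor

        // Legacy-style constructor: callers that have no status keep working.
        FooterSketch(String file) {
            this(file, null);
        }

        // New-style constructor: the reader passes along the status it already fetched.
        FooterSketch(String file, Status status) {
            this.file = file;
            this.status = status;
        }

        String getFile() { return file; }
        Status getStatus() { return status; }
    }

    /**
     * Split-generation side: prefer the cached status and fall back to a
     * lookup (one RPC in a real cluster) only when nothing was cached.
     */
    static Status resolveStatus(FooterSketch footer, Function<String, Status> lookup) {
        Status cached = footer.getStatus();
        return cached != null ? cached : lookup.apply(footer.getFile());
    }

    public static void main(String[] args) {
        List<String> lookups = new ArrayList<>();
        Function<String, Status> lookup = path -> {
            lookups.add(path); // record every simulated RPC
            return new Status(path, 128L);
        };

        FooterSketch cached = new FooterSketch("/data/a.parquet", new Status("/data/a.parquet", 128L));
        FooterSketch legacy = new FooterSketch("/data/b.parquet");

        resolveStatus(cached, lookup); // served from the cached field, no lookup
        resolveStatus(legacy, lookup); // nothing cached, falls back to one lookup

        System.out.println("lookups performed: " + lookups);
    }
}
```

   In this sketch only the footer built without a status triggers a lookup, which mirrors the fallback a backward-compatible constructor would need.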
   
   ### Are these changes tested?
   **Yes.** Added comprehensive test suite `TestFooterFileStatusCaching` with 5 
test cases:
   - ✅ Footer stores and returns FileStatus correctly
   - ✅ ParquetFileReader passes FileStatus to Footer
   - ✅ Cached FileStatus is reused (saves 3 RPCs in test)
   - ✅ Complete workflow verification (saves 5 RPCs in test)
   - ✅ Backward compatibility verified
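
   One way to demonstrate an RPC reduction like the one the tests above report is to count status lookups against a file system stand-in. The sketch below is an assumption about the general approach, not the contents of `TestFooterFileStatusCaching`: `CountingFs` and both helper methods are hypothetical, and the three-file input is arbitrary.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class RpcCountSketch {

    /** Hypothetical file system stand-in that counts status lookups. */
    static final class CountingFs {
        final AtomicInteger statusCalls = new AtomicInteger();

        long getLength(String path) {
            statusCalls.incrementAndGet(); // one simulated NameNode RPC
            return path.length();          // dummy value standing in for file metadata
        }
    }

    /** Both passes ask the file system: two lookups per file. */
    static int lookupsWithoutCache(List<String> files) {
        CountingFs fs = new CountingFs();
        for (String f : files) fs.getLength(f); // footer-reading pass
        for (String f : files) fs.getLength(f); // split-generation pass
        return fs.statusCalls.get();
    }

    /** The first pass stores its result and the second reads it back: one lookup per file. */
    static int lookupsWithCache(List<String> files) {
        CountingFs fs = new CountingFs();
        Map<String, Long> cached = new HashMap<>();
        for (String f : files) cached.put(f, fs.getLength(f)); // footer-reading pass
        for (String f : files) {
            // split-generation pass: served from the cache, no lookup
            if (cached.get(f) == null) throw new IllegalStateException("missing cached entry for " + f);
        }
        return fs.statusCalls.get();
    }

    public static void main(String[] args) {
        List<String> files = List.of("/t/a.parquet", "/t/b.parquet", "/t/c.parquet");
        int before = lookupsWithoutCache(files);
        int after = lookupsWithCache(files);
        // prints "without cache: 6, with cache: 3, saved: 3"
        System.out.println("without cache: " + before + ", with cache: " + after
                + ", saved: " + (before - after));
    }
}
```

   A test against the real classes would wrap a Hadoop `FileSystem` in a similar counting layer; the counter here only illustrates the idea.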
   
   
   ### Are there any user-facing changes?
   No.
   
   Closes #3394
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

