Re: [PR] [Feature][Connector-V2][HdfsFile] Support true large-file split for parallel read [seatunnel]

via GitHub Thu, 15 Jan 2026 05:37:40 -0800


zhangshenghang commented on PR #10332:
URL: https://github.com/apache/seatunnel/pull/10332#issuecomment-3754886343


   <!-- code-pr-reviewer -->
   Thanks for implementing this important feature! The overall approach looks 
solid, but I found 2 CRITICAL issues that should be addressed before merging:
   
   **1. Poor error observability (GEN-002)**
   `AccordingToSplitSizeSplitStrategy.java:107-109,115-117` - All IOExceptions 
are mapped to `FILE_READ_FAILED`, losing distinction between 
FileNotFoundException, permission denied, network timeout, etc. This makes 
troubleshooting difficult. Consider catching specific exception types and 
mapping them to appropriate error codes.
   
   **2. Severe performance regression for large file splits (GEN-003)**
   `AbstractReadStrategy.java:510-521` - `safeSlice()` uses 
`InputStream.skip()` in a loop, which reads byte-by-byte. For HDFS files with 
non-first splits (e.g., start=10GB), this causes massive network/disk I/O just 
to reach the offset. HDFS supports `FSDataInputStream.seek()`, which should be 
used instead. Without this fix, enabling splits on large HDFS files will be 
slower than not splitting at all.
   
   **Note on COR-001**: The idempotency concern in `addSplitsBack()` couldn't 
be fully verified from code review. The FileSourceSplit.equals() implementation 
suggests duplicates would be detected, but runtime behavior should be validated 
post-merge.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [Feature][Connector-V2][HdfsFile] Support true large-file split for parallel read [seatunnel]

Reply via email to