Re: [PR] [Feature][Connector-V2][HdfsFile] Support true large-file split for parallel read [seatunnel]

via GitHub Thu, 15 Jan 2026 23:18:46 -0800


yzeng1618 commented on PR #10332:
URL: https://github.com/apache/seatunnel/pull/10332#issuecomment-3758519598


   > Thanks for implementing this important feature! The overall approach looks 
solid, but I found 2 CRITICAL issues that should be addressed before merging:
   > 
   > **1. Poor error observability (GEN-002)** 
`AccordingToSplitSizeSplitStrategy.java:107-109,115-117` - All IOExceptions are 
mapped to `FILE_READ_FAILED`, losing distinction between FileNotFoundException, 
permission denied, network timeout, etc. This makes troubleshooting difficult. 
Consider catching specific exception types and mapping them to appropriate 
error codes.
   > 
   > **2. Severe performance regression for large file splits (GEN-003)** 
`AbstractReadStrategy.java:510-521` - `safeSlice()` uses `InputStream.skip()` 
in a loop, which reads byte-by-byte. For HDFS files with non-first splits 
(e.g., start=10GB), this causes massive network/disk I/O just to reach the 
offset. HDFS supports `FSDataInputStream.seek()`, which should be used instead. 
Without this fix, enabling splits on large HDFS files will be slower than not 
splitting at all.
   > 
   > **Note on COR-001**: The idempotency concern in `addSplitsBack()` couldn't 
be fully verified from code review. The FileSourceSplit.equals() implementation 
suggests duplicates would be detected, but runtime behavior should be validated 
post-merge.
   
   The repair has been completed as required.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [Feature][Connector-V2][HdfsFile] Support true large-file split for parallel read [seatunnel]

Reply via email to