yzeng1618 commented on PR #10332: URL: https://github.com/apache/seatunnel/pull/10332#issuecomment-3758519598
> Thanks for implementing this important feature! The overall approach looks solid, but I found 2 CRITICAL issues that should be addressed before merging: > > **1. Poor error observability (GEN-002)** `AccordingToSplitSizeSplitStrategy.java:107-109,115-117` - All IOExceptions are mapped to `FILE_READ_FAILED`, losing distinction between FileNotFoundException, permission denied, network timeout, etc. This makes troubleshooting difficult. Consider catching specific exception types and mapping them to appropriate error codes. > > **2. Severe performance regression for large file splits (GEN-003)** `AbstractReadStrategy.java:510-521` - `safeSlice()` uses `InputStream.skip()` in a loop, which reads byte-by-byte. For HDFS files with non-first splits (e.g., start=10GB), this causes massive network/disk I/O just to reach the offset. HDFS supports `FSDataInputStream.seek()`, which should be used instead. Without this fix, enabling splits on large HDFS files will be slower than not splitting at all. > > **Note on COR-001**: The idempotency concern in `addSplitsBack()` couldn't be fully verified from code review. The FileSourceSplit.equals() implementation suggests duplicates would be detected, but runtime behavior should be validated post-merge. The repair has been completed as required. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
