Re: [PR] [Feature][Connector-V2][HdfsFile] Support true large-file split for parallel read [seatunnel]

via GitHub Thu, 15 Jan 2026 05:04:09 -0800


zhangshenghang commented on PR #10332:
URL: https://github.com/apache/seatunnel/pull/10332#issuecomment-3754691824


   <!-- code-pr-reviewer -->
   ## Blocking Issues
   
   **BLOCKER: Data loss at split boundaries**
   `AccordingToSplitSizeSplitStrategy.findNextDelimiterWithSeek()` (lines 
147-165) returns an incorrect position when the split end aligns with a line 
delimiter. When `endPos >= startPos`, it returns `endPos` directly without 
checking whether that skips the first character of the next split, which 
corrupts or drops data at the boundary in parallel reads.
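
   One common convention that avoids this class of bug is "a split owns every line whose first byte lies inside its nominal range". The sketch below illustrates that convention only; it is not the PR's implementation. The class and method names are hypothetical, and an in-memory byte array stands in for a seekable HDFS stream. The key detail is that the scan for the next delimiter begins at `pos - 1`, so a position that sits immediately after a delimiter is treated as a line start rather than skipped.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; not the PR's AccordingToSplitSizeSplitStrategy. */
final class LineSplitSketch {

    /**
     * Returns the offset of the first line that starts at or after pos.
     * Scanning begins at pos - 1 so that a pos sitting immediately after a
     * delimiter is recognised as a line start instead of being skipped.
     */
    static int alignToLineStart(byte[] data, int pos) {
        if (pos <= 0) {
            return 0;
        }
        for (int i = pos - 1; i < data.length; i++) {
            if (data[i] == '\n') {
                return i + 1;
            }
        }
        return data.length; // no further delimiter: an earlier split owns the tail
    }

    /** Reads the lines owned by the nominal byte range [start, end). */
    static List<String> readSplit(byte[] data, int start, int end) {
        int from = alignToLineStart(data, start);
        int to = alignToLineStart(data, Math.min(end, data.length));
        List<String> lines = new ArrayList<>();
        int lineStart = from;
        for (int i = from; i < to; i++) {
            if (data[i] == '\n') {
                lines.add(new String(data, lineStart, i - lineStart, StandardCharsets.UTF_8));
                lineStart = i + 1;
            }
        }
        if (lineStart < to) { // final line without a trailing delimiter
            lines.add(new String(data, lineStart, to - lineStart, StandardCharsets.UTF_8));
        }
        return lines;
    }

    /** Reads the whole input as consecutive nominal splits of splitSize bytes. */
    static List<String> readAll(byte[] data, int splitSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            out.addAll(readSplit(data, start, start + splitSize));
        }
        return out;
    }
}
```

   Because both ends of a split are aligned with the same function, the aligned end of one split is always the aligned start of the next, so no line is read twice or dropped regardless of where the nominal boundary falls.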
   
   **CRITICAL: Parquet footer reads fail in Kerberos environments**
   `ParquetFileSplitStrategy.getFooter()` (line 133) creates a `new 
Configuration()` instead of using the passed `hadoopConf`. The constructor 
accepts a `HadoopConf` parameter (line 67), but `getFooter()` uses an empty 
configuration without Kerberos or HA settings, which causes startup failures on 
Kerberos-enabled HDFS clusters.
   
   **CRITICAL: Missing Kerberos/HDFS HA validation**
   No E2E tests in `HdfsFileIT.java` cover Kerberos, HA, or NameService 
scenarios. The tests use only the basic `apache/hadoop:3` image, with no 
MiniKdc or Kerberos configuration tests, which puts production Kerberos 
deployments at risk.
   
   **CRITICAL: Backward compatibility unverified**
   There are no regression tests for the default behavior when 
`enable_file_split` is not configured. If the default value is incorrect, 
existing jobs may behave unexpectedly after an upgrade.
   
   **CRITICAL: Missing split boundary edge case tests**
   No tests cover a split boundary that falls on a line delimiter, at the file 
start or end, or inside a multi-byte character. These edge cases may cause data 
duplication or loss, and potentially infinite scans.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
