yzeng1618 opened a new issue, #10326:
URL: https://github.com/apache/seatunnel/issues/10326

   ### Search before asking
   
   - [x] I had searched in the 
[feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
 and found no similar feature requirement.
   
   
   ### Description
   
   Currently, connector-file-hadoop's HdfsFile source still uses the default 
split behavior: one file -> one split. When the number of files is small but a 
single file is huge (tens of GB), read parallelism cannot scale, so the job 
effectively reads with a concurrency of one.
   
   connector-file-local already added large-file splitting support in PR 
https://github.com/apache/seatunnel/pull/10142 (the split strategy is selected 
by config: row-delimiter split for Text/CSV/JSON, RowGroup split for Parquet). 
However, HdfsFile is not yet covered.
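
   For Text/CSV/JSON, the row-delimiter alignment is the same idea Hadoop's 
`LineRecordReader` uses: every split except the first discards its (possibly 
partial) first line, and the line that straddles a split boundary is read by 
the preceding split. A minimal self-contained sketch of that rule (class and 
method names are illustrative, not SeaTunnel's actual reader):

   ```java
   import java.nio.charset.StandardCharsets;
   import java.util.ArrayList;
   import java.util.List;

   // Sketch only: demonstrates delimiter-aligned splitting so that no row is
   // broken, duplicated, or lost across split boundaries.
   public class DelimiterAlignedSplit {

       // Read the rows belonging to the byte range [start, end) of `file`,
       // using '\n' as the row delimiter.
       static List<String> readSplit(byte[] file, int start, int end) {
           List<String> rows = new ArrayList<>();
           int pos = start;
           if (start > 0) {
               // Discard up to and including the first delimiter: the
               // previous split is responsible for this line.
               while (pos < file.length && file[pos++] != '\n') { }
           }
           // A line whose first byte is at or before `end` belongs here, even
           // if it extends past `end` (the next split will skip it).
           while (pos <= end && pos < file.length) {
               StringBuilder line = new StringBuilder();
               while (pos < file.length && file[pos] != '\n') {
                   line.append((char) file[pos++]);
               }
               pos++; // step over the delimiter (or past EOF)
               rows.add(line.toString());
           }
           return rows;
       }

       public static void main(String[] args) {
           byte[] data = "aaa\nbbbb\ncc\nddddd\n".getBytes(StandardCharsets.UTF_8);
           // Boundary at byte 9 falls right after "bbbb\n"; "cc" starts at 9,
           // so it belongs to the first split and is skipped by the second.
           System.out.println(readSplit(data, 0, 9));           // [aaa, bbbb, cc]
           System.out.println(readSplit(data, 9, data.length)); // [ddddd]
       }
   }
   ```

   Together the two splits emit every row exactly once, which is the 
no-broken-lines / no-duplicates guarantee the split strategy needs.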
   
   ### Usage Scenario
   
   1. Ingest single / few extremely large files (CSV / plain log / NDJSON, tens 
of GB) stored in HDFS.
   2. Current behavior: only one split is generated per file, so only one 
reader does the work even if `env.parallelism` is high.
   3. Expected behavior: when enable_file_split=true, split the large file into 
multiple splits and read in parallel:
   
   - Text/CSV/JSON: split by file_split_size and align to row_delimiter (no 
broken rows, no duplicated or missing records).
   - Parquet: split by RowGroup (one RowGroup per split, or pack RowGroups 
together by size).
   
   ### Related issues
   
   https://github.com/apache/seatunnel/issues/10129
   
   ### Are you willing to submit a PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

