yzeng1618 opened a new issue, #10326: URL: https://github.com/apache/seatunnel/issues/10326
### Search before asking

- [x] I had searched in the [feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22) issues and found no similar feature request.

### Description

Currently, connector-file-hadoop's HdfsFile source still uses the default split behavior: one file -> one split. When the number of files is small but a single file is huge (tens of GB), read parallelism cannot scale, so the job effectively reads with single concurrency.

connector-file-local already added large-file splitting support in PR https://github.com/apache/seatunnel/pull/10142 (the split strategy is selected by config: row-delimiter split for Text/CSV/JSON, RowGroup split for Parquet). However, HdfsFile is not covered.

### Usage Scenario

1. Ingest a single file, or a few extremely large files (CSV / plain log / NDJSON, tens of GB), stored in HDFS.
2. Current behavior: only one split is generated per file, so only one reader does work even if `env.parallelism` is high.
3. Expected behavior: when `enable_file_split=true`, split the large file into multiple splits and read them in parallel:
   - Text/CSV/JSON: split by `file_split_size` and align split boundaries to `row_delimiter` (no broken lines, no duplicate or missing rows).
   - Parquet: split by RowGroup (each RowGroup as one split, or pack RowGroups by size).

### Related issues

https://github.com/apache/seatunnel/issues/10129

### Are you willing to submit a PR?

- [x] Yes, I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
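The delimiter-aligned splitting requested for Text/CSV/JSON can be sketched as follows. This is a minimal illustration of the boundary rule (a row belongs to the split that contains its first byte), not SeaTunnel's actual implementation; the names `AlignedTextSplitter`, `planSplits`, `readSplit`, and `Split` are hypothetical, and a real HdfsFile reader would seek within an HDFS input stream instead of holding a `byte[]`:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class AlignedTextSplitter {

    // A split is a raw byte range [start, end) of the file; alignment to
    // the row delimiter happens at read time, so splits need no file I/O
    // during planning.
    record Split(long start, long end) {}

    static List<Split> planSplits(long fileSize, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileSize; start += splitSize) {
            splits.add(new Split(start, Math.min(start + splitSize, fileSize)));
        }
        return splits;
    }

    // Ownership rule: a row belongs to the split containing its FIRST byte.
    // The reader skips a partial row at the front (the previous split reads
    // it) and reads past `end` to finish its own last row, so no row is
    // broken, duplicated, or lost.
    static List<String> readSplit(byte[] file, Split s, byte delimiter) {
        List<String> rows = new ArrayList<>();
        int pos = (int) s.start;
        // If we landed mid-row, advance to the next row start.
        if (pos > 0) {
            while (pos < file.length && file[pos - 1] != delimiter) pos++;
        }
        // Read every row whose first byte falls inside [start, end).
        while (pos < (int) s.end && pos < file.length) {
            StringBuilder row = new StringBuilder();
            while (pos < file.length && file[pos] != delimiter) {
                row.append((char) file[pos++]);
            }
            if (pos < file.length) pos++; // consume the delimiter
            rows.add(row.toString());
        }
        return rows;
    }

    public static void main(String[] args) {
        byte[] bytes = "row1\nrow2\nrow3\nrow4\nrow5\n"
                .getBytes(StandardCharsets.UTF_8);
        List<String> all = new ArrayList<>();
        // A split size deliberately misaligned with row boundaries.
        for (Split s : planSplits(bytes.length, 7)) {
            all.addAll(readSplit(bytes, s, (byte) '\n'));
        }
        System.out.println(all); // [row1, row2, row3, row4, row5]
    }
}
```

Each split can be read independently and in any order, which is what lets a high `env.parallelism` actually scale the read of a single huge file. The Parquet case is simpler: RowGroup offsets come from the file footer, so each RowGroup (or a size-packed run of them) maps directly to one split.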
