h499154897-cmyk opened a new issue, #10192:
URL: https://github.com/apache/seatunnel/issues/10192

   ### Search before asking
   
   - [x] I had searched in the 
[feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
 and found no similar feature requirement.
   
   
   ### Description
   
   SeaTunnel currently focuses on structured/semi-structured data integration 
(e.g., reading text/CSV/JSON files from SFTP and writing content to S3/Ceph). 
However, it cannot pass files through unchanged (whole-file transmission) 
between different file systems or storage protocols. The key limitations are:
   
   1. Binary file incompatibility: SeaTunnel parses files as text/structured 
data by default, which corrupts or garbles binary files (e.g., ZIP archives, 
images, videos, executables).
   2. Loss of original file attributes: The original file name, modification 
time, access permissions, file size, and other metadata cannot be retained 
during transmission.
   3. No whole-file transmission: The current pipeline processes data line by 
line or in batches rather than transmitting the entire file as a single unit, 
which is inefficient for large files.
   4. Limited support for file system protocols: For storage such as Ceph 
(CephFS/RGW), SeaTunnel relies on S3-compatible sinks and cannot interact 
directly with CephFS or other file system protocols for passthrough.
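
Limitation 1 can be reproduced outside SeaTunnel. The following is a minimal Python sketch, not SeaTunnel code: it assumes a text-oriented pipeline that decodes bytes as UTF-8 (substituting undecodable bytes) and re-encodes them, and compares that with a byte-for-byte copy. The function names and the sample payload are made up for illustration.

```python
# Hypothetical illustration; not SeaTunnel code.

def text_roundtrip(payload: bytes) -> bytes:
    """Mimic a text-oriented reader/writer: decode to str, then re-encode."""
    return payload.decode("utf-8", errors="replace").encode("utf-8")


def passthrough(payload: bytes) -> bytes:
    """Mimic passthrough: the bytes are copied and never interpreted."""
    return bytes(payload)


# ZIP local file header signature followed by bytes that are not valid UTF-8.
binary_payload = b"PK\x03\x04" + bytes([0xFF, 0xFE, 0x80, 0x00])

corrupted = text_roundtrip(binary_payload)  # invalid bytes become U+FFFD
intact = passthrough(binary_payload)        # identical to the input
```

Any reader that treats the content as characters has the same problem, which is why the request asks for a mode that never decodes the file.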
   
   Expected Feature (File System Direct Passthrough)
   
   We propose adding a File System Passthrough feature to SeaTunnel that 
transmits whole files directly between different storage protocols without 
parsing or modifying the file content. The core capabilities should include:
   
   1. Support for multiple storage protocols:
   - [ ] Source: SFTP, FTP, Local File System, HDFS, S3 (including Ceph RGW), 
CephFS, etc.
   - [ ] Sink: Ceph (RGW/CephFS), S3, Local File System, HDFS, SFTP, OSS, COS, 
etc.
   
   2. Whole file transmission: Transmit the entire file as a single unit (no 
line-by-line parsing) to support binary files and large files efficiently.
   
   3. Preserve file attributes:
   - [ ] Retain original file names (critical for business scenarios).
   - [ ] Preserve metadata (modification time, access time, file permissions, 
file size, etc.).
   - [ ] Support custom file name mapping (e.g., adding prefixes/suffixes, 
renaming rules) if needed.
   
   4. Batch and incremental transmission:
   - [ ] Support batch transmission of all files in a specified directory 
(including subdirectories).
   - [ ] Support incremental transmission (e.g., only transmit new/modified 
files since the last sync).
   
   5. Filter and control capabilities:
   - [ ] Support file filtering via wildcards (e.g., *.log, data_*.zip) or 
regular expressions.
   - [ ] Support skipping empty files, hidden files, or files larger/smaller 
than a specified size.
   - [ ] Support configurable overwrite policies (e.g., overwrite existing 
files, skip, or append).
   
   6. Seamless integration with existing SeaTunnel pipelines:
   - [ ] Provide a dedicated FilePassthrough Source/Sink plugin (or extend 
existing file connectors with a "passthrough mode").
   - [ ] Allow optional integration with Transform steps (e.g., adding file 
metadata as tags before transmission) for flexible customization.
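
Capabilities 2 to 5 can be sketched with the local file system standing in for both source and sink. This is a rough Python illustration under stated assumptions, not a proposed SeaTunnel API: the function `passthrough_dir`, its parameters, and the two overwrite policies are hypothetical, and remote protocols (SFTP, HDFS, S3, CephFS) would need their own clients in place of `pathlib` and `shutil`.

```python
# Hypothetical sketch; names and policies are illustrative, not SeaTunnel APIs.
import fnmatch
import shutil
from pathlib import Path


def passthrough_dir(src_root: Path, dst_root: Path, pattern: str = "*",
                    overwrite: str = "skip", skip_empty: bool = True) -> list[Path]:
    """Copy matching files under src_root to dst_root without parsing them.

    overwrite: "skip" leaves existing targets alone, "overwrite" replaces them.
    Returns the destination paths that were written.
    """
    if overwrite not in ("skip", "overwrite"):
        raise ValueError(f"unsupported overwrite policy: {overwrite}")
    written = []
    for src in sorted(src_root.rglob("*")):  # recurse into subdirectories
        # Wildcard filter on the file name, e.g. "*.log" or "data_*.zip".
        if not src.is_file() or not fnmatch.fnmatch(src.name, pattern):
            continue
        if skip_empty and src.stat().st_size == 0:
            continue
        dst = dst_root / src.relative_to(src_root)  # keep original name and layout
        if dst.exists() and overwrite == "skip":
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        with src.open("rb") as fin, dst.open("wb") as fout:
            # Stream raw bytes in 1 MiB chunks; the content is never decoded.
            shutil.copyfileobj(fin, fout, length=1024 * 1024)
        shutil.copystat(src, dst)  # copy permission bits and timestamps
        written.append(dst)
    return written
```

For example, `passthrough_dir(Path("/data/in"), Path("/data/out"), pattern="data_*.zip")` would copy only the matching archives and leave existing targets untouched; the paths here are placeholders. Streaming in fixed-size chunks keeps memory use flat for large files, which is the main practical reason to treat the file as one opaque unit.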
   
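   Add a file system passthrough feature that supports whole-file transmission between protocols such as SFTP, local files, HDFS, Ceph (RGW/CephFS), and S3. Core capabilities include:
   Support binary file transmission by passing files through directly, without parsing their content;
   Preserve the original file name, modification time, permissions, and other metadata;
   Support batch directory synchronization, incremental transmission, and file filtering;
   Integrate seamlessly with existing SeaTunnel pipelines, with optional processing of file metadata.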
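
The incremental transmission requested above could be driven by a modification-time watermark. The sketch below is one possible approach in Python, again using the local file system as a stand-in; the function `select_changed` and the watermark scheme are assumptions made for illustration, not an existing SeaTunnel mechanism.

```python
# Hypothetical sketch; the watermark approach is an assumption, not a SeaTunnel design.
from pathlib import Path


def select_changed(root: Path, last_sync: float) -> tuple[list[Path], float]:
    """Return files modified after last_sync and the new watermark.

    last_sync and the returned watermark are POSIX timestamps in seconds.
    """
    changed = []
    watermark = last_sync
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        mtime = path.stat().st_mtime
        if mtime > last_sync:  # new or modified since the previous run
            changed.append(path)
            watermark = max(watermark, mtime)
    return changed, watermark
```

A caller would persist the returned watermark and pass it back in on the next run. Relying on modification time alone misses deletions and files whose timestamps were set into the past, so a real implementation might also compare sizes or checksums.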
   
   ### Usage Scenario
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

