VincentSleepless opened a new issue, #4729:
URL: https://github.com/apache/incubator-seatunnel/issues/4729

   ### Search before asking
   
   - [X] I had searched in the 
[feature](https://github.com/apache/incubator-seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
 and found no similar feature requirement.
   
   
   ### Description
   
   
   [Feature][Connector-V2 File] connector-file-oss write mode optimization
   
   In the current connector design, all file connectors read and write to 
their specific storage (S3, OSS, local file, FTP, HDFS) through the Hadoop 
FileSystem abstraction.
   
   When a sink writes data to certain storage systems, such as S3 or Aliyun 
OSS, the default policy is to buffer temporary data in a local file to avoid 
memory cost, but this can cause some problems:
   1. The speed of every write task is bound by disk IO performance, 
especially for large-file integration.
   2. Some source connectors checkpoint per split: when a task finishes a 
split, it triggers a checkpoint, and the file sink then uploads the local tmp 
file to the storage tmp directory and renames it to the configured path. The 
total size of the temp files is unpredictable.
   
   When a sink writes data to AWS S3, hadoop-aws can be configured to buffer 
data either on local disk or in memory, so these problems can be avoided.
   https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
   
   
![image](https://github.com/apache/incubator-seatunnel/assets/15063109/f3e4eceb-b5f4-41f8-8546-e20be65b5842)
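
   As a rough illustration of the hadoop-aws options mentioned above, the 
sketch below collects the S3A buffer settings as plain key/value pairs. The 
property names (`fs.s3a.fast.upload.buffer`, `fs.s3a.fast.upload.active.blocks`, 
`fs.s3a.buffer.dir`) come from the hadoop-aws documentation linked above; the 
`S3aBufferSettings` class and its methods are hypothetical and not part of 
SeaTunnel. In a real job these pairs would be set on the Hadoop 
`Configuration` used by the sink.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: collects S3A upload-buffer settings as key/value pairs. */
final class S3aBufferSettings {
    private S3aBufferSettings() {}

    /** Buffer upload blocks in off-heap memory instead of local temp files. */
    static Map<String, String> inMemory() {
        Map<String, String> props = new LinkedHashMap<>();
        // Documented values are "disk", "array" (heap) and "bytebuffer" (off-heap).
        props.put("fs.s3a.fast.upload.buffer", "bytebuffer");
        // Limit the blocks one stream may queue, to bound memory use.
        props.put("fs.s3a.fast.upload.active.blocks", "4");
        return props;
    }

    /** Buffer upload blocks in local files under the given directory. */
    static Map<String, String> onDisk(String bufferDir) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("fs.s3a.fast.upload.buffer", "disk");
        props.put("fs.s3a.buffer.dir", bufferDir);
        return props;
    }
}
```

   Memory buffering trades disk IO for heap or off-heap pressure, so the 
block limit matters when many writers run in one process.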
   
   
   But when a sink writes data to Aliyun OSS, hadoop-aliyun has no 
configuration like the one in hadoop-aws, so all data is buffered on local disk.
   
https://hadoop.apache.org/docs/stable/hadoop-aliyun/tools/hadoop-aliyun/index.html
   
   
![image](https://github.com/apache/incubator-seatunnel/assets/15063109/d0798d5a-5a0e-4978-989d-aee71b7514e5)
   
   
   This problem can also occur in engine checkpoint storage and IMap 
storage.
   
   For data integration, the best approach is to buffer data in memory before 
flushing it, which avoids the IO performance problem. Can we abstract a common 
interface for the input/output streams in file connectors, so that each 
storage can use its own SDK to read and write? Or should we try to support a 
buffer mode in hadoop-aliyun?
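
   To make the first option more concrete, here is a minimal sketch of what 
a common buffering abstraction could look like. Every name in it 
(`StagingBuffer`, `MemoryStagingBuffer`, `DiskStagingBuffer`, 
`StagingBuffers`, the `"memory"`/`"disk"` mode strings) is hypothetical and 
only for discussion; it is not existing SeaTunnel or Hadoop API. It uses only 
the JDK and leaves out the upload step through a storage SDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical abstraction: where a sink stages bytes before upload. */
interface StagingBuffer extends AutoCloseable {
    OutputStream output() throws IOException; // the sink writes records here
    InputStream drain() throws IOException;   // the uploader reads staged bytes
    @Override
    void close() throws IOException;          // release memory or the temp file
}

/** Keeps staged bytes on the heap; no disk IO, but bounded by memory. */
final class MemoryStagingBuffer implements StagingBuffer {
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();

    @Override public OutputStream output() { return bytes; }
    @Override public InputStream drain() { return new ByteArrayInputStream(bytes.toByteArray()); }
    @Override public void close() { bytes.reset(); }
}

/** Keeps staged bytes in a local temp file, like the current default. */
final class DiskStagingBuffer implements StagingBuffer {
    private final Path file;
    private final OutputStream out;

    DiskStagingBuffer() throws IOException {
        file = Files.createTempFile("staging-", ".tmp");
        out = Files.newOutputStream(file);
    }

    @Override public OutputStream output() { return out; }

    @Override public InputStream drain() throws IOException {
        out.close(); // finish writing before reading the file back
        return Files.newInputStream(file);
    }

    @Override public void close() throws IOException {
        out.close();
        Files.deleteIfExists(file);
    }
}

/** Picks a buffer implementation from a (hypothetical) config value. */
final class StagingBuffers {
    private StagingBuffers() {}

    static StagingBuffer create(String mode) throws IOException {
        switch (mode) {
            case "memory": return new MemoryStagingBuffer();
            case "disk":   return new DiskStagingBuffer();
            default: throw new IllegalArgumentException("unknown buffer mode: " + mode);
        }
    }
}
```

   With something like this, the OSS sink, checkpoint storage and IMap 
storage could share one config option for the buffer mode, and a 
storage-specific uploader would only consume `drain()`.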
   
   welcome to discuss~
   
   ### Usage Scenario
   
   connector-file-oss
   engine checkpoint-storage
   engine imap-storage
   
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


