MoMingMq commented on issue #10923:
URL: https://github.com/apache/seatunnel/issues/10923#issuecomment-4532877547

   Thank you for the detailed analysis and MVP suggestions. The direction is 
very reasonable and I fully agree with it.
   
   For the first implementation, here are some concrete scenarios and the 
expected behavior:
   
   **1. Usage scenarios**
   - Download reports/attachments from an internal file service API and save 
them to a local directory for downstream processing
   - Batch-pull remote images, PDFs, compressed files, etc. for local archiving
   - For now this is mainly a batch scenario; streaming support can be added 
later
   
   **2. Expected configuration and behavior**
   ```config
   env {
     parallelism = 1
     job.mode = "BATCH"
   }
   
   source {
     Http {
         method=post
         "retry_backoff_multiplier_ms"=100
         "retry_backoff_max_ms"=10000
         "json_filed_missed_return_null"="false"
         url="http://xxx/file/down"
         parallelism = 1
         format = "binary"
         # Optional: custom output path. Some APIs carry the file name as a
         # dynamic part of the URL path, such as /file/down/demo.pdf, which
         # could be written as /file/down/${filename} so that the name is
         # taken directly from the API path. If not set, the file name is
         # obtained from the request header by default.
         # file_path_expression = "downloads/${filename}" 
     }
   }
   sink {
     LocalFile {
       path = "/opt/demo2/"
       file_format_type = "binary"
     }
   }
   ```
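
To make the file-name behavior described in the config comment concrete, here is a minimal sketch in plain Java. It is only an illustration under assumptions: the class and method names (`FileNameResolver`, `fromContentDisposition`, `expand`) are hypothetical and not part of SeaTunnel, and it assumes the header in question is `Content-Disposition`. It handles only the plain `filename=` parameter, not the encoded `filename*=` form.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a SeaTunnel API.
final class FileNameResolver {
    // Matches filename=demo.pdf and filename="demo.pdf" (case-insensitive).
    private static final Pattern FILENAME =
            Pattern.compile("filename\\s*=\\s*\"?([^\";]+)\"?", Pattern.CASE_INSENSITIVE);

    // Returns the file name from a Content-Disposition value, or the fallback
    // when the header is missing or carries no usable name.
    static String fromContentDisposition(String header, String fallback) {
        if (header == null) {
            return fallback;
        }
        Matcher m = FILENAME.matcher(header);
        if (!m.find()) {
            return fallback;
        }
        String name = m.group(1).trim();
        // Keep only the last path segment so a remote server cannot
        // steer the write outside the target directory.
        int slash = Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\'));
        name = name.substring(slash + 1);
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            return fallback;
        }
        return name;
    }

    // Expands the ${filename} placeholder in an expression such as
    // "downloads/${filename}" (the placeholder syntax is taken from the
    // proposed file_path_expression option above).
    static String expand(String expression, String fileName) {
        return expression.replace("${filename}", fileName);
    }
}
```

For example, `expand("downloads/${filename}", fromContentDisposition(header, "download.bin"))` would yield `downloads/demo.pdf` for a header of `attachment; filename="demo.pdf"`. The fallback name `download.bin` is likewise just a placeholder.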
   
   **3. Large file processing**
   I agree that the first version does not need chunked, resumable downloads. 
However, I hope it can support streaming writes (writing to disk while 
downloading, rather than loading the whole file into memory); otherwise large 
files (such as 1GB+) are prone to OOM.
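
The streaming write requested above can be sketched with plain JDK I/O. This is only an illustration of the idea, not how the SeaTunnel Http source or LocalFile sink is implemented; the class name `StreamingCopy` and the buffer size are assumptions. The point is that memory use is bounded by the buffer, not by the file size.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not a SeaTunnel API.
final class StreamingCopy {
    // Copies the response body to the target file through a fixed-size
    // buffer and returns the number of bytes written. At most bufferSize
    // bytes of the body are held in memory at any time.
    static long copyToFile(InputStream body, Path target, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        try (OutputStream out = Files.newOutputStream(target)) {
            int n;
            while ((n = body.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        }
        return total;
    }
}
```

With an 8 KB buffer, a 1GB+ download would still need only about 8 KB of heap for the copy itself, which is the behavior asked for here.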


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
