MoMingMq commented on issue #10923:
URL: https://github.com/apache/seatunnel/issues/10923#issuecomment-4532877547
Thank you for the detailed analysis and MVP suggestions. The direction is
very reasonable and I fully agree with it.
For the first version, here are some concrete scenarios and the behavior I
expect:
**1. Usage scenarios**
- Download reports/attachments from an internal file service API and save
  them to a local directory for downstream processing
- Batch-pull remote images, PDFs, compressed files, etc. for local archiving
- For now this is mainly a batch scenario; streaming can be added later
**2. Expected configuration and behavior**
```config
env {
  parallelism = 1
  job.mode = "BATCH"
}
source {
  Http {
    method=post
    "retry_backoff_multiplier_ms"=100
    "retry_backoff_max_ms"=10000
    "json_filed_missed_return_null"="false"
    url="http://xxx/file/down"
    parallelism = 1
    format = "binary"
    # Optional: custom output path. Some APIs carry the file name as a dynamic
    # path segment, e.g. /file/down/demo.pdf, which can be written as
    # /file/down/${filename} so that the name is taken directly from the API
    # path. If not set, the file name is taken from the request header by
    # default.
    # file_path_expression = "downloads/${filename}"
  }
}
sink {
  LocalFile {
    path = "/opt/demo2/"
    file_format_type = "binary"
  }
}
```
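To make the expected file-name behavior concrete, here is a rough, language-neutral sketch in Python (the connector itself would be Java). It assumes the header in question is the HTTP response's `Content-Disposition`, that the last URL path segment is used as a fallback, and that `${filename}` is a plain placeholder in `file_path_expression`. The function names are hypothetical and not part of SeaTunnel.

```python
from email.message import Message
from pathlib import PurePosixPath
from typing import Mapping, Optional
from urllib.parse import unquote, urlsplit


def filename_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Return the filename parameter of Content-Disposition, if any."""
    value = headers.get("Content-Disposition")
    if not value:
        return None
    msg = Message()
    msg["Content-Disposition"] = value
    return msg.get_filename()


def filename_from_url(url: str) -> Optional[str]:
    """Return the last path segment of the URL, e.g. demo.pdf."""
    name = PurePosixPath(unquote(urlsplit(url).path)).name
    return name or None


def resolve_output_path(expression: str, url: str,
                        headers: Mapping[str, str]) -> str:
    """Substitute ${filename} in the (hypothetical) path expression.

    Assumed precedence: response header first, then URL path segment.
    """
    name = filename_from_headers(headers) or filename_from_url(url)
    if name is None:
        raise ValueError("no file name in headers or URL")
    # Keep only the final component so a header value cannot escape
    # the target directory via "../".
    name = PurePosixPath(name.replace("\\", "/")).name
    if name in ("", ".", ".."):
        raise ValueError("unusable file name")
    return expression.replace("${filename}", name)
```

With these assumptions, a response carrying `Content-Disposition: attachment; filename="report.pdf"` and the expression `downloads/${filename}` would resolve to `downloads/report.pdf`. The actual precedence and placeholder syntax are open for discussion.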
**3. Regarding large file processing**
I agree that the first version does not need chunked, resumable downloads
(breakpoint resume). However, I hope it supports streaming writes (writing to
disk while downloading, rather than loading the whole file into memory);
otherwise large files (such as 1GB+) are prone to OOM.
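As an illustration of what "streaming write" means here, a minimal Python sketch follows (again only a sketch; the real implementation would be Java inside the connector). It copies a readable binary stream to a file in fixed-size chunks, so memory use is bounded by the chunk size rather than the file size. The function name and the 64 KiB chunk size are arbitrary assumptions.

```python
import io
from typing import BinaryIO

CHUNK_SIZE = 64 * 1024  # assumed buffer size; any bounded value works


def stream_to_file(source: BinaryIO, target_path: str,
                   chunk_size: int = CHUNK_SIZE) -> int:
    """Copy source to target_path chunk by chunk; return bytes written."""
    total = 0
    with open(target_path, "wb") as out:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:  # empty bytes means end of stream
                break
            out.write(chunk)
            total += len(chunk)
    return total


# Possible usage with the standard library (not executed here):
#   import urllib.request
#   with urllib.request.urlopen("http://xxx/file/down") as resp:
#       stream_to_file(resp, "/opt/demo2/demo.pdf")
```

Only one chunk is held in memory at a time, so a 1GB+ download does not need a 1GB+ buffer.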