DanielLeens commented on issue #10923:
URL: https://github.com/apache/seatunnel/issues/10923#issuecomment-4538943252

   Thanks, this follow-up makes the scope much clearer.
   
   With the concrete batch-first scenario, config sketch, and the large-file 
constraint, I agree this should stay open as a feature request rather than a 
question thread.
   
   After checking the current source path, the main gap is on the `Http` source 
side:
   
   - the response is still materialized as a `String`
   - the schema path only supports JSON
   - there is no built-in filename / content-type / part metadata contract to 
pass downstream
   
   One important implementation detail here is that SeaTunnel already has an 
existing binary row model on the file side: the binary file sink expects 
`data`, `relativePath`, and `partIndex` semantics rather than an arbitrary text 
payload. So the cleanest first version may be to make `Http` binary mode align 
with that existing binary contract, instead of inventing a separate one only 
for `Http`.
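
To make that contract concrete, here is a minimal sketch of how an HTTP response body could be cut into rows carrying `data`, `relativePath`, and `partIndex`. `BinaryChunk`, `BinaryChunker`, and `emitChunks` are hypothetical names used only for illustration; they are not SeaTunnel classes, and the real binary row layout should be taken from the file connector source.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.function.Consumer;

// Hypothetical row shape mirroring the data / relativePath / partIndex fields
// mentioned above. This is an illustration, not SeaTunnel's actual row class.
final class BinaryChunk {
    final byte[] data;
    final String relativePath;
    final long partIndex;

    BinaryChunk(byte[] data, String relativePath, long partIndex) {
        this.data = data;
        this.relativePath = relativePath;
        this.partIndex = partIndex;
    }
}

final class BinaryChunker {
    // Reads the stream in fixed-size pieces and hands each piece downstream,
    // so only one chunk of the payload is buffered at a time.
    static void emitChunks(InputStream in, String relativePath, int chunkSize,
                           Consumer<BinaryChunk> out) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        byte[] buffer = new byte[chunkSize];
        long partIndex = 0;
        int read;
        // readNBytes fills the buffer unless the stream ends first; 0 means end of stream.
        while ((read = in.readNBytes(buffer, 0, chunkSize)) > 0) {
            out.accept(new BinaryChunk(Arrays.copyOf(buffer, read), relativePath, partIndex++));
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "hello binary world".getBytes(StandardCharsets.UTF_8);
        emitChunks(new ByteArrayInputStream(payload), "demo/hello.bin", 8,
                c -> System.out.println(
                        c.relativePath + " part " + c.partIndex + " (" + c.data.length + " bytes)"));
    }
}
```

In a real source the `InputStream` would come from the HTTP response entity rather than a byte array, and the consumer would be the connector's row collector.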
   
   Given your latest details, a practical first phase would be:
   
   1. batch mode first
   2. explicit `format = binary`
   3. filename/path propagation with clear precedence between 
`file_path_expression` and `Content-Disposition`
   4. chunked emission / streaming write semantics so large files do not have 
to be fully materialized in memory
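
For item 3, the precedence still has to be decided. The sketch below assumes one possible rule: an already-evaluated `file_path_expression` value wins, then the `filename` parameter of `Content-Disposition`, then a caller-supplied fallback. `TargetPathResolver` and its methods are hypothetical names, and the parser only handles the plain `filename=` parameter, not the extended `filename*=` form.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the precedence order here is an assumption, not a decided design.
final class TargetPathResolver {
    // Matches only the plain filename parameter (quoted string or bare token).
    private static final Pattern FILENAME = Pattern.compile(
            "(?i)(?:^|;)\\s*filename\\s*=\\s*(?:\"([^\"]*)\"|([^;\\s]+))");

    static Optional<String> fromContentDisposition(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = FILENAME.matcher(header);
        if (!m.find()) {
            return Optional.empty();
        }
        String raw = m.group(1) != null ? m.group(1) : m.group(2);
        // Keep only the last path segment so a server-supplied name cannot
        // steer the write outside the target directory.
        int cut = Math.max(raw.lastIndexOf('/'), raw.lastIndexOf('\\'));
        String name = raw.substring(cut + 1);
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            return Optional.empty();
        }
        return Optional.of(name);
    }

    // Assumed order: configured path, then Content-Disposition, then fallback.
    static String resolve(String configuredPath, String contentDisposition, String fallback) {
        if (configuredPath != null && !configuredPath.trim().isEmpty()) {
            return configuredPath;
        }
        return fromContentDisposition(contentDisposition).orElse(fallback);
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, "attachment; filename=\"report.pdf\"", "download.bin"));
    }
}
```

Letting the explicit config win keeps job output paths predictable even when the server changes its headers; the reverse order is equally implementable by swapping the two checks in `resolve`.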
   
   Retry/resume, richer streaming semantics, and more advanced multipart 
behavior can then be deferred to a later phase.
   
   This looks useful enough to keep open as a feature enhancement. We've 
labeled it `help wanted` so contributors can pick it up, but keeping the first 
version tightly scoped will make it much easier to land.
   


