potiuk commented on pull request #17609:
URL: https://github.com/apache/airflow/pull/17609#issuecomment-899055233


   > Is there any advantage on saving the file locally in a temporary manner? I 
am wondering if it makes sense to just change the way it uploads the file to S3 
without giving the option to store the temporary file in local system
   
   I think the main reason is the implementation details of `upload_fileobj`. 
It's not obvious how the data is buffered while `upload_fileobj` runs, so 
there might be significant memory usage during this operation. From what I see 
in the description of `upload_fileobj`, whenever possible it will use multiple 
threads and upload the S3 object in parallel, which (I know for a fact) can 
speed up the S3 upload immensely (this is how S3 upload is designed). However 
(my guess, but quite likely), this cannot be done if the "fileobj" does not 
provide "seek()" functionality. Looking at how sftp get is implemented, its 
fileobj does not allow seek; it can only read the file sequentially (this is 
how the sftp protocol works, I believe). It could only provide "seek" if it 
loaded the file entirely in memory first (but this would not be good for huge 
files).
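
   To make the seekable vs. sequential-only distinction concrete, here is a 
minimal sketch using only the standard library. `SequentialOnlyReader` and 
`can_seek` are hypothetical names made up for illustration; the sketch does 
not check what a real SFTP file object reports, nor how `upload_fileobj` 
reacts to a non-seekable stream.

```python
import io


class SequentialOnlyReader(io.RawIOBase):
    """Hypothetical wrapper: a forward-only, read-only view of a stream."""

    def __init__(self, raw):
        self._raw = raw

    def readable(self):
        return True

    def seekable(self):
        # Declare that random access is not available.
        return False

    def readinto(self, buffer):
        # Copy the next chunk from the wrapped stream into the caller's buffer.
        data = self._raw.read(len(buffer))
        buffer[:len(data)] = data
        return len(data)


def can_seek(fileobj):
    """Return True only if the object says it supports seek()."""
    seekable = getattr(fileobj, "seekable", None)
    return bool(seekable and seekable())
```

   A caller could use a check like `can_seek(fileobj)` to decide whether to 
stream directly or to fall back to a local temporary file.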
   
   So if you have a fast (local network) sftp connection, downloading the file 
first and then uploading the local file might significantly speed up the 
transfer, as `upload_fileobj` will be able to utilise multiple threads to 
upload. That's mostly an educated guess, but I think it's very likely.
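
   The two approaches being compared can be sketched roughly as below. This 
is an illustration, not the operator's actual code: the function names are 
hypothetical, `sftp_client` is assumed to behave like a paramiko `SFTPClient` 
(`get`, `open`) and `s3_client` like a boto3 S3 client (`upload_file`, 
`upload_fileobj`). Which one is faster is not something the sketch measures.

```python
import tempfile


def transfer_via_temp_file(sftp_client, s3_client, remote_path, bucket, key):
    """Download to a local temporary file first, then upload that file."""
    with tempfile.NamedTemporaryFile() as tmp:
        # Assumed paramiko-style call: get(remotepath, localpath).
        sftp_client.get(remote_path, tmp.name)
        # Assumed boto3-style call: upload_file(Filename, Bucket, Key).
        s3_client.upload_file(tmp.name, bucket, key)


def transfer_streaming(sftp_client, s3_client, remote_path, bucket, key):
    """Pass the remote file object straight to the upload call."""
    # Assumed paramiko-style call: open(filename, mode) returns a file object.
    with sftp_client.open(remote_path, "rb") as remote_file:
        # Assumed boto3-style call: upload_fileobj(Fileobj, Bucket, Key).
        s3_client.upload_fileobj(remote_file, bucket, key)
```

   The temporary-file variant costs local disk space and an extra write, in 
exchange for handing the uploader a regular local file.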


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


