he-gong-sardine opened a new pull request, #36545: URL: https://github.com/apache/airflow/pull/36545
closes: #34995

---

### Overview:

The current implementation in Airflow allows file transfer from SFTP to GCS by first downloading the file to the worker's local storage, then uploading it to GCS. This approach can lead to low disk space issues when dealing with large files. This PR introduces a direct streaming method from SFTP to GCS, mitigating disk space concerns.

### Key Changes:

This PR offers two options for streaming files from SFTP to GCS:

- High Memory (Over 100MB): It is recommended to use Google's `upload_from_file` method, which is more robust and includes a retry feature. This method is preferable when sufficient memory is available.
- Low Memory: Lower the `stream_chunk_size` and employ the custom method for streaming. This approach transfers smaller chunks of data, which is advantageous for systems with limited memory.

For logging transfer progress:

- With Option 1, a custom `source_stream_wrapper` can be implemented to track the progress.
- With Option 2, in addition to using a custom `source_stream_wrapper`, setting the `log_interval` enables built-in simple logging of transfer progress.

### Significance:

These enhancements significantly improve the operator's capability in handling large files. They offer greater control over memory utilization and transfer progress logging, leading to more efficient resource management and monitoring in ETL pipelines.

### Documentation:

Documentation updates are pending. I am currently seeking guidance on locating the source files for documentation to ensure all necessary updates are made.

### Tests:

A new test class with multiple tests has been added to cover the functionalities introduced in this PR. A setup function is included in the test class to avoid repetition of setup code across tests.
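
For readers who want a concrete picture of the low-memory option described under Key Changes, here is a minimal sketch of chunked streaming between two file-like objects. It is not the code from this PR. The names `stream_copy` and `ProgressReader` are hypothetical, and the meaning given to `log_interval` here (number of bytes between progress messages) is an assumption; only `stream_chunk_size`, `log_interval`, and the idea of a source stream wrapper come from the description above.

```python
import io
import logging

log = logging.getLogger(__name__)


class ProgressReader:
    """Hypothetical source stream wrapper that counts the bytes read through it."""

    def __init__(self, stream):
        self._stream = stream
        self.bytes_read = 0

    def read(self, size=-1):
        data = self._stream.read(size)
        self.bytes_read += len(data)
        return data


def stream_copy(source, sink, stream_chunk_size=1024 * 1024, log_interval=None):
    """Copy source to sink one chunk at a time and return the byte count.

    At most one chunk of stream_chunk_size bytes is held in memory at once.
    log_interval is assumed to be a byte count between progress messages.
    """
    if stream_chunk_size <= 0:
        raise ValueError("stream_chunk_size must be positive")

    total = 0
    next_log = log_interval
    while True:
        chunk = source.read(stream_chunk_size)
        if not chunk:  # an empty read signals end of stream
            break
        sink.write(chunk)
        total += len(chunk)
        if log_interval and total >= next_log:
            log.info("Transferred %d bytes so far", total)
            next_log = total + log_interval
    return total


if __name__ == "__main__":
    # In-memory stand-ins for the remote file and the destination object.
    source = ProgressReader(io.BytesIO(b"x" * 5000))
    sink = io.BytesIO()
    copied = stream_copy(source, sink, stream_chunk_size=1024, log_interval=2048)
    print(copied, source.bytes_read)
```

In a real transfer the source could be the file object returned by paramiko's `SFTPClient.open(path, "rb")` and the sink a writer from `Blob.open("wb")` in `google-cloud-storage`; whether the PR wires them together this way is not stated in the description. A smaller `stream_chunk_size` lowers peak memory at the cost of more read and write calls.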
