he-gong-sardine opened a new pull request, #36545:
URL: https://github.com/apache/airflow/pull/36545

   closes: #34995 
   
   ---
   
   ### Overview:
   The operator currently transfers a file from SFTP to GCS by first 
downloading it to the worker's local storage and then uploading it to GCS. 
With large files, this can exhaust the worker's disk space. This PR adds a 
method that streams the file directly from SFTP to GCS, which avoids staging 
it on local disk.
   
   ### Key Changes:
   This PR offers two options for streaming files from SFTP to GCS:
   
   - High memory (over 100MB): use Google's `upload_from_file` method, which 
is more robust and includes a retry feature. This is the recommended option 
when sufficient memory is available.
   - Low memory: lower `stream_chunk_size` and use the custom streaming 
method, which transfers the data in smaller chunks and suits systems with 
limited memory.
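
The low-memory option can be pictured as a plain chunked copy loop. The following is a minimal sketch using only the standard library, not the code in this PR: the function name `stream_in_chunks` and its signature are hypothetical, the parameter names `chunk_size` and `log_interval` only mirror the options described above, and the `io.BytesIO` objects stand in for the remote SFTP file and the GCS upload stream.

```python
import io
import logging

log = logging.getLogger(__name__)


def stream_in_chunks(source, sink, chunk_size=1024 * 1024, log_interval=None):
    """Copy ``source`` to ``sink`` chunk by chunk and return the bytes copied.

    ``source`` needs ``read(n)`` and ``sink`` needs ``write(data)``. Only one
    chunk is held in memory at a time. If ``log_interval`` is set, progress is
    logged after every ``log_interval`` chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = 0
    chunks = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means end of stream
            break
        sink.write(chunk)
        total += len(chunk)
        chunks += 1
        if log_interval and chunks % log_interval == 0:
            log.info("Transferred %d bytes so far", total)
    return total


# Hypothetical stand-ins for the SFTP file object and the GCS upload stream.
source = io.BytesIO(b"x" * 2500)
sink = io.BytesIO()
copied = stream_in_chunks(source, sink, chunk_size=1000, log_interval=1)
```

Peak memory use in this loop is bounded by `chunk_size` rather than by the file size, which is the trade-off the low-memory option describes: smaller chunks use less memory but need more read and write calls.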
   
   For logging transfer progress:
   
   - With option 1 (high memory), a custom `source_stream_wrapper` can be 
implemented to track progress.
   - With option 2 (low memory), a custom `source_stream_wrapper` can also be 
used, and setting `log_interval` additionally enables simple built-in logging 
of transfer progress.
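
A progress-tracking wrapper could look roughly like the sketch below. The description above does not state the exact interface the operator expects from `source_stream_wrapper`, so this assumes it receives a readable binary stream and returns an object with the same `read` method. The class name `CountingStream` and the `on_progress` callback are hypothetical.

```python
import io


class CountingStream:
    """Wrap a readable binary stream and count the bytes read through it."""

    def __init__(self, raw, on_progress=None):
        self._raw = raw
        self._on_progress = on_progress
        self.bytes_read = 0

    def read(self, size=-1):
        data = self._raw.read(size)
        self.bytes_read += len(data)
        # Report the running total after every non-empty read.
        if data and self._on_progress is not None:
            self._on_progress(self.bytes_read)
        return data

    def __getattr__(self, name):
        # Delegate everything else (seek, tell, close, ...) to the wrapped stream.
        return getattr(self._raw, name)


# Usage with an in-memory stand-in for the SFTP file object.
progress = []
wrapped = CountingStream(io.BytesIO(b"abcdefghij"), on_progress=progress.append)
while wrapped.read(4):
    pass
```

Because the wrapper only intercepts `read`, it should work with either transfer option as long as the consumer pulls data through `read`. The callback could call a logger instead of appending to a list.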
   
   ### Significance:
   These changes let the operator handle large files without staging them on 
the worker's local disk. They also give users control over memory use and 
transfer progress logging, which helps with resource management and 
monitoring in ETL pipelines.
   
   ### Documentation:
   Documentation updates are pending. I would appreciate guidance on where 
the documentation source files are located so that I can make all the 
necessary updates.
   
   ### Tests:
   A new test class with multiple tests has been added to cover the 
functionalities introduced in this PR. A setup function is included in the test 
class to avoid repetition of setup code across tests.
   
   

