aditya-emmanuel opened a new issue, #50159:
URL: https://github.com/apache/airflow/issues/50159

   ### Apache Airflow Provider(s)
   
   amazon
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-amazon==8.28.0
   
   ### Apache Airflow version
   
   2.10.3
   
   ### Operating System
   
   macOS
   
   ### Deployment
   
   Astronomer
   
   ### Deployment details
   
   _No response_
   
   ### What happened
   
   While using the SqlToS3Operator, I've noticed that the worker's memory usage 
often far exceeds the size of the result set for large datasets. For instance, 
for a SQL query that produces a ~5 GB .txt export, the task consumes 
approximately 10 GB of memory during execution.
   
   I believe this is because the operator first loads the entire query result 
into a Pandas DataFrame (in memory), writes it to a local temporary file, and 
then uploads that file to S3.
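   If full materialization is indeed the cause, reading the query result in 
chunks would bound peak memory. A minimal sketch of the idea (using `sqlite3` 
as a stand-in connection; the table name and chunk size are illustrative, not 
part of the operator's API):

```python
import pandas as pd
import sqlite3  # stand-in for the real warehouse connection

conn = sqlite3.connect(":memory:")
pd.DataFrame({"id": range(10)}).to_sql("example", conn, index=False)

# With chunksize, read_sql yields DataFrames of bounded size instead of
# materializing the entire result set in one DataFrame.
n_rows = 0
for chunk in pd.read_sql("SELECT * FROM example", conn, chunksize=4):
    n_rows += len(chunk)  # each chunk could be appended to the output stream
```

Each chunk could then be written incrementally to the upload stream, so peak 
memory stays proportional to the chunk size rather than the full result.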
   
   ### Temporary File Name Leakage in .gz Files
   
   Additionally, when exporting to .gz files (e.g., example.txt.gz) and 
opening/unzipping the .gz file on Windows with Windows Explorer, WinRAR, or 
7-Zip, the archive reveals the temporary filename that was generated internally 
via Python's NamedTemporaryFile().
   
   Note: this issue occurs only when opening/unzipping the .gz file on 
Windows; it does not appear on macOS or Linux.
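   The leak is consistent with how the gzip format itself works: when Python's 
`gzip.GzipFile` is given a filename, it stores that name in the optional FNAME 
field of the gzip header, which Windows archive tools display as the member 
name. A small stdlib-only sketch (the filename here is illustrative):

```python
import gzip
import io

# Giving GzipFile a filename stores it in the FNAME field of the gzip
# header -- this is the name that Windows archive tools display.
buf = io.BytesIO()
with gzip.GzipFile(filename="example.txt", mode="wb", fileobj=buf) as gz:
    gz.write(b"col1|col2\n1|2\n")

data = buf.getvalue()
# Byte 3 is the FLG field; bit 3 (0x08) marks an embedded FNAME.
assert data[3] & 0x08
# The embedded name is the NUL-terminated string starting at byte 10.
end = data.index(b"\x00", 10)
print(data[10:end].decode("latin-1"))  # -> example.txt
```

When the operator writes through a `NamedTemporaryFile`, it is presumably that 
random temp name that ends up embedded in the header.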
   
   
![Image](https://github.com/user-attachments/assets/addc2f28-ea5d-483c-b781-86a00dba9225)
   
   
   
   ### What you think should happen instead
   
   - Support direct upload to S3 via file-like buffers to avoid writing to disk
   
   - Control the internal filename for compressed outputs to avoid leaking 
temporary names
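   
   Both suggestions could be addressed at once: serializing to an in-memory 
buffer with an empty gzip FNAME field avoids the temporary file and the name 
leak together. A hedged sketch (`df_to_gzip_csv_buffer` is a hypothetical 
helper, not part of the provider):

```python
import gzip
import io

import pandas as pd


def df_to_gzip_csv_buffer(df: pd.DataFrame, sep: str = "|") -> io.BytesIO:
    """Serialize a DataFrame to gzip'd CSV entirely in memory.

    filename="" leaves the gzip FNAME field empty, so no temp-file
    name can leak into the archive.
    """
    buf = io.BytesIO()
    with gzip.GzipFile(filename="", mode="wb", fileobj=buf) as gz:
        df.to_csv(gz, sep=sep, index=False)
    buf.seek(0)
    return buf


# The buffer could then be handed to S3Hook.load_file_obj() (or boto3's
# upload_fileobj) instead of round-tripping through a NamedTemporaryFile.
buf = df_to_gzip_csv_buffer(pd.DataFrame({"a": [1, 2], "b": [3, 4]}))
```

This keeps exactly one copy of the serialized data in memory and gives the 
operator full control over what, if anything, is embedded in the gzip header.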
   
   ### How to reproduce
   
   
   DAG snippet:
   
   ```python
   import csv
   from datetime import timedelta

   from airflow.providers.amazon.aws.transfers.sql_to_s3 import SqlToS3Operator

   extract_operator = SqlToS3Operator(
       task_id="extract_example_data",
       sql_conn_id="redshift_conn",
       query="SELECT * FROM example;",
       s3_bucket="{{ var.value.S3_BUCKET }}",
       s3_key="example.txt.gz",
       aws_conn_id="aws_datalake_conn",
       file_format="csv",
       replace=True,
       retries=3,
       retry_delay=timedelta(minutes=1),
       retry_exponential_backoff=True,
       pd_kwargs={
           "sep": "|",
           "index": False,
           "quoting": csv.QUOTE_MINIMAL,
           "compression": "gzip",
       },
   )
   ```
   
   ### Anything else
   
   Code reference
   
https://github.com/apache/airflow/blob/main/providers/amazon/src/airflow/providers/amazon/aws/transfers/sql_to_s3.py#L195-L203
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
