kalyani-subbiah opened a new issue, #23514:
URL: https://github.com/apache/airflow/issues/23514
### Apache Airflow Provider(s)
amazon
### Versions of Apache Airflow Providers
_No response_
### Apache Airflow version
2.3.0 (latest released)
### Operating System
Mac OS Mojave 10.14.6
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
_No response_
### What happened
When I download a JSON file from S3 using the S3Hook:

```python
filename = s3_hook.download_file(
    bucket_name=self.source_s3_bucket, key=key, local_path="./data"
)
```

The file is downloaded as a plain text file whose name starts with `airflow_temp_`, rather than keeping its original name or `.json` extension.
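For reference, a minimal standalone sketch of the same behaviour outside the operator (assuming the `Syntax_S3` connection and `processed-filings` bucket from the DAG below; the key is a placeholder) would be:

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

hook = S3Hook(aws_conn_id="Syntax_S3")
# the key is a placeholder for any JSON object in the bucket
path = hook.download_file(
    key="citations/example.json",
    bucket_name="processed-filings",
    local_path="./data",
)
print(path)  # something like ./data/airflow_temp_xxxxxxxx, not ./data/example.json
```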
### What you think should happen instead
The file should be downloaded as a JSON file, or ideally keep the same filename it has in S3. As it stands, extra code is needed to read the file back into a dictionary (e.g. `ast.literal_eval`), and there is no guarantee that the JSON structure is preserved.
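Not part of the original report, but as a hedged workaround: since only the filename changes and the file contents are intact, the temp file can still be parsed directly with `json.load` (assuming the object really is valid JSON), which avoids `ast.literal_eval`:

```python
import json

filename = s3_hook.download_file(
    bucket_name=self.source_s3_bucket, key=key, local_path="./data"
)
with open(filename) as handle:
    filing = json.load(handle)  # parses the JSON regardless of the temp file name
```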
### How to reproduce
In the code below, `s3_conn_id` is the ID of an Airflow connection and the `*_s3_bucket` arguments are buckets on AWS S3.
This is the custom operator class:
```python
import logging
import pickle

from airflow.models.baseoperator import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.hooks.S3_hook import S3Hook


class S3SearchFilingsOperator(BaseOperator):
    """
    Queries the Datastore API and uploads the processed info as a csv to
    the S3 bucket.

    :param source_s3_bucket: Choose source s3 bucket
    :param source_s3_directory: Source s3 directory
    :param s3_conn_id: S3 Connection ID
    :param destination_s3_bucket: S3 Bucket Destination
    """

    @apply_defaults
    def __init__(
            self,
            source_s3_bucket=None,
            source_s3_directory=True,
            s3_conn_id=True,
            destination_s3_bucket=None,
            destination_s3_directory=None,
            search_terms=[],
            *args,
            **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.source_s3_bucket = source_s3_bucket
        self.source_s3_directory = source_s3_directory
        self.s3_conn_id = s3_conn_id
        self.destination_s3_bucket = destination_s3_bucket
        self.destination_s3_directory = destination_s3_directory

    def execute(self, context):
        """
        Executes the operator.
        """
        s3_hook = S3Hook(self.s3_conn_id)
        keys = s3_hook.list_keys(bucket_name=self.source_s3_bucket)
        for key in keys:
            # download file
            filename = s3_hook.download_file(
                bucket_name=self.source_s3_bucket, key=key, local_path="./data"
            )
            logging.info(filename)
            with open(filename, 'rb') as handle:
                filing = handle.read()
                filing = pickle.loads(filing)
                logging.info(filing.keys())
```
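A possible stopgap inside `execute()` above (not something the hook does itself) is to move the temp file so it keeps the key's basename and `.json` extension; `local_name` below is a hypothetical variable introduced only for illustration:

```python
import os
import shutil

# hypothetical: rename the airflow_temp_* file to the original key's basename
local_name = os.path.join("./data", os.path.basename(key))
shutil.move(filename, local_name)
```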
And this is the DAG file:
```python
from datetime import timedelta

from airflow import DAG
from airflow.utils.dates import days_ago

from keywordSearch.operators.s3_search_filings_operator import S3SearchFilingsOperator

# from aws_pull import aws_pull

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": days_ago(2),
    "email": ["[email protected]"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(seconds=30),
}

with DAG("keyword-search-full-load",
         default_args=default_args,
         description="Syntax Keyword Search",
         max_active_runs=1,
         schedule_interval=None) as dag:
    op3 = S3SearchFilingsOperator(
        task_id="s3_search_filings",
        source_s3_bucket="processed-filings",
        source_s3_directory="citations",
        s3_conn_id="Syntax_S3",
        destination_s3_bucket="keywordsearch",
        destination_s3_directory="results",
        dag=dag,
    )
    op3
```
### Anything else
_No response_
### Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)