curiousjazz77 opened a new issue #8448: Create S3ListPrefixesOperator that 
returns subfolders from an S3 bucket.
URL: https://github.com/apache/airflow/issues/8448
 
 
   **Description**
   
   Create an operator that lists S3 prefixes and returns them as the list of 
subfolders in an S3 bucket. 
   
   **Use case / motivation**
   We would like an operator that returns the subfolders in an S3 bucket so 
that we can determine partition keys for an ETL job. We would parse the 
returned subfolders and pass them to a LivyOperator or the AWSAthenaOperator, 
which would then load the data into a table partitioned by those keys (that is, 
by the subfolders found in the bucket). 
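
As a rough illustration of the parsing step described above, the sketch below turns a list of returned prefixes into partition key/value pairs. It assumes a Hive-style `key=value` folder layout, which the issue does not specify; the function name `partition_values` and the sample prefixes are hypothetical.

```python
from typing import List, Tuple


def partition_values(prefixes: List[str], delimiter: str = "/") -> List[Tuple[str, str]]:
    """Extract (key, value) pairs from prefixes such as 'data/dt=2020-04-01/'.

    Assumes a Hive-style 'key=value' layout (an assumption, not something
    stated in the issue). Prefixes whose last segment has no '=' are skipped.
    """
    pairs = []
    for prefix in prefixes:
        # Take the last non-empty path segment, e.g. 'dt=2020-04-01'.
        segment = prefix.rstrip(delimiter).split(delimiter)[-1]
        if "=" not in segment:
            continue
        key, value = segment.split("=", 1)
        pairs.append((key, value))
    return pairs
```

The resulting pairs could then be templated into the query or job arguments handed to the downstream operator.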
   
   Currently, the 
[S3ListOperator](https://github.com/apache/airflow/blob/master/airflow/providers/amazon/aws/operators/s3_list.py)
 uses the S3Hook to list keys in a bucket:
   
   ```
   def execute(self, context):
       hook = S3Hook(aws_conn_id=self.aws_conn_id, verify=self.verify)
   
       self.log.info(
           'Getting the list of files from bucket: %s in prefix: %s (Delimiter %s)',
           self.bucket, self.prefix, self.delimiter
       )
   
       return hook.list_keys(
           bucket_name=self.bucket,
           prefix=self.prefix,
           delimiter=self.delimiter)
   ```
   
   It returns only files, not subfolders, because the 
[S3Hook's](https://github.com/apache/airflow/blob/8517cb189872aebdd70e9d5d584478cd7fb65d0f/airflow/providers/amazon/aws/hooks/s3.py#L241)
 `list_keys` method reads only the `Contents` entry of each result page. 
   
   ```
   for page in response:
       if 'Contents' in page:
           has_results = True
           for k in page['Contents']:
               keys.append(k['Key'])
   ```
   
   To get the subfolders in a bucket, we should instead use the 
[list_prefixes](https://github.com/apache/airflow/blob/8517cb189872aebdd70e9d5d584478cd7fb65d0f/airflow/providers/amazon/aws/hooks/s3.py#L199)
 method of the S3Hook, which collects the entries listed under `CommonPrefixes` 
on each page of the response. 
   
   Currently, no operator uses this method. This issue proposes creating an 
S3ListPrefixesOperator that calls the `list_prefixes` method of the S3Hook. 
The relevant part of that method is: 
   ```
   for page in response:
       if 'CommonPrefixes' in page:
           has_results = True
           for common_prefix in page['CommonPrefixes']:
               prefixes.append(common_prefix['Prefix'])
   
   if has_results:
       return prefixes
   return None
   ```
   
   We have built an operator like this internally, and it has worked in our 
preliminary trials. 
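
For discussion, here is a minimal sketch of the shape such an operator could take. It is not the implementation tested internally. To stay runnable without Airflow installed, the class below is a plain Python stand-in rather than a `BaseOperator` subclass, and the hook is injected instead of being built from `aws_conn_id`. `S3ListPrefixesSketch` and `FakeHook` are hypothetical names. The only call assumed on the real S3Hook is `list_prefixes(bucket_name=..., prefix=..., delimiter=...)`.

```python
from typing import Any, List, Optional


class S3ListPrefixesSketch:
    """Plain-Python stand-in for the proposed operator (not an Airflow class)."""

    def __init__(self, hook: Any, bucket: str, prefix: str = "", delimiter: str = "/") -> None:
        # `hook` is any object exposing list_prefixes(bucket_name, prefix, delimiter),
        # for example an S3Hook instance. A real operator would likely create the
        # hook inside execute() from a connection id instead.
        self.hook = hook
        self.bucket = bucket
        self.prefix = prefix
        self.delimiter = delimiter

    def execute(self, context: Optional[dict] = None) -> Optional[List[str]]:
        # Delegate to the hook; per the excerpt above it returns None when
        # no common prefixes are found.
        return self.hook.list_prefixes(
            bucket_name=self.bucket,
            prefix=self.prefix,
            delimiter=self.delimiter,
        )


class FakeHook:
    """Hypothetical in-memory hook, used only to exercise the sketch."""

    def __init__(self, keys: List[str]) -> None:
        self.keys = keys

    def list_prefixes(self, bucket_name: str, prefix: str = "", delimiter: str = "") -> Optional[List[str]]:
        found = set()
        for key in self.keys:
            if not key.startswith(prefix):
                continue
            rest = key[len(prefix):]
            # Keys with a delimiter after the prefix roll up into a subfolder.
            if delimiter and delimiter in rest:
                found.add(prefix + rest.split(delimiter)[0] + delimiter)
        return sorted(found) or None
```

Injecting the hook is a choice made for testability in this sketch; an operator inside Airflow would more likely mirror S3ListOperator and accept `aws_conn_id` and `verify`.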
