Kengo Seki created AIRFLOW-2382:
-----------------------------------
Summary: Fix wrong description for delimiter
Key: AIRFLOW-2382
URL: https://issues.apache.org/jira/browse/AIRFLOW-2382
Project: Apache Airflow
Issue Type: Bug
Components: aws, operators
Reporter: Kengo Seki
The documentation for S3ListOperator says:
{code}
:param delimiter: The delimiter by which you want to filter the objects.
For e.g to lists the CSV files from in a directory in S3 you would use
delimiter='.csv'.
{code}
{code}
**Example**:
The following operator would list all the CSV files from the S3
``customers/2018/04/`` key in the ``data`` bucket. ::
s3_file = S3ListOperator(
task_id='list_3s_files',
bucket='data',
prefix='customers/2018/04/',
delimiter='.csv',
aws_conn_id='aws_customers_conn'
)
{code}
but it actually behaves the opposite way, excluding keys that contain the
delimiter:
{code}
In [1]: from airflow.contrib.operators.s3_list_operator import S3ListOperator
In [2]: S3ListOperator(task_id='t', bucket='bkt0', prefix='',
aws_conn_id='s3').execute(None)
[2018-04-26 10:34:27,001] {connectionpool.py:735} INFO - Starting new HTTPS
connection (1): bkt0.s3.amazonaws.com
[2018-04-26 10:34:27,711] {connectionpool.py:735} INFO - Starting new HTTPS
connection (1): bkt0.s3-ap-northeast-1.amazonaws.com
[2018-04-26 10:34:27,801] {connectionpool.py:735} INFO - Starting new HTTPS
connection (1): bkt0.s3.ap-northeast-1.amazonaws.com
Out[2]: ['0.csv', '1.txt', '2.jpg', '3.exe']
In [3]: S3ListOperator(task_id='t', bucket='bkt0', prefix='', aws_conn_id='s3',
delimiter='.csv').execute(None)
[2018-04-26 10:34:39,722] {connectionpool.py:735} INFO - Starting new HTTPS
connection (1): bkt0.s3.amazonaws.com
[2018-04-26 10:34:40,483] {connectionpool.py:735} INFO - Starting new HTTPS
connection (1): bkt0.s3-ap-northeast-1.amazonaws.com
[2018-04-26 10:34:40,569] {connectionpool.py:735} INFO - Starting new HTTPS
connection (1): bkt0.s3.ap-northeast-1.amazonaws.com
Out[3]: ['1.txt', '2.jpg', '3.exe']
{code}
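This is because the 'delimiter' parameter represents the path hierarchy
(so '/' is typically used), not a file extension. Keys that contain the
delimiter are grouped under a common prefix instead of being listed, so
delimiter='.csv' drops the CSV files rather than selecting them.
S3ToGoogleCloudStorageOperator has the same problem.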
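
To illustrate the behavior, here is a minimal sketch in plain Python that
models how an S3 listing treats a delimiter. The function name
list_keys_model and its grouping logic are assumptions made for illustration:
a simplified model of the listing semantics, not the Airflow or boto3
implementation. It reuses the four keys from the session above.

```python
def list_keys_model(keys, prefix="", delimiter=""):
    """Simplified, hypothetical model of S3 listing with a delimiter.

    Keys whose remainder (after the prefix) contains the delimiter are
    rolled up into a common prefix; only the other keys are returned as
    plain keys.
    """
    contents, common_prefixes = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        idx = rest.find(delimiter) if delimiter else -1
        if idx == -1:
            # No delimiter in the remainder: listed as an ordinary key.
            contents.append(key)
        else:
            # Delimiter found: the key is grouped, not listed.
            rolled = prefix + rest[:idx + len(delimiter)]
            if rolled not in common_prefixes:
                common_prefixes.append(rolled)
    return contents, common_prefixes


keys = ["0.csv", "1.txt", "2.jpg", "3.exe"]

# delimiter='.csv' groups the CSV key away instead of selecting it.
contents, common = list_keys_model(keys, delimiter=".csv")
print(contents)  # ['1.txt', '2.jpg', '3.exe']
print(common)    # ['0.csv']

# One way to actually select CSV files: list everything, then filter
# by suffix on the client side.
all_keys, _ = list_keys_model(keys)
csv_keys = [k for k in all_keys if k.endswith(".csv")]
print(csv_keys)  # ['0.csv']
```

With delimiter='/', the same model groups keys by "directory", which is the
use the parameter is intended for.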
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)