SamWheating opened a new issue, #63371: URL: https://github.com/apache/airflow/issues/63371
### Description

The S3Hook registers an output asset for every file it uploads:

https://github.com/apache/airflow/blob/e491aac92a1d50f322a08d92d83616e7c79b3f2e/providers/amazon/src/airflow/providers/amazon/aws/hooks/s3.py#L1376-L1379

This is not always the desired behaviour when using the S3Hook (see the motivating example below). I'd propose adding a switch to the S3Hook to disable this sort of lineage:

```python
hook = S3Hook(enable_hook_level_lineage=False)
```

I am happy to submit a fix here, but I wanted to run it by y'all first to make sure that I'm not missing some previous context or undoing an intentional design decision.

### Use case/motivation

We have seen issues where users upload chunked data to S3 within a PythonOperator like so:

```python
hook = S3Hook()
for idx, data in enumerate(list_of_values):
    hook.load_string(data, f"some_prefix/file_{idx}.txt", "some_bucket")
```

This then creates a _ton_ of output assets. I know that this is limited to 100 output objects (since https://github.com/apache/airflow/pull/45798), but it would be nice if we could disable hook-level lineage altogether and instead manage our own output asset definitions at the custom operator / PythonOperator level. In this case, we likely want a single output asset at the `some_prefix/` level, not one per file.

### Related issues

_No response_

### Are you willing to submit a PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
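To illustrate the shape of the change, here's a rough, self-contained sketch of how the proposed flag could gate the hook-level lineage call. The collector class, the `enable_hook_level_lineage` parameter, and the `S3HookSketch` class are simplified stand-ins for this issue, not the actual Airflow internals:

```python
# Hypothetical sketch only: `LineageCollector`, `S3HookSketch`, and the
# `enable_hook_level_lineage` flag are illustrative stand-ins, not the
# real Airflow hook-lineage API.

class LineageCollector:
    """Stand-in for Airflow's hook lineage collector."""

    def __init__(self):
        self.output_assets = []

    def add_output_asset(self, scheme, asset_kwargs):
        self.output_assets.append((scheme, asset_kwargs))


collector = LineageCollector()


class S3HookSketch:
    """Toy model of S3Hook showing where the flag would short-circuit lineage."""

    def __init__(self, enable_hook_level_lineage: bool = True):
        self.enable_hook_level_lineage = enable_hook_level_lineage

    def load_string(self, string_data: str, key: str, bucket_name: str) -> None:
        # ... the actual upload to S3 would happen here ...
        # Only register an output asset when hook-level lineage is enabled.
        if self.enable_hook_level_lineage:
            collector.add_output_asset(
                scheme="s3",
                asset_kwargs={"bucket": bucket_name, "key": key},
            )


# With the flag off, the chunked-upload loop registers no per-file assets,
# leaving the user free to declare a single `some_prefix/`-level outlet
# on the operator instead.
hook = S3HookSketch(enable_hook_level_lineage=False)
for idx in range(100):
    hook.load_string("data", f"some_prefix/file_{idx}.txt", "some_bucket")

print(len(collector.output_assets))  # → 0
```

Defaulting the flag to `True` keeps the current behaviour for existing users, so the change would be purely opt-out.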
