ashb commented on pull request #16084:
URL: https://github.com/apache/airflow/pull/16084#issuecomment-865764114


   This example pipeline encodes a few anti-patterns that we don't want to 
encourage:
   
   Having one task make an HTTP request and write the response to a local 
file, and then having a second task pick up that file and process it, will 
not work for a number of reasons:
   
   
   1. If you re-run an old `insert_data` task, it's going to insert _new_ data.
   1. This task is not idempotent -- every time you run it, you will just get 
another copy of the rows inserted. We should use UPSERT or some kind of 
"delete date range then insert" rather than a blind insert.
   1. It won't work outside of the LocalExecutor -- if you use the 
CeleryExecutor, the `get_data` and `insert_data` tasks could end up running 
on different nodes, and in the case of Kubernetes it _will not_ work, as the 
file in the container will be thrown away when the task finishes and the pod 
is deleted.
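   For reference, the "delete date range then insert" pattern mentioned in 
point 2 can be sketched like this (using an in-memory SQLite database and 
illustrative table/column names, not the actual code from this PR):
   
   ```python
   import sqlite3
   
   def load_partition(conn, ds, rows):
       """Idempotent load: delete the target date range, then insert.
       Re-running for the same `ds` replaces rows instead of duplicating
       them. (Table and column names here are illustrative only.)"""
       with conn:  # one transaction: delete + insert commit together
           conn.execute("DELETE FROM events WHERE ds = ?", (ds,))
           conn.executemany(
               "INSERT INTO events (ds, value) VALUES (?, ?)",
               [(ds, v) for v in rows],
           )
   
   conn = sqlite3.connect(":memory:")
   conn.execute("CREATE TABLE events (ds TEXT, value INTEGER)")
   
   load_partition(conn, "2021-06-22", [1, 2, 3])
   load_partition(conn, "2021-06-22", [1, 2, 3])  # re-run: still 3 rows
   count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
   ```
   
   Because the delete and insert run in one transaction keyed on the data 
interval, a retried or cleared task run converges on the same table state 
instead of appending duplicates.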
   
   @Sanchit112 Could you update your follow-on PR to take these into account?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
