Is it possible that the SQL you're running to get customer ids doesn't return the same results every time? That's what I (loosely) meant by non-deterministic.
The error message suggests that Airflow is sending out an instruction to run a DAG called "Pipeline_DEVTEST_CDB_DEVTEST_00_B10C8DBE1CFA89C1F274B", which it expects to find in pipeline.py. However, it is not finding any DAG object by that name. That's why I'm wondering whether your code always generates exactly the same DAGs. For example, if customer "DEVTEST_CDB_DEVTEST_00_B10C8DBE1CFA89C1F274B" had been deleted from your database, then Airflow might send out a command to run that DAG immediately before the deletion and be unable to load it the next time it parses the file.

In general, Airflow assumes that DAGs are static and unchanging (or slowly changing at best). If you want parameterized (per-customer) workflows, it might be best to create a single DAG whose tasks define your general workflow (for example, "retrieve customer i information" -> "process customer i information" -> "store customer i information"). That way your DAG and tasks remain stable. Perhaps someone else on the list could share an effective pattern along those lines.

Now, if that isn't your situation, then something strange is happening that's preventing Airflow from locating the DAG.

On Fri, Apr 29, 2016 at 6:49 PM harish singh <[email protected]> wrote:

> *"However a DAG with such a complicated name isn't referenced in the
> example code (just "Pipeline" + i). My guess is that the DAG id is being
> generated in a non-deterministic or time-based way, and therefore the run
> command can't find it once the generation criteria change. But hard to say
> without more detail."*
>
> [response] I am not sure what you mean by a non-deterministic way.
> We are dynamically creating DAGs (1 DAG per customer). So if there are 100
> customers with ids 1, 2, 3 .... 100, there will be 100 pipelines/DAGs with
> names:
>
> *Pipeline_1, Pipeline_2, Pipeline_3 ...... Pipeline_100*
>
> The customer ids I get from our database table, which stores this
> information.
>
> So the flow is:
>
> customer_id_list = getCustomerIds()  # run some sql
> for customerId in customer_id_list:
>     dag = DAG("Pipeline_" + customerId, default_args=default_args,
>               schedule_interval=datetime.timedelta(minutes=60))
>
> Let me know if that helps (or confuses :) ) you in understanding the flow.
> If there is something wrong in what I am doing, I would love to know what
> it is. This seems to be a serious issue (if it is), especially while
> running backfill.
>
> On Fri, Apr 29, 2016 at 12:19 PM, Jeremiah Lowin <[email protected]> wrote:
>
> > That error message usually means that an error took place inside Airflow
> > before the task ran -- maybe something with setting up the task? The
> > task's state is NONE, meaning it never even started, but the executor is
> > reporting that it successfully sent the command to start the task
> > (SUCCESS)... the culprit is some failure in between.
> >
> > The error message seems to say that the DAG itself couldn't be loaded
> > from the .py file:
> >
> > airflow.utils.AirflowException: DAG
> > [Pipeline_DEVTEST_CDB_DEVTEST_00_B10C8DBE1CFA89C1F274B]
> > could not be found in /usr/local/airflow/dags/pipeline.py
> >
> > However a DAG with such a complicated name isn't referenced in the
> > example code (just "Pipeline" + i). My guess is that the DAG id is being
> > generated in a non-deterministic or time-based way, and therefore the
> > run command can't find it once the generation criteria change. But hard
> > to say without more detail.
> >
> > On Fri, Apr 29, 2016 at 3:11 PM Bolke de Bruin <[email protected]> wrote:
> >
> > > I would really like to know what the use case is for a depends_on_past
> > > on the *task* level. What past are you trying to depend on?
> > >
> > > What I am currently assuming from just reading the example and
> > > replying on my phone is that the depends_on_past prevents execution.
> > > Have t4 and t5 ever run?
> > >
> > > Bolke
> > >
> > > Sent from my iPhone
> > >
> > > > On 29 apr. 2016, at 21:01, Chris Riccomini <[email protected]> wrote:
> > > >
> > > > @Bolke/@Jeremiah, do you guys think this is related? Full thread is here:
> > > > https://groups.google.com/forum/?pli=1#!topic/airbnb_airflow/y7wt3I24Rmw
> > > >
> > > >> On Fri, Apr 29, 2016 at 11:57 AM, Chris Riccomini <[email protected]> wrote:
> > > >>
> > > >> Please subscribe to the dev@ mailing list. Sorry to make you jump
> > > >> through hoops--I know it's annoying--but it's for a good cause. ;)
> > > >>
> > > >> This looks like a bug. I'm wondering if it's related to
> > > >> https://issues.apache.org/jira/browse/AIRFLOW-20. Perhaps the
> > > >> backfill is causing a mis-alignment between the dag runs, and
> > > >> depends_on_past logic isn't seeing the prior execution?
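P.S. To make the "single stable DAG" suggestion concrete: instead of one DAG per customer, use one DAG whose task loops over whatever customers exist at run time. Here is a minimal sketch in plain Python (no Airflow imports, so it runs standalone); `get_customer_ids`, `retrieve`, `process`, and `store` are all hypothetical stand-ins for your SQL lookup and pipeline steps, and in a real pipeline.py the `run_pipeline` body would live inside a PythonOperator callable:

```python
# Sketch of the "one stable DAG" pattern: the DAG id never changes,
# only the data the task iterates over does. Every helper below is a
# hypothetical stand-in, not your actual code.

def get_customer_ids():
    # stand-in for the real "run some sql" lookup
    return ["1", "2", "3"]

def retrieve(customer_id):
    # "retrieve customer i information"
    return {"id": customer_id, "raw": "data-for-" + customer_id}

def process(record):
    # "process customer i information"
    return dict(record, processed=True)

def store(record, sink):
    # "store customer i information"
    sink.append(record)

def run_pipeline(sink):
    # In Airflow, this loop would be the python_callable of a single
    # task (or three chained tasks) inside one DAG named, say, "Pipeline".
    for customer_id in get_customer_ids():
        store(process(retrieve(customer_id)), sink)

results = []
run_pipeline(results)
print([r["id"] for r in results])  # ['1', '2', '3']
```

The key property is that adding or deleting a customer row changes only what the task does, never which DAGs exist, so the scheduler can always find the DAG it asked to run.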
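And if you do keep one DAG per customer, the generation at least has to be deterministic: the same database state must always produce the same DAG ids, in the same order, every time the scheduler parses pipeline.py. A minimal sketch of just the id-generation step (plain Python; the hard-coded list stands in for your SQL result):

```python
def make_dag_ids(customer_ids):
    # Sort so the generated ids never depend on SQL result ordering,
    # and build each id from stable data only (no timestamps, no uuids).
    return ["Pipeline_" + cid for cid in sorted(customer_ids)]

print(make_dag_ids(["3", "1", "2"]))  # ['Pipeline_1', 'Pipeline_2', 'Pipeline_3']
```

In the real pipeline.py each id would then be registered at module level (e.g. `globals()[dag_id] = DAG(dag_id, ...)`). Note that even then, a customer deleted from the table mid-run will still produce exactly the "could not be found" error above for any in-flight task of its DAG.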
