[ https://issues.apache.org/jira/browse/AIRFLOW-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Javier Domingo Cansino updated AIRFLOW-2480:
--------------------------------------------
    Description:
Currently, Airflow runs on a date basis. All the scheduling and running logic assumes that ETLs depend on the date they are run. However, there is another set of use cases where it is not the date that varies, but the dataset itself.

One example application is processing genomic data. This data doesn't change, but the use case is to run all the DAGs you may have on samples, rather than on dates. The same applies to services that rely on performing a set of operations on a dataset once.

For now, one way to solve this is by creating a DAG per user, scheduling it with None, and triggering it manually from the UI/CLI. However, this has the drawback that there is only one column of dates per DAG, as new datasets will just create new DAGs.

Of course, backfill processes would be used to run a specific DAG on all the samples, rather than on just a specific one.

The features of such a system would be as follows:
* Dates are irrelevant: different dates produce the same output for the same dataset, so only one run per dataset is required
* Date-based scheduling is irrelevant, and the addition of new datasets is the only thing that would trigger new DAGRuns

There are a few questions I would like to ask:
* How tightly coupled is the current design of the scheduler/executors in Airflow to dates?
* Is this a contribution someone would be interested in (besides me)?
* Is there any work in progress on a similar feature?

Cheers, Javier

> DAGs per dataset instead of per date
> ------------------------------------
>
>                 Key: AIRFLOW-2480
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2480
>             Project: Apache Airflow
>          Issue Type: New Feature
>            Reporter: Javier Domingo Cansino
>            Priority: Major

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
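
The per-dataset semantics requested in this issue (one run per dataset, new datasets as the only trigger, backfill of one DAG across all samples) might be sketched roughly as below. This is plain Python for illustration only, not Airflow code: `DatasetRunRegistry`, its methods, and the DAG and sample names are all hypothetical and do not correspond to any existing Airflow API.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRunRegistry:
    """Hypothetical tracker: runs are keyed by (dag_id, dataset_id), never by date."""

    dag_ids: set = field(default_factory=set)
    dataset_ids: set = field(default_factory=set)
    completed: set = field(default_factory=set)  # {(dag_id, dataset_id)}

    def add_dataset(self, dataset_id):
        """Register a dataset; return the runs its arrival should trigger."""
        self.dataset_ids.add(dataset_id)
        return self._pending(self.dag_ids, {dataset_id})

    def backfill(self, dag_id):
        """Register a DAG; return the runs needed to cover every known dataset."""
        self.dag_ids.add(dag_id)
        return self._pending({dag_id}, self.dataset_ids)

    def mark_done(self, dag_id, dataset_id):
        """Record a finished run so it is never scheduled again."""
        self.completed.add((dag_id, dataset_id))

    def _pending(self, dag_ids, dataset_ids):
        # Only pairs that have not already completed are returned.
        return sorted(
            (d, s) for d in dag_ids for s in dataset_ids
            if (d, s) not in self.completed
        )


# Example with made-up DAG and sample names.
registry = DatasetRunRegistry()
registry.backfill("align_reads")            # no datasets yet, nothing to run
runs = registry.add_dataset("sample_001")   # [("align_reads", "sample_001")]
for dag_id, dataset_id in runs:
    registry.mark_done(dag_id, dataset_id)
```

Under these assumptions, re-adding `sample_001` triggers nothing, while backfilling a second DAG produces exactly one run per already-known sample.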