[jira] [Updated] (AIRFLOW-2480) DAGs per dataset instead of per date

Javier Domingo Cansino (JIRA) Thu, 17 May 2018 04:26:04 -0700

     [ https://issues.apache.org/jira/browse/AIRFLOW-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Javier Domingo Cansino updated AIRFLOW-2480:
--------------------------------------------
    Description: 
Currently Airflow runs on a date basis. All the scheduling and running logic 
assumes that ETLs depend on the date they are run. However, there is another 
set of use cases where it is not the date that varies, but the dataset itself.

One example application is processing genomic data. This data doesn't 
change, and the use case is to run all the DAGs you may have on samples, rather 
than on dates. The same applies when one has services that rely on performing 
a set of operations on a dataset once.

For now, one way to solve this is by creating a DAG per user, scheduling it 
with None, and triggering it manually from the UI/CLI. However, this has the 
drawback that there is only one column in the dates, as new datasets will just 
create new DAGs.

Of course, backfill processes would be applied to run a specific DAG on all 
the samples, rather than on just a specific one.

The features of such a system would be as follows:

 * Dates are irrelevant: different dates will produce the same output for the 
same dataset, so only one run per dataset is required

 * Date-based scheduling is irrelevant: the addition of new datasets is the 
only thing that would trigger new DAGRuns
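
To make the two features above (and the backfill idea) concrete, here is a 
minimal sketch in plain Python. It is not Airflow code and does not reflect how 
the Airflow scheduler works; DatasetScheduler and its add_dag, add_dataset and 
backfill methods are hypothetical names used only for illustration. The point 
is that a run is keyed by (DAG, dataset) instead of (DAG, date).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

RunKey = Tuple[str, str]  # (dag_id, dataset_id); no date involved


@dataclass
class DatasetScheduler:
    """Hypothetical toy scheduler: one run per (DAG, dataset) pair."""

    dags: Dict[str, Callable[[str], None]] = field(default_factory=dict)
    datasets: List[str] = field(default_factory=list)
    completed: Set[RunKey] = field(default_factory=set)

    def _run_once(self, dag_id: str, dataset_id: str) -> bool:
        """Run the DAG on the dataset unless that pair has already run."""
        key = (dag_id, dataset_id)
        if key in self.completed:
            return False
        self.dags[dag_id](dataset_id)
        self.completed.add(key)
        return True

    def add_dag(self, dag_id: str, fn: Callable[[str], None]) -> None:
        """Register a DAG; registering alone does not trigger any run."""
        self.dags[dag_id] = fn

    def add_dataset(self, dataset_id: str) -> List[RunKey]:
        """A new dataset is the only event that triggers runs on its own."""
        if dataset_id not in self.datasets:
            self.datasets.append(dataset_id)
        return [(d, dataset_id) for d in self.dags
                if self._run_once(d, dataset_id)]

    def backfill(self, dag_id: str) -> List[RunKey]:
        """Run one specific DAG on every known dataset it has not seen."""
        return [(dag_id, s) for s in self.datasets
                if self._run_once(dag_id, s)]


# Usage with made-up DAG and sample names.
scheduler = DatasetScheduler()
scheduler.add_dag("align_reads", lambda s: print("align_reads on", s))
scheduler.add_dataset("sample_A")   # triggers align_reads on sample_A
scheduler.add_dataset("sample_A")   # nothing new, so nothing runs
scheduler.add_dag("call_variants", lambda s: print("call_variants on", s))
scheduler.backfill("call_variants")  # runs call_variants on sample_A
```

In this sketch, adding the same dataset twice does nothing the second time, 
and backfill only fills in the (DAG, dataset) pairs that are still missing.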

There are a few questions I would like to ask:

 * How tightly coupled is the current design of the scheduler/executors in 
Airflow to dates?

 * Is this a contribution someone would be interested in (besides me)?

 * Is there any work in progress on a similar feature?

 

Cheers, Javier

  was:
Currently Airflow runs on a date basis. All the scheduling and running logic 
assumes that ETLs depend on the date they are run. However, there is another 
set of use cases where it is not the date that varies, but the dataset itself.

One example application is processing genomic data. This data doesn't 
change, and the use case is to run all the DAGs you may have on samples, rather 
than on dates. The same applies when one has services that rely on performing 
a set of operations on a dataset once.

For now, one way to solve this is by creating a DAG per user, scheduling it 
with None, and triggering it manually from the UI/CLI. However, this has the 
drawback that there is only one column in the dates, as new datasets will just 
create new DAGs.

Of course, backfill processes would be applied to run a specific DAG on all 
the samples, rather than on just a specific one.

The features of such a system would be as follows:

 * Dates are irrelevant: different dates will produce the same output for the 
same dataset, so only one run per dataset is required

 * Date based scheduling is irrelevant, and addition of new

There are a few questions I would like to ask:

 * How tightly coupled is the current design of the scheduler/executors in 
Airflow to dates?

 * Is this a contribution someone would be interested in (besides me)?

 * Is there any work in progress on a similar feature?

 

Cheers, Javier


> DAGs per dataset instead of per date
> ------------------------------------
>
>                 Key: AIRFLOW-2480
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2480
>             Project: Apache Airflow
>          Issue Type: New Feature
>            Reporter: Javier Domingo Cansino
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
