Nice! Thanks Bolke for being so quick! Arguably, real progress isn't really achieved until it's protected by a unit test, and I'm as guilty as anyone of writing tons of untested features...
The backfill logic has always been surprisingly intricate, and increasingly so as we've added more features. Committers have spoken many times about the end goal being to use the scheduler logic to process backfills, as opposed to having some of that logic duplicated; the backfill then becomes a sort of instance of the scheduler that is limited in scope. Another approach would be to simply get backfill to a good place and treat it as "core", meaning touching it as little as possible, and embrace the logic duplication. I think that's the plan for the time being.

I also want to re-state the requirement of being able to run local backfills on mutated DAGs. By that I mean that someone should be able to branch off in their DAG repo, alter a DAG, and run a local backfill, and have that altered DAG logic applied for the scope of the backfill. There are a lot of "DAG surgeries" that can [only] be accomplished that way, and it's a powerful tool that can be the only way to solve complex problems. Say you wanted to backfill 2016 data with the logic used last May, or you want to alter the SQL for a specific task while you re-run some process for a date range, taking into account a specific data quality bug that should never be seen again. It's really important to allow for that, as Airflow doesn't really allow for having different versions of a DAG over time. Note that currently you can do this locally (`airflow backfill --local`), but there's no way to do it with a non-local (remote) backfill. Better support for versioning, along with this abstraction around "fetching a DAG artifact" and the related "git time machine" idea, could allow for this, where you'd be able to point to a DAG's version (say an arbitrary git ref) for the scope of the [remote] backfill.

Max

On Wed, Apr 19, 2017 at 4:51 AM, Bolke de Bruin <[email protected]> wrote:

> PR is out: https://github.com/apache/incubator-airflow/pull/2247
>
> Includes tests.
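[Editor's note: the "local backfill on a mutated DAG" workflow Max describes could be sketched roughly as follows. The repo path, branch name, DAG id, and dates are hypothetical placeholders; check `airflow backfill --help` on your version for the exact flags.]

```shell
# Branch off in the DAG repo and alter the DAG definition locally.
cd ~/repos/my-dag-repo
git checkout -b backfill-2016-with-may-logic
# ... edit the DAG file (e.g. swap in the SQL you want for this rerun) ...

# Run the backfill locally so that the mutated DAG definition on this
# machine is the one that gets executed, for this date range only:
airflow backfill my_dag --local -s 2016-01-01 -e 2016-12-31
```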
>
> - Bolke
>
> > On 19 Apr 2017, at 05:33, Bolke de Bruin <[email protected]> wrote:
> >
> > Agreed. This is a bug and imho a blocker for 1.8.1.
> >
> > My bad: re-implementation and lack of sufficient unit tests is what is causing this.
> >
> > I'll have a look at this asap.
> >
> > Bolke.
> >
> > Sent from my iPhone
> >
> >> On 19 Apr 2017, at 02:52, Maxime Beauchemin <[email protected]> wrote:
> >>
> >> @Chris this is not the way backfill was designed originally, and personally I'd flag the behavior you describe as a bug.
> >>
> >> To me, backfill should just "fill in the holes", whether the state came from a previous backfill run or the scheduler.
> >>
> >> `airflow backfill` was originally designed to be used in conjunction with `airflow clear` when needed, and together they should allow you to perform whatever "surgery" you may have to do. Clear has a lot of options (from memory) for date ranges, task_id regex matching, only_failures, ... and so does backfill. So first you'd issue one or more clear commands to empty the false positives and [typically] their descendants, or clear the whole DAG if you wanted to rerun the whole thing, thus creating the void for backfill to fill in.
> >>
> >> @committers, has that changed?
> >>
> >> Max
> >>
> >> On Tue, Apr 18, 2017 at 3:53 PM, Paul Zaczkiewicz <[email protected]> wrote:
> >>
> >>> I asked a very similar question last month and got no responses. Note that SubDags execute backfill commands in 1.8.0. The original text of that question is as follows:
> >>>
> >>> I've recently upgraded to 1.8.0 and immediately encountered the hanging SubDag issue that's been mentioned. I'm not sure the rollback from rc5 to rc4 fixed the issue. For now I've removed all SubDags and put their task_instances in the main DAG.
> >>>
> >>> Assuming this issue gets fixed, how is one supposed to recover from failures within SubDags after the number of retries has maxed out? Previously, I would clear the state of the offending tasks and run a backfill job. Backfill jobs in 1.7.1 would skip successful task_instances and only run the task_instances with cleared states. Now, backfills and SubDagOperators clear the state of successful tasks. I'd rather not re-run a task that already succeeded. I tried running backfills with --task_regex and --ignore_dependencies, but that doesn't quite work either.
> >>>
> >>> If I have t1(success) -> t2(clear) -> t3(clear) and I set --task_regex so that it excludes t1, then t2 will run, but t3 will never run because it doesn't wait for t2 to finish. It fails because its upstream dependency condition is not met.
> >>>
> >>> I like the logical grouping that SubDags provide, but I don't want to retry all tasks even if they're successful. I can see why one would want that behavior in some cases, but it's certainly not useful in all of them.
> >>>
> >>> On Tue, Apr 18, 2017 at 6:45 PM, Chris Fei <[email protected]> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> I'm new to Airflow, and I'm looking for someone to clarify the expected behavior of running a backfill with regard to previously successful tasks. When I run a backfill on 1.8.0, tasks that were previously run successfully are re-run for me. Is it expected that backfills re-run all tasks, even those that were marked as successful? For reference, the command I'm running is `airflow backfill -s 2017-04-01 -e 2017-04-03 Tutorial`.
> >>>>
> >>>> I wasn't able to find anything in the documentation to indicate either way.
> >>>> Some brief research revealed that invoking backfill was meant at one point to "fill in the blanks", which I interpret to mean "only run tasks that were not completed successfully". On the contrary, the code *does* seem to explicitly set all task instances for a given DAGRun to SCHEDULED (see [AIRFLOW-910][1] and https://github.com/apache/incubator-airflow/pull/2107/files#diff-54a57ccc2c8e73d12c812798bf79ccb2R1816).
> >>>>
> >>>> Apologies for such a fundamental question, just want to make sure I'm not missing something obvious here. Can someone clarify?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Chris Fei
> >>>>
> >>>> Links:
> >>>>
> >>>> 1. https://issues.apache.org/jira/browse/AIRFLOW-910
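[Editor's note: the "fill in the holes" semantics debated in this thread can be sketched with a few lines of Python. This is an illustrative toy model, not Airflow's actual implementation; the function name and state values are made up for the example.]

```python
# Toy sketch of hole-filling backfill semantics: given the recorded state
# of each task instance, a backfill should only (re)schedule instances
# that are not already successful -- whether they failed, were cleared
# with `airflow clear` (state None here), or never ran at all.

SUCCESS = "success"

def tasks_to_backfill(states):
    """Return the (task_id, execution_date) keys a hole-filling backfill
    would run, i.e. every instance whose state is not SUCCESS."""
    return [key for key, state in sorted(states.items()) if state != SUCCESS]

# Example: one failed instance and one cleared instance get rescheduled;
# the successful ones are left alone.
states = {
    ("t1", "2017-04-01"): "success",
    ("t2", "2017-04-01"): "failed",
    ("t1", "2017-04-02"): "success",
    ("t2", "2017-04-02"): None,  # cleared
}
print(tasks_to_backfill(states))
# -> [('t2', '2017-04-01'), ('t2', '2017-04-02')]
```

The 1.8.0 behavior Chris observed corresponds instead to rescheduling every key in `states`, successful or not, which is the regression the PR above addresses.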

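[Editor's note: the clear-then-backfill "surgery" workflow Max describes upthread might look like the following. The DAG id, task regex, and dates are placeholders, and since Max cites the clear options "from memory", the flag names should be verified against `airflow clear --help` on your version.]

```shell
# 1. Clear the false positives for the date range, matching tasks by
#    regex; -d also clears their downstream (descendant) task instances,
#    creating the "void" for backfill to fill in.
airflow clear my_dag -t 'load_.*' -s 2017-04-01 -e 2017-04-03 -d --only_failed

# 2. Backfill the same window; under the intended "fill in the holes"
#    semantics, only the cleared/non-successful instances get re-run.
airflow backfill my_dag -s 2017-04-01 -e 2017-04-03
```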