Re: Task-level scheduling

Daniel Standish Sun, 07 Nov 2021 18:37:06 -0800

typo:
>
> I'm a bit skeptical that there would be a good interface for like a
> task-level-timetable-override if that's where you're going -- like a way
> for each schedule *task *to have its own timetable override that's
> processed within the dag run.
>


On Sun, Nov 7, 2021 at 6:34 PM Daniel Standish <[email protected]> wrote:

> A bit confused about what you're proposing Malthe.
>
> The thread subject is "task-level scheduling" but it says "there is no
> such interface to control
> task-level scheduling – or more specifically, the ability to control which
> DAG runs to skip."
>
> This makes it sound like you're talking about being able to have logic to
> skip a dag.
>
> If we're really talking about skipping dag runs, I suppose it depends what
> the skip logic is.   If it's date based you could use the timetable
> interface to do so, otherwise yeah it would have to be done from a task.  I
> don't think of a maybe-skip task as causing that much overhead.
>
> But maybe you're talking about having skip logic within each task that
> isn't managed by a branch operator or similar.
>
> It's fairly easy to achieve this already with a subclass.
>
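> from airflow.exceptions import AirflowSkipException
> from airflow.operators.bash import BashOperator
>
> class SkippyBashOperator(BashOperator):
>     def execute(self, context):
>         # certain_conditions is a placeholder for your own skip logic
>         if certain_conditions:
>             raise AirflowSkipException("I skipped")
>         return super().execute(context)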
>
> re
>
> It should be evident from the task execution details why the task was
>> skipped – the interface should provide the necessary string
>> representation functionality.
>>
>
> I think that being able to put a `reason` in AirflowSkipException and have
> that be somehow visible in UI could be good. Also, being able to specify
> `color` could also be nice.
>
> re
>
>> Main problem of such an approach (currently) is that there is no
>> visual indication that a specific task was actually "skipped" rather
>> than "executed"
>>
>
> The current UI _does_ already indicate when a task is skipped -- it's a
> light pink color.  But maybe I misunderstand something.
>
> I'm a bit skeptical that there would be a good interface for like a
> task-level-timetable-override if that's where you're going -- like a way
> for each schedule to have its own timetable override that's processed
> within the dag run.
>
> Or are you proposing that we allow tasks to exist and be schedulable
> without being associated with a dag?
>
>
> On Sun, Nov 7, 2021 at 2:07 PM Jarek Potiuk <[email protected]> wrote:
>
>> I have mixed feelings about that one.
>>
>> I understand that sometimes you would like to skip processing of a
>> single task using schedule "criteria", but IMHO that creates some
>> ambiguities and the necessity of more complex logic for the remainder
>> of DAGs. One thing is that you cannot really specify "output" for such
>> a task. If it is skipped then it does not produce any output, so the
>> logic of skipping previous tasks should be included in the following
>> tasks.
>>
>> You really need to say "if another task was skipped, don't use its
>> output". Or similar. Another variant of this is "if the output does
>> not exist", but that's a bit of implicit behaviour and does not play
>> well with - some edge cases - for example - adding such conditional
>> skip and back-filling history. If you base it on the existence of
>> output backfilling will not work.
>>
>> So if you have tasks that pass any outputs, this approach might be
>> quite problematic IMHO.
>>
>> Somehow I have a feeling that it's much easier to do this kind of
>> skipping using either the branch operator mentioned by Ash (then the
>> output might be either "prepared" or "empty" depending on the branch)
>> or even have custom PythonOperator/@task callable where you produce
>> either "prepared" or "empty" output based on some time/cron logic.
>> That somehow feels much more consistent and much more flexible, as you
>> can base your decision on how you are running the task based on more
>> criteria.
>>
>> Main problem of such an approach (currently) is that there is no
>> visual indication that a specific task was actually "skipped" rather
>> than "executed". This is true, but I see it (and this is a much
>> more generic approach) as an opportunity here to add such an indicator
>> of task execution "flavour".
>>
>> When a task was executed it could be executed this or that way and
>> mark its status as such. This could be an icon, border, text color (I
>> guess Brent can come up with some ideas here) - some way to indicate that
>> the task in this "dagrun" was run differently than in "yesterday's"
>> one. And it would not be limited to a "time schedule" difference only.
>> It could be based on much more complex and different criteria. If we
>> actually "execute" such a task rather than just "skip/run" we have the
>> powers of custom Python code working for us.
>>
>> I can imagine very different "flavours" of execution:
>>
>> * based on time of day/week etc. (your case)
>> * based on amount of data to process
>> * based on number of errors/warnings encountered during processing
>> * based on type of data seen
>>
>> Also the problem with "task-based schedule" is that, because the
>> scheduler cannot run any custom DAG-writer provided code, the
>> flexibility of timeline logic is limited to whatever has been
>> installed as a plugin by the admin. If we assume that "task execution"
>> actually happens for such "conditional execution tasks", then we can
>> run a code which has been written by the DAG writer - which adds to
>> the flexibility of task execution logic and this flexibility is
>> infinite.
>>
>> I feel that adding a timeline-only "flavour" of a run is very limiting
>> and we can do better.
>>
>> But I am happy to discuss it still.
>>
>> J.
>>
>> On Thu, Nov 4, 2021 at 12:04 PM Ash Berlin-Taylor <[email protected]> wrote:
>> >
>> > For context, the reason Malthe is proposing something like this, and
>> doesn't want to use the "existing" approach of a BranchOperator or similar
>> is optimization: Having to spin up a task to make a decision is, in many
>> cases, not necessary and the scheduler could make this decision quickly.
>> >
>> > (This is along similar lines to why we no longer schedule or actually
>> run DummyOperator but just mark it as success directly in the scheduler.)
>> >
>> > AIP-39 is a little unclear on how the new "logical_date" value changes
>> with the different timetable implementations or if it's simply used
>> internally for sorting purposes and not meaningful on its own. For this
>> proposal to work, there has to be a well-defined "execution date" that we
>> can compare against.
>> >
>> >
>> >
>> > data_interval_start and/or data_interval_end are the dates you should
>> use for such a purpose
>> >
>> > Please don't use the term execution date -- it is too overloaded and
>> confusing.
>> >
>> > -ash
>> >
>> >
>> > On Mon, Oct 18 2021 at 21:17:22 +0000, Malthe <[email protected]> wrote:
>> >
>> > While AIP-39 provides an interface for more powerful pluggable
>> scheduling behaviours, there is no such interface to control task-level
>> scheduling – or more specifically, the ability to control which DAG runs to
>> skip.
>> >
>> > Examples:
>> >
>> > - Skip task execution on certain days
>> > - Skip task execution on certain hours which could vary from day to day
>> >
>> > Whether or not child tasks would be affected by such task scheduling
>> depends on the trigger rule configured on those tasks (e.g. "all_success",
>> "all_done").
>> >
>> > The interface might consist of both an include and exclude expression –
>> by default all executions would be included and none excluded. In both
>> cases, the scheduling could be a cron expression but the interface should
>> again support more powerful behaviors.
>> >
>> > It should be evident from the task execution details why the task was
>> skipped – the interface should provide the necessary string representation
>> functionality.
>> >
>> > AIP-39 is a little unclear on how the new "logical_date" value changes
>> with the different timetable implementations or if it's simply used
>> internally for sorting purposes and not meaningful on its own. For this
>> proposal to work, there has to be a well-defined "execution date" that we
>> can compare against.
>>
>
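
For illustration only, here is a rough sketch of the include/exclude idea from Malthe's original message, keyed on data_interval_start as Ash suggests. Nothing here is an Airflow API: skip_reason, the include/exclude callables and is_weekend are hypothetical names made up for this example, and plain predicates stand in for the cron expressions mentioned in the proposal. By default every run is included and none is excluded, and the return value doubles as the human-readable reason for the skip.

```python
from datetime import datetime
from typing import Callable, Optional

# Hypothetical type: a rule receives the run's data_interval_start.
Predicate = Callable[[datetime], bool]


def skip_reason(
    data_interval_start: datetime,
    include: Optional[Predicate] = None,
    exclude: Optional[Predicate] = None,
) -> Optional[str]:
    """Return a reason string if the task should be skipped, else None.

    With no rules given, every run is included and none is excluded.
    """
    stamp = data_interval_start.isoformat()
    if include is not None and not include(data_interval_start):
        return f"{stamp} is not matched by the include rule"
    if exclude is not None and exclude(data_interval_start):
        return f"{stamp} is matched by the exclude rule"
    return None


def is_weekend(start: datetime) -> bool:
    """Example rule: Saturday (5) and Sunday (6)."""
    return start.weekday() >= 5


# 2021-11-06 was a Saturday, 2021-11-08 a Monday.
print(skip_reason(datetime(2021, 11, 6), exclude=is_weekend))
print(skip_reason(datetime(2021, 11, 8), exclude=is_weekend))
```

Inside a task, a non-None result could be passed as the message of AirflowSkipException, which is roughly what the subclass approach above already allows today. Whether the scheduler could evaluate such a rule without spinning up a task is the open question in this thread.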
