ashb commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916396684
Forgive me, it's late and been a long day, so I may not be as lucid as I'd
hope.
My main concern here is about being able to reason about what a DAG will do.
By adding the ability to add arbitrary pre_execute code before any operator in
a DAG we end up in a world where it is very hard to look at a DAG and
understand what it's going to do.
> So in this case, we can't start processing the data before we know it's
> come in (let's assume that this is entirely based on time of day, you can't
> "sense" it).
I dispute the 'you can't "sense" it'. Strongly. And processing based on
timing alone is the worst possible idea -- removing arbitrary time delays
between tasks was one of the main reasons that Airflow has dependencies between
tasks.
The evolution of a data processing workflow often goes:
- Oh, I've only got one thing to run, I can put it on cron
- Oh and a second one, but it's unrelated to the first, I can cron that too
- Now I want to combine those two outputs, it's okay I'll just delay it by
an hour.
That approach will work for months. Right up until you hit an inflection
point (more users, more processing) and then suddenly your entire pipeline is
in an inconsistent state (maybe you combined data from two different days,
and maybe you don't notice it for a month). This is not hyperbole, but lived
experience.
As for "you can't sense it": Either it's a file on disk/s3/blob store, or a
table in a DB, but if you are about to have an operator process it (i.e. read
it or copy it), then you can, by definition, sense if it's there or not.
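To make the "you can sense it" point concrete, here is a minimal stdlib-only sketch of polling for an input before processing it. It is not Airflow code: `wait_for_file`, `process`, and the file name are hypothetical, and `poke_interval`/`timeout` are only named after the similarly named sensor settings. In a real DAG the same idea would normally be a sensor task upstream of the processing task.

```python
import os
import tempfile
import time


def wait_for_file(path, poke_interval=0.1, timeout=2.0):
    """Poll until `path` exists. Return True if it appeared, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval)


def process(path):
    """Hypothetical downstream step: it reads the very file we sensed."""
    with open(path) as f:
        return f.read().upper()


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "daily_export.csv")  # hypothetical input file

    # Nothing has arrived yet, so the check times out instead of guessing.
    missing = wait_for_file(target, timeout=0.3)

    # Simulate the upstream system delivering the data.
    with open(target, "w") as f:
        f.write("a,b\n")

    # Now the same check succeeds and processing can safely run.
    present = wait_for_file(target, timeout=0.3)
    result = process(target) if present else None
```

The check uses the same path the processing step reads, which is the point: if the step can open it, something upstream can test for it, and no time-of-day guess is needed.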
To the "skip expensive operation if dev": I've not seen anyone ask for that
-- read/write to a different bucket in different envs plenty of times, but never
skip an operation entirely based on env (cos if you've skipped one step, you
have to skip the entire "branch" too.) I had a quick search on
https://apache-airflow.slack-archives.org and couldn't find it -- you might
have a better idea what to search for (it uses Postgres full text search so
the stemming might be a bit simplistic).
No, there's nothing I have planned in the AIPs I hinted at in my keynote
(most of them are just ideas anyway at this stage)