Yes, clearly this area needs TLC. Thanks for getting the ball rolling.

Max
On Sat, Aug 4, 2018 at 1:58 PM Ash Berlin-Taylor <[email protected]> wrote:

> > On 4 Aug 2018, at 21:25, Bolke de Bruin <[email protected]> wrote:
> >
> > We can just execute "python" just fine. Because it will run in a
> > separate interpreter, no issues will come from sys.modules, as that is
> > not inherited. It will still parse DAGs in a separate process then.
> > Forking (@ash) probably does not work, as that does share sys.modules.
>
> Some sharing of modules was my idea - if we are careful about which
> modules we load, we only load the airflow core pre-fork, and we don't
> parse any DAG pre-fork, then forking and sharing the currently loaded
> modules is a good thing for speed. Think of it like the preload_app
> option in gunicorn, where the master loads the app and then forks the
> workers.
>
> > [snip]
> >
> > I'm writing AIP-2
> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-2+Simplify+process+launching
> > to work this out.
>
> Sounds good. I'm not proposing we try my forking idea yet, and your
> proposal is a definite improvement over where we are now.
>
> > B.
> >
> > Sent from my iPad
> >
> >> On 4 Aug 2018, at 19:40, Ash Berlin-Taylor <[email protected]> wrote:
> >>
> >> Comments inline.
> >>
> >>> On 4 Aug 2018, at 18:28, Maxime Beauchemin <[email protected]> wrote:
> >>>
> >>> Let me confirm I'm understanding this right: we're talking
> >>> specifically about the CeleryExecutor not starting an `airflow run`
> >>> (not --raw) command, but firing up a LocalTaskJob instead? Then we'd
> >>> still have the worker fire up the `airflow run --raw` command?
> >>>
> >>> Seems reasonable. One thing to keep in mind is the fact that shelling
> >>> out guarantees no `sys.modules` caching, which is a real issue for
> >>> slowly changing DAG definitions. That's the reason why we had to
> >>> reboot the scheduler periodically, before it used sub-processes to
> >>> evaluate DAGs. Any code that needs to evaluate a DAG should probably
> >>> be run in a subprocess.
> >>>
> >>> Shelling out also allows for doing things like unix impersonation and
> >>> applying cgroups. This currently happens between `airflow run` and
> >>> `airflow run --raw`. The parent process also does the heartbeat and
> >>> listens for an external kill signal (kill pills).
> >>>
> >>> I think what we want is smarter executors and only one level of bash
> >>> command: the `airflow run --raw`, and ideally the system that fires
> >>> this up is not Airflow itself and cannot be DAG-aware (or it will
> >>> need to get restarted to flush the cache).
> >>
> >> Rather than shelling out to `airflow run`, could we instead fork and
> >> run the CLI code directly? Shelling out involves parsing the config
> >> twice, loading all of the airflow and SQLAlchemy deps twice, etc. I
> >> think this would account for a not-insignificant part of the speed
> >> difference in the unit tests. In the case of impersonation we'd
> >> probably have no option but to exec `airflow`, but most(?) people
> >> don't use that?
> >>
> >> Avoiding the extra parsing penalty and the extra process when we don't
> >> need them might be worth it for the test speed-up alone. And we've
> >> already got impersonation covered in the tests, so we'll know that it
> >> still works.
> >>
> >>> To me that really brings up the whole question of what should be
> >>> handled by the Executor, and what belongs in core Airflow. The
> >>> Executor needs to do more, and Airflow core less.
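For concreteness, here is a minimal sketch of the preload-then-fork idea discussed above, in the spirit of gunicorn's preload_app: the parent imports only the heavy, DAG-free core once, and each forked child parses the DAG and runs the task in its own copy of sys.modules. The run_task_instance() helper is hypothetical, standing in for whatever the `airflow run --raw` code path does; this is not Airflow's actual implementation.

    import os

    # Assumption: the heavy, DAG-free core imports that every child can
    # safely share (settings, ORM models, etc.) would go here. Deliberately
    # no DAG file is parsed before the fork.
    # import airflow.settings

    def run_task_instance(dag_file, task_id):
        """Hypothetical stand-in for the `airflow run --raw` code path."""
        # DAG parsing happens here, post-fork, so a stale module cache from
        # one task never leaks into the next one.
        print(f"[{os.getpid()}] parsing {dag_file}, running {task_id}")

    def launch_forked(dag_file, task_id):
        """Preload-then-fork: the child inherits the already-imported core
        modules (cheap), but parses the DAG fresh in its own process."""
        pid = os.fork()
        if pid == 0:  # child
            try:
                run_task_instance(dag_file, task_id)
            finally:
                os._exit(0)  # never return into the parent's loop
        return pid  # parent keeps supervising (heartbeat, kill pill, ...)

    if __name__ == "__main__":
        children = [launch_forked("dags/example.py", f"task_{i}") for i in range(3)]
        for pid in children:
            os.waitpid(pid, 0)

The trade-off Bolke points out is the inverse of this: exec'ing a fresh `python`/`airflow` process pays the import and config-parsing cost every time, but guarantees a clean sys.modules.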
> >> I agree with the sentiment that Core should do less and Executors
> >> more -- many parts of the core are reimplementing what Celery itself
> >> could do.
> >>
> >>> When you think about how this should all work on Kubernetes, it
> >>> looks something like this:
> >>> * the scheduler, through the KubeExecutor, calls the k8s API and
> >>>   tells it to fire up an Airflow task
> >>> * the container boots up and starts an `airflow run --raw` command
> >>> * k8s handles heartbeats, monitors tasks, and knows how to kill a
> >>>   running task
> >>> * the scheduler process (call it the supervisor) talks with k8s
> >>>   through the KubeExecutor and handles zombie cleanup and sending
> >>>   kill pills
> >>>
> >>> Now, because Celery doesn't offer as many guarantees, it gets a bit
> >>> more tricky. Is there even a way to send a kill pill through Celery?
> >>> Are there other ways than using a parent process to accomplish this?
> >>
> >> It does:
> >> http://docs.celeryproject.org/en/latest/userguide/workers.html#revoke-revoking-tasks
> >> (at least it does now)
> >>
> >>> At a higher level, it seems like we need to move more logic from
> >>> core Airflow into the executors. For instance, the heartbeat
> >>> construct should probably be handled 100% by the executor, and not
> >>> be an assumption in the core code base.
> >>>
> >>> I think I drifted a bit, hopefully that's still helpful.
> >>>
> >>> Max
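For reference, the Celery "kill pill" Ash links above is the revoke control command. A rough sketch of how a supervisor could use it (the broker URL and task id are placeholders; the CeleryExecutor would build its app from the Airflow config rather than hard-coding one):

    from celery import Celery

    # Placeholder broker URL, for illustration only.
    app = Celery("airflow_tasks", broker="redis://localhost:6379/0")

    def send_kill_pill(celery_task_id: str) -> None:
        """Ask the worker to stop a task that may already be running.

        A plain revoke only stops a task that has not started yet;
        terminate=True additionally signals the worker process executing
        it, which is the behaviour the thread calls a "kill pill".
        """
        app.control.revoke(celery_task_id, terminate=True, signal="SIGTERM")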
