> On 4 Aug 2018, at 21:25, Bolke de Bruin <[email protected]> wrote:
>
> We can execute “python” just fine. Because it will run in a separate
> interpreter, no issues will come from sys.modules, as that is not
> inherited. It will still parse DAGs in a separate process then. Forking
> (@ash) probably does not work, as that does share sys.modules.
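A quick illustration of the sys.modules point, using the stdlib `csv`
module as a stand-in for a DAG file (os.fork is Unix-only):

    import os
    import subprocess
    import sys

    import csv  # stand-in for a DAG file / heavy module

    # The parent interpreter now has csv cached in sys.modules.
    print("csv" in sys.modules)  # True

    # A freshly exec'd interpreter starts with a clean module cache, so it
    # always re-reads what is on disk -- which is why a separate "python"
    # process picks up DAG edits that a cached in-process parse would miss.
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print('csv' in sys.modules)"]
    )
    print(out.decode().strip())  # False

    # A fork, by contrast, duplicates the parent, sys.modules included.
    pid = os.fork()
    if pid == 0:
        print("child:", "csv" in sys.modules)  # True, inherited from parent
        os._exit(0)
    os.waitpid(pid, 0)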
Some sharing of modules was my idea - if we are careful about which modules
we load, and we only load the Airflow core pre-fork and don't parse any DAG
pre-fork, then forking with the currently loaded modules shared is a good
thing for speed. Think of it like the preload_app option to a gunicorn
worker, where the master loads the app and then forks (a rough sketch of
this is below).

> [snip]
>
> I’m writing AIP-2
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-2+Simplify+process+launching
> to work this out.

Sounds good. I'm not proposing we try my forking idea yet, and your
proposal is a definite improvement on where we are now.

>
> B.
>
> Sent from my iPad
>
>> On 4 Aug 2018, at 19:40, Ash Berlin-Taylor
>> <[email protected]> wrote:
>>
>> Comments inline.
>>
>>> On 4 Aug 2018, at 18:28, Maxime Beauchemin <[email protected]>
>>> wrote:
>>>
>>> Let me confirm I'm understanding this right: we're talking specifically
>>> about the CeleryExecutor not starting an `airflow run` (not --raw)
>>> command, and firing up a LocalTaskJob instead? Then we'd still have the
>>> worker fire up the `airflow run --raw` command?
>>>
>>> Seems reasonable. One thing to keep in mind is the fact that shelling
>>> out guarantees no `sys.modules` caching, which is a real issue for
>>> slowly changing DAG definitions. That's the reason why we'd have to
>>> reboot the scheduler periodically before it used sub-processes to
>>> evaluate DAGs. Any code that needs to evaluate a DAG should probably be
>>> done in a subprocess.
>>
>>> Shelling out also allows for doing things like unix impersonation and
>>> applying cgroups. This currently happens between `airflow run` and
>>> `airflow run --raw`. The parent process also does the heartbeat and
>>> listens for external kill signals (kill pills).
>>>
>>> I think what we want is smarter executors and only one level of bash
>>> command: the `airflow run --raw`, and ideally the system that fires
>>> this up is not Airflow itself, and cannot be DAG-aware (or it will need
>>> to get restarted to flush the cache).
>>
>> Rather than shelling out to `airflow run`, could we instead fork and run
>> the CLI code directly? Shelling out involves parsing the config twice,
>> loading all of the Airflow and SQLAlchemy deps twice, etc. I think this
>> would account for a not-insignificant speed difference in the unit
>> tests. In the case of impersonation we'd probably have no option but to
>> exec `airflow`, but most(?) people don't use that?
>>
>> Avoiding the extra parsing penalty and process when we don't need them
>> might be worth it for the test speed-up alone. And we've already got
>> impersonation covered in the tests, so we'll know that it still works.
>>
>>> To me that really brings up the whole question of what should be
>>> handled by the Executor, and what belongs in core Airflow. The Executor
>>> needs to do more, and Airflow core less.
>>
>> I agree with the sentiment that core should do less and executors more
>> -- many parts of the core are reimplementing what Celery itself could
>> do.
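To make that preload-then-fork idea concrete, here is a minimal sketch --
`parse_and_run` is a hypothetical stand-in for "parse one DAG file, execute
one task", not a real Airflow API:

    import os

    # Pre-fork: import only the stable framework core here (the airflow
    # package and its deps) -- never any user DAG code, so nothing stale
    # can sit in sys.modules before we fork.

    def parse_and_run(dag_file, task_id):
        # Hypothetical stand-in for "parse this one DAG file, run this
        # one task" (roughly DagBag + task execution in real Airflow).
        print(f"child {os.getpid()}: parsing {dag_file}, running {task_id}")

    def launch_task(dag_file, task_id):
        pid = os.fork()
        if pid == 0:  # child: inherits the preloaded core copy-on-write
            try:
                parse_and_run(dag_file, task_id)
            finally:
                os._exit(0)  # never return into the parent's control flow
        return pid  # parent keeps the pid to heartbeat / monitor / kill

    if __name__ == "__main__":
        os.waitpid(launch_task("dags/example.py", "my_task"), 0)

The child always parses the DAG fresh, so the only sys.modules sharing is
the core we deliberately preloaded -- the same behaviour preload_app gives
gunicorn.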
>>
>>> When you think about how this should all work on Kubernetes, it looks
>>> something like this:
>>> * the scheduler, through the KubeExecutor, calls the k8s API and tells
>>>   it to fire up an Airflow task
>>> * the container boots up and starts an `airflow run --raw` command
>>> * k8s handles heartbeats, monitors tasks, and knows how to kill a
>>>   running task
>>> * the scheduler process (call it the supervisor) talks with k8s through
>>>   the KubeExecutor and handles zombie cleanup and sending kill pills
>>>
>>> Now, because Celery doesn't offer as many guarantees, it gets a bit
>>> more tricky. Is there even a way to send a kill pill through Celery?
>>> Are there other ways than using a parent process to accomplish this?
>>
>> It does:
>> http://docs.celeryproject.org/en/latest/userguide/workers.html#revoke-revoking-tasks
>> (at least it does now)
>>
>>> At a higher level, it seems like we need to move more logic from core
>>> Airflow into the executors. For instance, the heartbeat construct
>>> should probably be 100% handled by the executor, and not an assumption
>>> in the core code base.
>>>
>>> I think I drifted a bit; hopefully that's still helpful.
>>>
>>> Max
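For what it's worth, the k8s flow Max describes could be sketched with the
official kubernetes Python client roughly like this -- image name,
namespace, and the dag/task/date arguments are all placeholders, not a real
KubeExecutor implementation:

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() in-cluster

    # One pod per task instance; the container only runs `airflow run --raw`.
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="example-dag-example-task"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[client.V1Container(
                name="task",
                image="my-airflow:latest",  # placeholder image
                command=["airflow", "run", "--raw",
                         "example_dag", "example_task",
                         "2018-08-04T00:00:00"],
            )],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

    # k8s now owns liveness: the supervisor watches the pod phase, and the
    # kill pill is delete_namespaced_pod() on the same name.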
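And the revoke call from that Celery doc, for reference (broker URL and
task id are placeholders):

    from celery import Celery

    app = Celery(broker="redis://localhost:6379/0")  # placeholder broker

    # terminate=True sends the signal to the worker process currently
    # executing the task -- effectively a kill pill rather than just
    # removing it from the queue.
    app.control.revoke("<task-uuid>", terminate=True, signal="SIGTERM")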
