Two comments here as well (purely technical, rather than about whether we
should do it or not) - maybe to clarify what the proposal really is.

Hussein:

Yeah, I had exactly the same thoughts, and that was even my original
response in the PR:
https://github.com/apache/airflow/pull/30259#issuecomment-1481610945.
Using redis/memcache is possible; it can help a lot with decreasing
pressure on the DB/secrets backend, especially in a multi-node setup.
Perfect alignment here.

However, I think this idea is aiming at something different and solves
a different problem. It's not only about decreasing the pressure on an
external DB/secrets backend but also (and more importantly) about doing
it all in the memory of the instance we are using to handle all the
multi-processing cases (and it builds on the fact that, thanks to Ash's
fork implementation a few years ago, we already handle a lot of
processing in the DAG file processor with multiprocessing/forking). In
a way, I see it as a further optimisation built on top of that change,
which allowed us to use memory and CPU much more effectively there by
replacing launching a separate interpreter for each processing run
with forking (which saves a lot of CPU on initialisation and has the
nice benefit of reusing a lot of memory that is shared following the
copy-on-write model). I see this one as a similar optimization (yes,
it optimises patterns that we currently classify as "bad", but we
could reclassify them as "ok").
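To illustrate the fork/copy-on-write point above - this is just a
standalone sketch, not Airflow code (BIG_TABLE and worker are made-up
names): state built once in the parent is reused by forked children
without re-initialisation, and unmodified pages are physically shared.

```python
# Illustrative sketch only -- not Airflow code. Shows why forking
# (vs launching a fresh interpreter) saves CPU and memory: the child
# inherits the parent's already-initialised objects via copy-on-write.
import multiprocessing as mp

# Expensive state built once in the parent (stand-in for parsed
# config, imported modules, etc.).
BIG_TABLE = {i: str(i) for i in range(100_000)}

def worker(key, queue):
    # The forked child reads BIG_TABLE straight from the parent's
    # memory pages; nothing is re-initialised here.
    queue.put(BIG_TABLE[key])

if __name__ == "__main__":
    ctx = mp.get_context("fork")  # explicit fork start method (POSIX only)
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(42, q))
    p.start()
    print(q.get())  # -> 42
    p.join()
```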

It just makes it much more effective to handle the case where the same
variables are used over and over on the same machine that handles
parsing (or execution - I will come back to that in a moment). It does
not only save on DB/secret calls, but avoids the network communication
altogether in those cases. In a way, it is like comparing L1/L2 caches
in the processor with an L3 cache on the motherboard: L1/L2 (the
proposal from Raphael) is super-fast and small, whereas L3
(memcache/redis) is much slower (orders of magnitude difference) - and
the end result is that there is a lot less communication involved.
Also, the idea of the proposal here is that you do not have to
complicate the deployment by adding another component (or assigning
additional functions to existing components). The "in-memory" solution
will work for everyone, without a single change to their deployment,
whether or not they already run redis/memcache (for example, users of
the K8s executor likely have no redis at all, as they don't need one).
The redis/memcache approach is a more complete solution for sure, but
it solves a different problem (the multi-node case), so I think it's a
bit like comparing apples to pears.

Daniel:

Just to be precise and clarify - it **might** work for Celery, but it
is a bit more risky because we are not fully aware of how Celery uses
multiprocessing/billiard under the hood. It will likely work as long
as we do this:

if celery:
    # billiard is Celery's fork of multiprocessing, patched for
    # Celery's async processing model
    import billiard as mp
else:
    import multiprocessing as mp

And then use mp everywhere.

This seems to solve all the known problems with
multiprocessing/celery. See for example these issues in our repo:
https://github.com/apache/airflow/issues?q=is%3Aissue+billiard+is%3Aclosed+.
Using multiprocessing together with celery is simply not a good idea;
there is a reason why they forked it and created billiard instead
(and optimized the in-memory usage there). Billiard is something we
already use with celery (it is used under the hood, and some
modifications were implemented there to allow celery's async
processing, which makes the two libraries incompatible:
https://github.com/celery/celery/issues/5362#issuecomment-466862879).
So we **could** (given sufficient testing) enable it also for celery.
However, it is quite a bit more risky, because it could clash with the
way celery uses billiard under the hood. (I am guessing here - mostly
out of caution, having read many cases in the past where people had
problems with it.)

J

On Sun, Mar 26, 2023 at 6:53 AM Daniel Standish
<daniel.stand...@astronomer.io.invalid> wrote:
>
> Surprised to hear that it doesn't work with celery. Is that right?  I
> assumed that this was the main target.
>
> If it's really only a benefit in dag processor, it's surprising that it
> provides much benefit because it should be one call per var-file-parse; in
> worker it will be once per ti and I assumed this would be where the heavy
> calls come from. Maybe I miss something.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org
