vandonr-amz opened a new pull request, #30259:
URL: https://github.com/apache/airflow/pull/30259

   Many users call `Variable.get` with reasonable defaults for a bunch of 
configuration options in their DAGs, just so they can override them dynamically 
if needed.
   Because those variables are not set, we have to look in the custom backend 
(usually an HTTP request), the environment variables (that particular step is 
not a problem), and the metastore (a DB call) each time.
   Variables are often fetched at the top of DAG files, so those calls are made 
on every DAG parsing pass, slowing down the whole process.
   
   Furthermore, accessing remote secret managers is rarely free, and having a 
call made for each variable every time a dag file is parsed can yield high (and 
unexpected) API charges for cloud computing users.
   [AWS secret manager](https://aws.amazon.com/secrets-manager/pricing/): $0.05 
per 10,000 API calls
   [GCP secret manager](https://cloud.google.com/secret-manager/pricing): $0.03 
per 10,000 operations
   [Azure key 
vault](https://azure.microsoft.com/en-ca/pricing/details/key-vault/): 
$0.03/10,000 transactions
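   As a back-of-envelope illustration (the numbers below are hypothetical, not 
taken from any real deployment), even a modest repository can rack up a 
noticeable bill when every parse pass hits the secrets backend:

```python
# Illustrative estimate of uncached secret-backend calls per month.
# All inputs are made-up example values.
dag_files = 200          # DAG files in the repository
vars_per_file = 5        # Variable.get calls at the top of each file
parses_per_hour = 120    # each file reparsed roughly every 30 seconds
hours_per_month = 30 * 24

calls_per_month = dag_files * vars_per_file * parses_per_hour * hours_per_month
# AWS Secrets Manager pricing from above: $0.05 per 10,000 API calls
monthly_cost = calls_per_month * 0.05 / 10_000

print(calls_per_month)        # 86,400,000 calls
print(round(monthly_cost, 2)) # $432.00
```

   A 15-minute cache cuts that by roughly the ratio of the parse interval to 
the TTL, i.e. well over an order of magnitude in this example.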
   
   An alternative would be to set up a cache in each specific secrets backend, 
but that multiplies the work required. And since the custom backend is always 
the first place that is looked at, the behavior would be the same.
   
   I'm adding this cache, enabled by default and with a default TTL of 15 
minutes, which seems like a good compromise between reducing the number of 
calls and still reacting to changes reasonably quickly.
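   The TTL behavior can be sketched like this (names are hypothetical, not the 
PR's actual code): a value is reused until its age exceeds the TTL, after which 
the next lookup falls through to the real backends again.

```python
import time

TTL_SECONDS = 15 * 60  # default TTL proposed here: 15 minutes
_cache: dict = {}      # key -> (timestamp, value)

def get_variable_cached(key, fetch, ttl=TTL_SECONDS):
    """Return the cached value for `key`; call `fetch` only on a miss or
    when the cached entry is older than `ttl` seconds."""
    now = time.monotonic()
    entry = _cache.get(key)
    if entry is not None and now - entry[0] < ttl:
        return entry[1]  # fresh enough: no backend call
    value = fetch(key)   # expensive path: secrets backend / env / metastore
    _cache[key] = (now, value)
    return value
```

   Repeated calls within the TTL hit the dict instead of the backend, which is 
what makes top-of-file `Variable.get` calls cheap during parsing.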
   
   Since most Variable fetching happens in isolated processes during DAG 
parsing, I'm using a 
[Manager](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Manager)
 to have a synchronized cache, handled under the hood in its own process.
   I tried to keep the impact on the codebase as low as possible. It only 
requires an init call before the DAG-parsing processes are forked, to make sure 
the cache has been initialized in the parent process and will be accessible 
from the children via the copied memory.
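   The pattern looks roughly like this (a minimal sketch with hypothetical 
names, not the PR's actual code): the `Manager` dict lives in its own server 
process, and any worker forked after the init call holds a proxy to the same 
shared dict.

```python
import multiprocessing

_cache = None

def init_cache():
    """Call once in the parent, before forking DAG-parsing workers.
    The returned dict is a proxy to a dict living in the Manager's
    own server process, so all children see the same data."""
    global _cache
    manager = multiprocessing.Manager()
    _cache = manager.dict()
    return _cache

def worker(cache, key, value):
    # Writes through the proxy are visible to the parent and siblings.
    cache[key] = value

if __name__ == "__main__":
    cache = init_cache()
    p = multiprocessing.Process(target=worker, args=(cache, "my_var", "42"))
    p.start()
    p.join()
    print(cache.get("my_var"))  # value written by the child is visible here
```

   The cost is one extra IPC round-trip per access, but that is negligible 
next to an HTTP call to a remote secrets manager.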
   
   For now, the cache has no size bound and no eviction. If that turns out to 
be a problem, we can add a cleaning step, for instance just before DAG parsing, 
where I inserted the init() call.

