shubham22 commented on PR #30259:
URL: https://github.com/apache/airflow/pull/30259#issuecomment-1581686531

   Hi folks - I would like to rejuvenate this discussion by sharing the broader 
vision that this PR is leading us towards. Additionally, I intend to present 
the potential trade-offs to facilitate a more balanced and thorough 
conversation.
   
   ### Improving Secrets handling in Apache Airflow
   
   #### Issues with current Secrets handling in Airflow
   1. **High Cost of Secret Backends:** When users heavily rely on secrets and 
utilize secret backends, costs can rapidly escalate, accounting for about 
30-40% of Airflow infrastructure expenses. For perspective, Secrets Manager 
receives hundreds of millions of requests per day to retrieve 'aws_default' from 
Airflow users (can't share specifics due to confidentiality).
   2. **Impact on DAG Performance:** Each time a DAG executes (and each time it 
is parsed, if secrets are accessed in top-level Python code), Airflow retrieves 
the secrets anew, adding latency to DAG execution and parsing.
   3. **Strain on Metadata DB:** Users not utilizing secret backends often 
overload their Metadata DB due to frequent secret retrievals.
   
   #### Long-term Goal
   Mitigate the performance and cost impacts caused by secret usage on Apache 
Airflow as an orchestration platform. Users seek low latency in their workflow 
execution at the lowest possible cost, and secret handling is one area that 
isn’t optimized.
   
   #### Potential proposal
   Implement a caching mechanism for secrets to optimize the retrieval of 
variables and connections.
   
   #### Why Secrets caching?
   Caching is a well-established method for improving the performance of 
systems by storing the results of expensive operations and reusing them when 
the same operation is requested again. Applying this method to secrets handling 
in Airflow could substantially reduce the number of calls made to retrieve 
secrets, especially in cases where the same secret is used by multiple tasks or 
retrieved multiple times during the execution of a DAG.
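
   To make the idea concrete, here is a minimal sketch of a time-bounded 
in-process cache in front of a secret lookup. It is an illustration only, not 
the implementation in this PR: `TTLSecretCache`, `fetch`, `ttl_seconds`, and 
`fake_backend` are hypothetical names, and the backend call is simulated.

```python
import time
from typing import Callable, Dict, Tuple


class TTLSecretCache:
    """Hypothetical cache for illustration; not Airflow's actual code."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # the expensive call (e.g. a secrets backend)
        self._ttl = ttl_seconds      # how long a cached value stays valid
        self._clock = clock          # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]          # still fresh: no backend call
        value = self._fetch(key)     # missing or expired: go to the backend
        self._entries[key] = (value, now + self._ttl)
        return value


# Simulated backend that records how often it is called.
backend_calls = []


def fake_backend(key: str) -> str:
    backend_calls.append(key)
    return f"value-for-{key}"


cache = TTLSecretCache(fake_backend, ttl_seconds=60)
for _ in range(3):
    cache.get("aws_default")
print(len(backend_calls))  # 1 backend call for 3 lookups
```

   The TTL bounds how stale a value can get, which is the simplest answer to 
the invalidation question at the price of some delay before a rotated secret is 
picked up.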
   
   This PR aims to lay the groundwork for the caching of secrets in Airflow, 
thereby providing a substantial step forward in addressing the performance 
impact of secrets handling. As with any strategy, there are trade-offs; I’ve 
laid them out below for a balanced discussion:
   
   #### PROS
   1. **Performance Improvement:** By reducing the number of calls to retrieve 
secrets, particularly when using a secret backend with high latency, we can 
achieve notable performance gains.
   2. **General Applicability:** The performance benefits of caching are not 
limited to users who do not adhere to best practices; all users can benefit, 
albeit to varying degrees.
   3. **Incremental Implementation:** Even if the full benefits of caching are 
not immediately clear, we can begin by implementing a smaller, incremental 
improvement and expand upon it over time, similar to the approach that this PR 
has taken.
   
   #### CONS
   1. **Cache Invalidation Difficulty:** Caching introduces the challenge of 
knowing when to update or invalidate the cache. However, while cache 
invalidation is a complex problem, it is a well-studied area with established 
best practices and patterns we can follow.
   2. **Masking DAG Parsing Inefficiencies:** Caching could potentially conceal 
underlying issues with DAG parsing. However, this proposal advocates for 
caching as part of a holistic performance improvement strategy, not as a 
standalone solution. In parallel, DAG parsing inefficiencies should be 
addressed. Additionally, we plan to add relevant metrics, such as the number of 
cache hits and misses, to provide operators with visibility into how 
users are utilizing caching.
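
   As a rough illustration of the two points above, the sketch below shows a 
cache that counts hits and misses and supports explicit invalidation. The names 
(`CountingSecretCache`, `invalidate`, `hit_ratio`) are hypothetical and are not 
taken from this PR or from Airflow; a real implementation would presumably emit 
these counters through Airflow's metrics system rather than keep them as 
attributes.

```python
from typing import Callable, Dict


class CountingSecretCache:
    """Hypothetical cache with hit/miss counters; for illustration only."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._values: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        if key in self._values:
            self.hits += 1
            return self._values[key]
        self.misses += 1
        value = self._fetch(key)
        self._values[key] = value
        return value

    def invalidate(self, key: str) -> None:
        # Drop one entry so the next get() goes back to the backend.
        self._values.pop(key, None)

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Simulated backend: a plain dict standing in for a secrets store.
store = {"db_password": "old"}
cache = CountingSecretCache(lambda key: store[key])

cache.get("db_password")          # miss: fetched from the store
cache.get("db_password")          # hit: served from the cache
store["db_password"] = "new"      # the secret is rotated in the backend
cache.invalidate("db_password")   # without this, the cache would keep "old"
print(cache.get("db_password"), cache.hits, cache.misses)
```

   A consistently low hit ratio would suggest caching is not helping a given 
deployment, while a very high one on a few keys can point operators to DAGs 
that fetch secrets at parse time.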
   
   In conclusion, while secrets caching poses certain challenges, none of these 
are insurmountable. By incorporating caching as part of a broader strategy to 
enhance performance and reduce calls to retrieve secrets, we can take a 
significant step forward in improving the efficiency and scalability of Apache 
Airflow.
   
   cc: @vandonr-amz @john-jac 
   
   @eladkal @potiuk - I would appreciate it if you could review the overarching 
strategy and share your thoughts on whether we can proceed with this PR, 
keeping the aforementioned goal in mind. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
