shubham22 commented on PR #30259: URL: https://github.com/apache/airflow/pull/30259#issuecomment-1581686531
Hi folks - I would like to revive this discussion by sharing the broader vision that this PR is leading us towards. I also want to lay out the potential trade-offs to support a more balanced and thorough conversation.

### Improving Secrets handling in Apache Airflow

#### Issues with current Secrets handling in Airflow

1. **High Cost of Secret Backends:** When users rely heavily on secrets and use secret backends, costs can escalate rapidly, accounting for about 30-40% of Airflow infrastructure expenses. For perspective, Secrets Manager receives hundreds of millions of requests per day to retrieve 'aws_default' from Airflow users (can't share specifics due to confidentiality).
2. **Impact on DAG Performance:** Each time a DAG executes (and each time it is parsed, if secrets are accessed in top-level Python code), Airflow retrieves the secrets anew, which increases DAG execution (and parsing) latency.
3. **Strain on Metadata DB:** Users not using secret backends often overload their Metadata DB due to frequent secret retrievals.

#### Long-term Goal

Mitigate the performance and cost impacts caused by secret usage on Apache Airflow as an orchestration platform. Users seek low latency in their workflow execution at the lowest possible cost, and secret handling is one area that isn’t optimized.

#### Potential proposal

Implement a caching mechanism for secrets to optimize the retrieval of variables and connections.

#### Why Secrets caching?

Caching is a well-established method for improving system performance: the result of an expensive operation is stored and reused when the same operation is requested again. Applying this to secrets handling in Airflow could substantially reduce the number of calls made to retrieve secrets, especially when the same secret is used by multiple tasks or retrieved multiple times during the execution of a DAG.
This PR aims to lay the groundwork for caching secrets in Airflow, a substantial step towards addressing the performance impact of secrets handling. As with any strategy, there are trade-offs, and I’ve laid them out below for a balanced discussion:

#### PROS

1. **Performance Improvement:** By reducing the number of calls to retrieve secrets, particularly when using a secret backend with high latency, we can achieve notable performance gains.
2. **General Applicability:** The performance benefits of caching are not limited to users who do not adhere to best practices; all users can benefit, albeit to varying degrees.
3. **Incremental Implementation:** Even if the full benefits of caching are not immediately clear, we can begin with a smaller, incremental improvement and expand on it over time, as this PR does.

#### CONS

1. **Cache Invalidation Difficulty:** Caching introduces the challenge of knowing when to update or invalidate the cache. However, while cache invalidation is a complex problem, it is a well-studied area with established best practices and patterns we can follow.
2. **Masking DAG Parsing Inefficiencies:** Caching could conceal underlying issues with DAG parsing. However, this proposal advocates for caching as part of a holistic performance improvement strategy, not as a standalone solution. DAG parsing inefficiencies should be addressed in parallel. Additionally, we plan to add relevant metrics, such as the number of cache hits and misses, to give operators visibility into how users are using caching.

In conclusion, while secrets caching poses certain challenges, none of them is insurmountable. By incorporating caching as part of a broader strategy to enhance performance and reduce calls to retrieve secrets, we can take a significant step forward in improving the efficiency and scalability of Apache Airflow.
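On the invalidation and metrics points, a small sketch of what hit/miss counting and explicit invalidation could look like. Everything here is an assumption made for illustration: `CountingCache`, its `invalidate` method, and the plain integer counters are hypothetical and stand in for whatever metrics and invalidation hooks the actual implementation ends up with.

```python
from typing import Callable, Dict


class CountingCache:
    """Hypothetical example: tracks hits/misses and drops entries on request."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._values: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        if key in self._values:
            self.hits += 1
            return self._values[key]
        # Not cached: count a miss, fetch from the source, and store the result.
        self.misses += 1
        value = self._fetch(key)
        self._values[key] = value
        return value

    def invalidate(self, key: str) -> None:
        # Drop the entry so the next get() goes back to the source.
        self._values.pop(key, None)


# A dict standing in for the real source of truth (secrets backend or DB).
store = {"my_var": "v1"}
cache = CountingCache(lambda key: store[key])

first = cache.get("my_var")    # miss: fetched from the store
second = cache.get("my_var")   # hit: served from the cache

# When the value is written, the writer invalidates the cached entry.
store["my_var"] = "v2"
cache.invalidate("my_var")
third = cache.get("my_var")    # miss again: picks up the new value
```

Invalidating on write only helps when the write goes through code that knows about the cache; changes made directly in an external backend would still need a TTL or similar expiry.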
cc: @vandonr-amz @john-jac @eladkal @potiuk - I would appreciate it if you could review the overarching strategy and share your thoughts on whether we can proceed with this PR, keeping the aforementioned goal in mind.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
