john-jac commented on PR #30259: URL: https://github.com/apache/airflow/pull/30259#issuecomment-1520841606
Hi folks, I'd like to weigh in here.

Airflow treats all secrets backends the same, and runs every single connection, variable, and configuration lookup through them every time a value is needed. However, the impact on users is not the same for all backends. Some backends introduce latency, some incur a cost per API call, and some, like Secrets Manager, do both.

As a user, I want to control how often secrets are retrieved from the source. The rest of the time I expect Airflow to reuse the value it retrieved a few minutes, or even seconds, earlier. It's not as simple as improving DAGs, because many of those calls happen outside of a user's control. That is why a cache is so important: it is key for users to improve performance and reduce costs. It should not cover just variables, but connections too.

Take the following example: I use Snowflake in my data lake and store the credentials in Secrets Manager. I have 2,000 tables that update hourly from Airflow, each with a Snowflake operator. That is 24 x 2,000 x 30 = 1.44 million Secrets Manager calls per month. At [$0.05/10,000 calls](https://aws.amazon.com/secrets-manager/pricing/), that's an extra monthly charge of $7.20 (144 blocks of 10,000 calls x $0.05) to pull the same connection over and over again. And that's with only 2,000 tasks per hour; lots of users have far greater usage than that.

A bit of code to cache that data will reduce customer cost and improve performance.
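To make the caching idea concrete, here is a minimal sketch of a time-based (TTL) cache in front of a secrets lookup. It is not the implementation in this PR and not an Airflow API: `TTLSecretCache`, its `fetch` callable, and the `ttl_seconds` and `clock` parameters are all hypothetical names chosen for illustration. The assumption is only that the backend can be wrapped as a function from a key to a value.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLSecretCache:
    """Hypothetical helper (not part of Airflow): reuse a fetched secret
    for ttl_seconds before asking the backend again."""

    def __init__(
        self,
        fetch: Callable[[str], Optional[str]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # the slow or billable backend call
        self._ttl = ttl_seconds      # how long a cached value stays valid
        self._clock = clock          # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[float, Optional[str]]] = {}
        self.backend_calls = 0       # counts real backend lookups

    def get(self, key: str) -> Optional[str]:
        now = self._clock()
        entry = self._entries.get(key)
        # Serve from cache while the entry is younger than the TTL.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        # Otherwise go to the backend and remember when we fetched.
        value = self._fetch(key)
        self.backend_calls += 1
        self._entries[key] = (now, value)
        return value
```

With a wrapper like this, the number of backend calls per key is bounded by the elapsed time divided by the TTL rather than by the number of tasks that need the connection. A real implementation would also need to consider sharing the cache across processes and how quickly rotated secrets must be picked up, which this sketch ignores.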
