On top of that, we can expire the cache after a few scheduler run intervals (say, 5 or 10 times the duration of one scheduler loop).
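Roughly what I have in mind is below. This is only a sketch, assuming a local Redis instance and the redis Python client; the cached_variable helper name and the TTL value are just for illustration:

import redis
from airflow.models import Variable

# Redis used as a cache outside the Airflow metadata DB (assumption: local instance).
_cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Expire entries after roughly 5 scheduler loops, e.g. 5 x a ~60s loop.
CACHE_TTL_SECONDS = 5 * 60

def cached_variable(key, default=None):
    """Return an Airflow Variable, hitting the metadata DB only on a cache miss."""
    value = _cache.get(key)
    if value is not None:
        return value
    value = Variable.get(key, default_var=default)
    if value is not None:
        _cache.setex(key, CACHE_TTL_SECONDS, value)
    return value

DAG files would then call cached_variable('de_infra_email') instead of Variable.get('de_infra_email'), so most parse cycles never touch MySQL.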
On Mon 22 Oct, 2018, 16:27 Sai Phanindhra, <phani8...@gmail.com> wrote:

> That's true. But variables won't change very frequently. We can cache these
> variables somewhere outside the Airflow ecosystem, something like Redis or
> memcache. As queries to these stores are fast, we can reduce the latency and
> decrease the number of connections to the main database. This whole
> assumption needs to be benchmarked to prove the point. I feel it's worth a
> try.
>
> On Mon 22 Oct, 2018, 15:47 Ash Berlin-Taylor, <a...@apache.org> wrote:
>
>> Cache them where? When would it get invalidated? Given that the DAG parsing
>> happens in a sub-process, how would the cache live longer than that process?
>>
>> I think the change might be to use a per-process/per-thread SQLA
>> connection when parsing DAGs, so that if a DAG needs access to the metadata
>> DB it does it with just one connection rather than N.
>>
>> -ash
>>
>> > On 22 Oct 2018, at 11:11, Sai Phanindhra <phani8...@gmail.com> wrote:
>> >
>> > Why don't we cache variables? We can fairly assume that variables won't
>> > get changed very frequently (not as frequently as the scheduler DAG run
>> > interval). We can keep the default timeout at a few times the scheduler
>> > run time. This will help control the number of connections to the
>> > database and reduce the load on both the scheduler and the database.
>> >
>> > On Mon 22 Oct, 2018, 13:34 Marcin Szymański, <ms32...@gmail.com> wrote:
>> >
>> >> Hi
>> >>
>> >> You are right, it's a sure way to saturate DB connections, as a
>> >> connection is established every few seconds when the DAGs are parsed.
>> >> The same happens when you use variables in __init__ of an operator. An
>> >> OS environment variable would be safer for your need.
>> >>
>> >> Marcin
>> >>
>> >> On Mon, 22 Oct 2018, 08:34 Pramiti Goel, <pramitigoe...@gmail.com> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> We want to make the owner and email ID general, so we don't want to put
>> >>> them in the Airflow DAG. Using variables will help us change the
>> >>> email/owner later if there are a lot of DAGs with the same owner.
>> >>>
>> >>> For example:
>> >>>
>> >>> default_args = {
>> >>>     'owner': Variable.get('test_owner_de'),
>> >>>     'depends_on_past': False,
>> >>>     'start_date': datetime(2018, 10, 17),
>> >>>     'email': Variable.get('de_infra_email'),
>> >>>     'email_on_failure': True,
>> >>>     'email_on_retry': True,
>> >>>     'retries': 2,
>> >>>     'retry_delay': timedelta(minutes=1)}
>> >>>
>> >>> Looking into the code of Airflow, it creates a connection session every
>> >>> time the variable is fetched, and then closes it (let me know if I
>> >>> understand this wrong). If there are many DAGs with variables in
>> >>> default_args running in parallel, all querying the variable table in
>> >>> MySQL, will there be any sort of limitation on the number of SQLAlchemy
>> >>> sessions? Will that make the DAGs slow, as there will be many queries to
>> >>> MySQL for each DAG? Is the above approach good?
>> >>>
>> >>>> using Airflow 1.9
>> >>>
>> >>> Thanks,
>> >>> Pramiti.
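PS: the OS-environment-variable alternative Marcin mentions above would look roughly like this for Pramiti's example; it avoids any metadata-DB connection while the DAG file is parsed (the TEST_OWNER_DE / DE_INFRA_EMAIL variable names are just placeholders):

import os
from datetime import datetime, timedelta

default_args = {
    'owner': os.environ.get('TEST_OWNER_DE', 'airflow'),  # placeholder env var name
    'depends_on_past': False,
    'start_date': datetime(2018, 10, 17),
    'email': os.environ.get('DE_INFRA_EMAIL'),  # placeholder env var name
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 2,
    'retry_delay': timedelta(minutes=1)}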