Yep. Per DAG file is what I actually meant :)

On Fri, Jun 14, 2024 at 12:26 PM Eugen Kosteev <[email protected]> wrote:

> The thing is that it is "last count per DAG file".
> I do not think we can actually calculate this per DAG. We could split the
> total number of queries by the number of DAGs in the file, but this may be
> confusing.
>
> On Fri, Jun 14, 2024 at 12:24 PM Jarek Potiuk <[email protected]> wrote:
>
>> > the cardinality of those logs is too high.
>>
>> I was thinking about only showing "last count per DAG" - then cardinality
>> would be "good enough" I think. It could also be exposed via metrics now
>> that I think of it - no real need to see it in UI or API.
>>
>> On Fri, Jun 14, 2024 at 12:14 PM Kaxil Naik <[email protected]> wrote:
>>
>>> Yeah, valuable to show it in logs. For showing it in a web server or
>>> storing it in DB, the cardinality of those logs is too high.
>>>
>>> On Fri, 14 Jun 2024 at 11:09, Eugen Kosteev <[email protected]> wrote:
>>>
>>> > Yeah, I also think it is a good idea to expose it in the Airflow UI.
>>> >
>>> > Although, atm we do not have an entity such as DAG file (and this
>>> > metric is per DAG file) in the Airflow database, so we would need to
>>> > design it a little bit.
>>> > And attaching it to the DAG model is not correct.
>>> >
>>> > But I totally agree, it would be good to have it in Airflow UI as well
>>> > for "operation users" to have access to this information.
>>> >
>>> > On Fri, Jun 14, 2024 at 11:22 AM Jarek Potiuk <[email protected]> wrote:
>>> >
>>> > > Good idea, it would also be good if we could have access to the
>>> > > information exposed in the UI - so that "operations users" can see
>>> > > it and maybe even act on it + API/CLI to check it. I think in the
>>> > > future of Airflow 3 where we will have task isolation, having `0`
>>> > > for all the DAGs will be a prerequisite for switching to "task
>>> > > isolation" mode and this could be actually verified in a migration
>>> > > tool.
>>> > >
>>> > > On Fri, Jun 14, 2024 at 10:59 AM Eugen Kosteev <[email protected]> wrote:
>>> > >
>>> > > > Hi.
>>> > > >
>>> > > > I would like to discuss the proposal of adding a new column to the
>>> > > > "DAG File Processing Stats" of DAG processor logs.
>>> > > >
>>> > > > Currently in the logs of the DAG processor, there is the following
>>> > > > data (screenshot below) that includes # of DAGs, runtime, etc. per
>>> > > > DAG file.
>>> > > > [image: image.png]
>>> > > >
>>> > > > It seems that it would be beneficial to also have data there about
>>> > > > the number of queries performed to the Airflow database during
>>> > > > parsing of each file.
>>> > > > It may be convenient to have it when debugging issues related to
>>> > > > high load on the Airflow database, e.g. a typical scenario is when
>>> > > > DAG file(s) have a lot of queries to the database done at the top
>>> > > > level of the code and those are executed each time these DAG files
>>> > > > are parsed.
>>> > > > One common example is excessive usage of "Variable.get" as
>>> > > > top-level statements in DAG files.
>>> > > >
>>> > > > Having information about "number of queries to Airflow database"
>>> > > > per DAG file may help a lot when debugging issues related to high
>>> > > > load on the database or issues related to long parsing of the DAG
>>> > > > files.
>>> > > >
>>> > > > One caveat is that due to e.g. caching enabled for Variables or
>>> > > > because of other reasons (dynamic DAGs), the number of queries may
>>> > > > be very different for each parsing of the DAG file,
>>> > > > but at least we can have it as "Last Run Number of Queries" - that
>>> > > > would already give some idea, and an engineer can also review logs
>>> > > > historically to see its data in the past.
>>> > > >
>>> > > > What are your thoughts?
>>> > > >
>>> > > > --
>>> > > > Eugene
>>> > > >
>>> > >
>>> >
>>> >
>>> > --
>>> > Eugene
>>> >
>>>
>>
>
> --
> Eugene
>
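
For illustration only, here is a minimal sketch of the "last run number of queries per DAG file" idea from the original proposal. It is not how Airflow implements or would implement this: it uses the standard-library `sqlite3` module as a stand-in for the Airflow database, and `count_queries`, `variable_get`, `parse_dag_file` and `example_dag.py` are all hypothetical names invented for the example. With SQLAlchemy, a cursor-execute event hook would presumably play the role that the trace callback plays here.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def count_queries(conn):
    """Count SQL statements run on `conn` inside the with-block (hypothetical helper)."""
    stats = {"queries": 0}

    def on_statement(_sql):
        # Called by sqlite3 for each statement that actually runs.
        stats["queries"] += 1

    conn.set_trace_callback(on_statement)
    try:
        yield stats
    finally:
        # Always detach the callback so later statements are not counted.
        conn.set_trace_callback(None)


def variable_get(conn, key):
    """Hypothetical stand-in for a top-level Variable.get: one query per call."""
    row = conn.execute("SELECT val FROM variable WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


def parse_dag_file(conn, source):
    """'Parse' a DAG file by running its top-level code; return the query count."""
    with count_queries(conn) as stats:
        exec(source, {"variable_get": lambda key: variable_get(conn, key)})
    return stats["queries"]


# Assumed toy database standing in for the Airflow metadata DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE variable (key TEXT PRIMARY KEY, val TEXT)")
conn.execute("INSERT INTO variable VALUES ('env', 'prod')")
conn.commit()

# Top-level code of a made-up DAG file: every call hits the database on each parse.
dag_source = """
env = variable_get("env")
owner = variable_get("owner")
"""

# Keep only the last count, keyed by DAG file rather than by DAG.
last_run_queries = {"example_dag.py": parse_dag_file(conn, dag_source)}
print(last_run_queries)
```

Keying the result by file rather than by DAG mirrors the point made in the thread: the queries happen while the file's top-level code runs, so there is no natural way to attribute them to an individual DAG defined in that file.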
