Yep. Per DAG file is what I actually meant :)

On Fri, Jun 14, 2024 at 12:26 PM Eugen Kosteev <[email protected]> wrote:

> The thing is that it is "last count per DAG file".
> I do not think we can actually calculate this per DAG. We could split the
> total number of queries by the number of DAGs in the file, but this may be
> confusing.
>
> On Fri, Jun 14, 2024 at 12:24 PM Jarek Potiuk <[email protected]> wrote:
>
>> >  the cardinality of those logs is too high.
>>
>> I was thinking about only showing "last count per DAG" - then cardinality
>> would be "good enough" I think. It could also be exposed via metrics now
>> that I think of it - no real need to see it in UI or API.
>>
>> On Fri, Jun 14, 2024 at 12:14 PM Kaxil Naik <[email protected]> wrote:
>>
>>> Yeah, valuable to show it in logs. For showing it in a web server or
>>> storing it in DB, the cardinality of those logs is too high.
>>>
>>> On Fri, 14 Jun 2024 at 11:09, Eugen Kosteev <[email protected]> wrote:
>>>
>>> > Yeah, I also think it is a good idea to expose it in the Airflow UI.
>>> >
>>> > Although, atm we do not have an entity such as DAG file (and this metric
>>> > is per DAG file) in Airflow database, so we would need to design it a
>>> > little bit.
>>> > And attaching to the DAG model is not correct.
>>> >
>>> > But I totally agree, it would be good to have it in Airflow UI as well
>>> > for "operation users" to have access to this information.
>>> >
>>> > On Fri, Jun 14, 2024 at 11:22 AM Jarek Potiuk <[email protected]> wrote:
>>> >
>>> > > Good idea, it would also be good if we could have access to the
>>> > > information exposed in the UI - so that "operations users" can see it
>>> > > and maybe even act on it + API/CLI to check it. I think in the future
>>> > > of Airflow 3 where we will have task isolation, having `0` for all the
>>> > > DAGs will be a prerequisite for switching to "task isolation" mode and
>>> > > this could be actually verified in a migration tool.
>>> > >
>>> > > On Fri, Jun 14, 2024 at 10:59 AM Eugen Kosteev <[email protected]> wrote:
>>> > >
>>> > > > Hi.
>>> > > >
>>> > > > I would like to discuss the proposal of adding a new column to the
>>> > > > "DAG File Processing Stats" of DAG processor logs.
>>> > > >
>>> > > > Currently in the logs of the DAG processor, there is the following
>>> > > > data (screenshot below) that includes # of DAGs, runtime, etc. per
>>> > > > DAG file.
>>> > > > [image: image.png]
>>> > > >
>>> > > > It seems that it would be beneficial to also have there data about
>>> > > > the number of queries performed to the Airflow database during
>>> > > > parsing of each file.
>>> > > > It may be convenient to have it when debugging issues related to high
>>> > > > load on the Airflow database, e.g. a typical scenario is when DAG
>>> > > > file(s) have a lot of queries to the database done at the top level of
>>> > > > the code, and those are executed each time these DAG files are parsed.
>>> > > > One common example is excessive usage of "Variable.get" as top-level
>>> > > > statements in DAG files.
>>> > > >
>>> > > > Having information about "number of queries to Airflow database" per
>>> > > > DAG file may help a lot when debugging issues related to high load on
>>> > > > the database or issues related to long parsing of the DAG files.
>>> > > >
>>> > > > One caveat is that due to e.g. caching enabled for Variables or
>>> > > > because of other reasons (dynamic DAGs), the number of queries may be
>>> > > > very different for each parsing of the DAG file,
>>> > > > but at least we can have it as "Last Run Number of Queries" - that
>>> > > > would already give some idea, and an engineer can also review logs
>>> > > > historically to see its data in the past.
>>> > > >
>>> > > > What are your thoughts?
>>> > > >
>>> > > > --
>>> > > > Eugene
>>> > > >
>>> > >
>>> >
>>> >
>>> > --
>>> > Eugene
>>> >
>>>
>>
>
> --
> Eugene
>
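
To make the "last count per DAG file" idea from the thread concrete, here is a
minimal sketch. It is only an illustration under assumptions: it uses the
standard-library sqlite3 module as a stand-in for the Airflow metadata
database, and the names count_queries, record_parse, last_num_of_db_queries
and the file paths are hypothetical, not Airflow APIs. A real implementation
in the DAG processor would likely hook into the database layer differently.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def count_queries(conn):
    """Count SQL statements executed on `conn` inside the with-block."""
    counter = {"count": 0}

    def on_statement(_sql):
        # sqlite3 calls this once for every statement it actually runs.
        counter["count"] += 1

    conn.set_trace_callback(on_statement)
    try:
        yield counter
    finally:
        # Always detach the callback so later statements are not counted.
        conn.set_trace_callback(None)


def record_parse(conn, file_path, parse_fn, last_counts):
    """Run one (simulated) parse of a DAG file and keep only its last count."""
    with count_queries(conn) as counter:
        parse_fn(conn)
    # Overwrite the previous value: one number per DAG file, not per DAG.
    last_counts[file_path] = counter["count"]


# Stand-in for the metadata database (assumption: in-memory sqlite,
# autocommit mode so no implicit BEGIN/COMMIT statements are traced).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE variable (key TEXT PRIMARY KEY, val TEXT)")
conn.execute("INSERT INTO variable VALUES ('env', 'prod')")


def parse_heavy_file(conn):
    # Simulates top-level variable lookups that hit the database on each parse.
    for _ in range(3):
        conn.execute("SELECT val FROM variable WHERE key = ?", ("env",)).fetchone()


def parse_clean_file(conn):
    # Simulates a DAG file with no top-level database access.
    pass


last_num_of_db_queries = {}
record_parse(conn, "dags/heavy.py", parse_heavy_file, last_num_of_db_queries)
record_parse(conn, "dags/clean.py", parse_clean_file, last_num_of_db_queries)

for path, count in last_num_of_db_queries.items():
    print(f"{path}: {count} queries during last parse")
```

Keeping only the latest value per file keeps cardinality bounded by the number
of DAG files, which is the property discussed above for logs or metrics.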
