I want to raise the question of how much normalization we want to use
here, as it seems to be an area that needs more attention.

The SIP suggests having DAG blobs, task blobs and edges (call it the
fairly-normalized approach). I also like the idea of all-encompassing
(call it very-denormalized) DAG blobs, as that seems easier to manage in
terms of versioning. The question here is whether we go with one of these
methods exclusively, something in-between, or even a hybrid approach
(redundant blobs that use different levels of normalization).
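For concreteness, here is a rough sketch of the two extremes as relational
schemas. All table and column names are hypothetical, made up for
illustration rather than taken from the SIP:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# "fairly-normalized": three versioned tables (hypothetical names).
conn.executescript("""
CREATE TABLE dag  (dag_id TEXT, version INTEGER, props TEXT,
                   PRIMARY KEY (dag_id, version));
CREATE TABLE task (dag_id TEXT, version INTEGER, task_id TEXT, blob TEXT,
                   PRIMARY KEY (dag_id, version, task_id));
CREATE TABLE edge (dag_id TEXT, version INTEGER,
                   upstream_id TEXT, downstream_id TEXT);
""")

# "very-denormalized": one atomic, version-stamped blob per DAG.
conn.executescript("""
CREATE TABLE dag_blob (dag_id TEXT, version INTEGER, blob TEXT,
                       PRIMARY KEY (dag_id, version));
""")

# Pushing a new DAG version is a single atomic write in the second model.
conn.execute("INSERT INTO dag_blob VALUES (?, ?, ?)",
             ("my_dag", 1, json.dumps({"tasks": {}, "edges": []})))
```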

It's nice and simple to just push or pull atomic DAG objects with a
version stamp on them. It's clearly simpler than dealing with 3 versioned
tables (dag, tasks, edges). There are a lot of pros/cons, and they become
more apparent with the perspective of very large DAGs. If the web server
is building a "task details page" using the "fairly-normalized" model, it
can just pull what it needs instead of pulling the large DAG blob.
Similarly, if building a sub-tree view (a subset of the DAG), perhaps it
can retrieve only what it needs. But if you need the whole DAG (say for
the scheduler use case), then you're dealing with more complex SQL/ORM
operations (joins hopefully, or multiple db round trips).
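To make the trade-off concrete, here is a toy comparison (schema and names
invented for the example, not from the SIP): the normalized model serves a
single task's details with one small row, while the blob model forces you
to pull and parse the whole DAG:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE task (dag_id TEXT, task_id TEXT, blob TEXT,
                   PRIMARY KEY (dag_id, task_id));
CREATE TABLE dag_blob (dag_id TEXT PRIMARY KEY, blob TEXT);
""")

# A large DAG: 500 tasks, each with a fat payload.
tasks = {f"task_{i}": {"op": "BashOperator", "cmd": "x" * 1000}
         for i in range(500)}
conn.executemany("INSERT INTO task VALUES (?, ?, ?)",
                 [("big_dag", tid, json.dumps(t))
                  for tid, t in tasks.items()])
conn.execute("INSERT INTO dag_blob VALUES (?, ?)",
             ("big_dag", json.dumps(tasks)))

# "task details page", normalized model: fetch one small row.
one_task = conn.execute(
    "SELECT blob FROM task WHERE dag_id = ? AND task_id = ?",
    ("big_dag", "task_42")).fetchone()[0]

# Same page, blob model: fetch and parse the whole ~500 KB DAG blob.
whole_dag = json.loads(conn.execute(
    "SELECT blob FROM dag_blob WHERE dag_id = ?",
    ("big_dag",)).fetchone()[0])
```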

Now maybe the right approach is something more like 2 tables: DAG and
task_details, where edge keys are denormalized into DAG (arguably that's a
few KBs at most, even for large DAGs), and maybe the DAG object has most
of the high-level task metadata (operator, name, key BaseOperator
attributes), while task_details has the big blobs (SQL code). This is
probably a nice compromise; the question becomes "how much task-level
detail do we store in the DAG-centric blob?", and the answer is probably
not much, to keep the DAG objects as small as possible. The main downside
here is that you cannot have the database do the join, and have to do 2
round trips to reconstruct a DAG object (fetch the DAG, parse the object
to get the list of tasks, and then run another db query to get those task
details).
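A sketch of that two-round-trip reconstruction (again with hypothetical
table and column names):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dag (dag_id TEXT PRIMARY KEY, version INTEGER, blob TEXT);
CREATE TABLE task_details (dag_id TEXT, task_id TEXT, big_blob TEXT,
                           PRIMARY KEY (dag_id, task_id));
""")

# The DAG row carries edges plus light task metadata; heavy payloads
# (e.g. SQL code) live in task_details.
conn.execute("INSERT INTO dag VALUES (?, ?, ?)",
             ("my_dag", 1, json.dumps({
                 "edges": [["extract", "load"]],
                 "tasks": {"extract": {"operator": "PythonOperator"},
                           "load": {"operator": "PostgresOperator"}}})))
conn.executemany("INSERT INTO task_details VALUES (?, ?, ?)",
                 [("my_dag", "extract", "def extract(): ..."),
                  ("my_dag", "load", "INSERT INTO warehouse ...")])

# Round trip 1: fetch and parse the DAG blob to learn the task ids.
dag_blob = json.loads(conn.execute(
    "SELECT blob FROM dag WHERE dag_id = ?", ("my_dag",)).fetchone()[0])
task_ids = list(dag_blob["tasks"])

# Round trip 2: fetch the heavy details for exactly those tasks.
placeholders = ",".join("?" * len(task_ids))
details = dict(conn.execute(
    f"SELECT task_id, big_blob FROM task_details "
    f"WHERE dag_id = ? AND task_id IN ({placeholders})",
    ["my_dag", *task_ids]))
```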

To sum up, I'd qualify the more normalized approach as the most proper,
but also the most complex. It'll shine in specific cases around large
DAGs. If we have the proper abstractions (methods like DAG.serialize(),
DAG.deserialize(version)) then I guess that's not an issue.
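By "proper abstractions" I mean something like the following minimal
sketch: a toy class with a versioned JSON payload. The real DAG object
obviously carries much more state, and the method names here are only
illustrative:

```python
import json


class DAG:
    """Toy DAG model with a versioned serialize/deserialize seam."""

    SERDE_VERSION = 1

    def __init__(self, dag_id, tasks=None, edges=None):
        self.dag_id = dag_id
        self.tasks = tasks or {}
        self.edges = edges or []

    def serialize(self):
        # Stamp the payload so readers can pick the right decoder later.
        return json.dumps({"version": self.SERDE_VERSION,
                           "dag_id": self.dag_id,
                           "tasks": self.tasks,
                           "edges": self.edges})

    @classmethod
    def deserialize(cls, payload):
        data = json.loads(payload)
        if data["version"] != cls.SERDE_VERSION:
            raise ValueError(f"unsupported serde version: {data['version']}")
        return cls(data["dag_id"], data["tasks"], data["edges"])


dag = DAG("my_dag",
          tasks={"t1": {"operator": "BashOperator"}},
          edges=[["t1", "t2"]])
roundtripped = DAG.deserialize(dag.serialize())
```

As long as callers only ever go through these two methods, the storage
layout underneath (one blob, three tables, or a hybrid) stays an
implementation detail.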

Max

On Fri, Mar 8, 2019 at 5:21 PM Kevin Yang <yrql...@gmail.com> wrote:

> Hi Julian, I'm definitely aligned with you guys on making the webserver
> independent of DAG parsing; the end goal to me would be to build a
> complete story around serializing DAGs--and move with that story in
> mind. I feel like you guys may already have a list of dynamic features
> we need to deprecate/change; if that is the case, feel free to open the
> discussion on what we do with them under DAG serialization.
>
> Julian, Ash, Dan, on 2nd thought I do agree that if we can meet the
> requirements Dan mentioned, it would be nice to have them stored in the
> DB. A combined solution, like having a column of serialized graph in the
> serialized dag table, can potentially meet all requirements. What format
> we end up using to represent DAGs between components is now less
> important IMO--it's fine to refactor those endpoints that only need
> DagModel to use only DagModel, and easy to do a batch replacement if we
> decide otherwise later. More important is to define the source of truth
> for serialized DAGs.
>
> Ash, ty for the email list, I'll tune my filters accordingly :D I'm
> leaning towards having a separate process for the parser, so we have no
> scheduler dependency etc. for the parser, but we can discuss this in
> another thread.
>
> On Fri, Mar 8, 2019 at 8:57 AM Dan Davydov <ddavy...@twitter.com.invalid>
> wrote:
>
> > >
> > > Personally I don’t understand why people are pushing for a JSON-based
> > > DAG representation
> >
> > It sounds like you agree that DAGs should be serialized (just in the
> > DB instead of JSON), so I will only address why JSON is better than
> > MySQL (AKA serializing at the DAG level vs the task level) as far as I
> > can see, and not why we need serialization. If you zoom out and look
> > at all the use cases of serialized DAGs, e.g. having the scheduler use
> > them instead of parsing DAGs directly, then it becomes clear that we
> > need all appropriate metadata in these DAGs (operator params, DAG
> > properties, etc), in which case it's not clear how it will fit nicely
> > into a DB table (unless you wanted to do something like
> > (parent_task_id, task_id, task_params)). Also keep in mind that we
> > will need to store different versions of each DAG in the future so
> > that we can ensure consistency in a dagrun, i.e. each task in a dagrun
> > uses the same version of a DAG.
> >
> > I think some of our requirements should be:
> > 1. The data model will lead to acceptable performance in all of its
> > consumers (scheduler, webserver, workers), i.e. no n+1 access patterns
> > (my biggest concern about serializing at task level as you propose vs
> > at DAG level)
> > 2. We can have versioning of serialized DAGs
> > 3. The ability to separate DAGs into their own data store (e.g. no
> > reliance on joins between the new table and the old one)
> > 4. One source of truth/serialized representation for DAGs (currently
> > we have SimpleDAG)
> >
> > If we can fulfill all of these requirements and serialize at the task
> > level rather than the DAG level in the DB, then I agree that probably
> > makes more sense.
> >
> >
> > > In the proposed PR’s we (Peter, Bas and me) aim to avoid re-parsing
> > > DAG files by querying all the required information from the
> > > database. In one or two cases this may however not be possible, in
> > > which case we might either have to fall back on the DAG file or add
> > > the missing information into the database. We can tackle these
> > > problems as we encounter them.
> >
> > I think you would have the support of many committers in removing any
> > use-cases that stand in the way of full serialization; that being
> > said, if we need to remove features we need to do this carefully and
> > thoughtfully, and ideally with proposed alternatives/work-arounds to
> > cover the removals.
> >
> > > The counter argument: this PR removes the need for the confusing
> > > "Refresh" button from the UI, and in general you only pay the cost
> > > for the expensive DAGs when you ask about them. (I don't know
> > > what/when we call the /pickle_info endpoint off the top of my head)
> >
> > Probably worth splitting out into a separate thread, but I'm actually
> > not sure the refresh button does anything; I think we should double
> > check... I think about 2 years ago there was a commit made that made
> > gunicorn webservers automatically rotate underneath flask (each one
> > would reparse the DAGbag). Even if it works we should probably remove
> > it since the webserver refresh interval is pretty fast, and it just
> > causes confusion to users and implies that the DAGs are not refreshed
> > automatically.
> >
> > > Do you mean https://json5.org/ or is this a typo? That might be okay
> > > for a nicer user front end, but the "canonical" version stored in
> > > the DB should be something "plainer" like just JSON.
> >
> > I think he got this from my reply, and it was just an example, but you
> > are right, I agree JSON would be better than JSON5.
> >
> > On Fri, Mar 8, 2019 at 8:53 AM Ash Berlin-Taylor <a...@apache.org> wrote:
> >
> > > Comments inline.
> > >
> > > > On 8 Mar 2019, at 11:28, Kevin Yang <yrql...@gmail.com> wrote:
> > > >
> > > > Hi all,
> > > > When I was preparing some work related to this AIP I found
> > > > something very concerning. I noticed this JIRA ticket <
> > > > https://issues.apache.org/jira/browse/AIRFLOW-3562> is trying to
> > > > remove the dependency on dagbag from the webserver, which is
> > > > awesome--something we wanted badly but never got to start work on.
> > > > However when I looked at some subtasks of it, which try to remove
> > > > the dagbag dependency from each endpoint, I found the way we
> > > > remove the dependency on dagbag is not very ideal. For example
> > > > this PR <https://github.com/apache/airflow/pull/4867/files> will
> > > > require us to parse the dag file each time we hit the endpoint.
> > >
> > > The counter argument: this PR removes the need for the confusing
> > > "Refresh" button from the UI, and in general you only pay the cost
> > > for the expensive DAGs when you ask about them. (I don't know
> > > what/when we call the /pickle_info endpoint off the top of my head)
> > >
> > > This endpoint may be one to hold off on (as it can ask for multiple
> > > dags) but there are some that definitely don't need a full dag bag
> > > or to even parse the dag file; the current DAG model has enough info.
> > >
> > > >
> > > >
> > > > If we go down this path, we indeed can get rid of the dagbag
> > > > dependency easily, but we will have to 1. increase the DB load
> > > > (not too concerning at the moment), 2. wait for the DAG file to be
> > > > parsed before getting the page back, potentially multiple times.
> > > > DAG files can sometimes take quite a while to parse, e.g. we have
> > > > some framework DAG files generating large numbers of DAGs from
> > > > some static config files or even jupyter notebooks, and they can
> > > > take 30+ seconds to parse. Yes we don't like large DAG files, but
> > > > people do see the beauty of code as config and sometimes heavily
> > > > abuse/leverage it. Assuming all users have the same nice small
> > > > python file that can be parsed fast, I'm still a bit worried about
> > > > this approach. Continuing on this path means we've chosen DagModel
> > > > to be the serialized representation of a DAG and DB columns to
> > > > hold its different properties--it can be one candidate but I don't
> > > > know if we should settle on that now. I would personally prefer a
> > > > more compact (e.g. JSON5) and easy to scale representation (such
> > > > that serializing new fields != DB upgrade).
> > >
> > > Do you mean https://json5.org/ or is this a typo? That might be okay
> > > for a nicer user front end, but the "canonical" version stored in
> > > the DB should be something "plainer" like just JSON.
> > >
> > > I'm not sure that "serializing new fields != DB upgrade" is that big
> > > of a concern, as we don't add fields that often. One possible way of
> > > dealing with it if we do is to have a hybrid approach - a few
> > > distinct columns, but then a JSON blob. (And if we were only to
> > > support postgres we could just use JSONb. But I think our friends at
> > > Google may object ;) )
> > >
> > > Adding a new column in a DB migration with a default NULL shouldn't
> > > be an expensive operation, or difficult to achieve.
> > >
> > >
> > > >
> > > > In my imagination we would have to collect the list of dynamic
> > > > features depending on unserializable fields of a DAG and start a
> > > > discussion/vote on dropping support for them (I'm working on this,
> > > > but if anyone has already done so please take over), decide on the
> > > > serialized representation of a DAG, and then replace dagbag with
> > > > it in the webserver. Per previous discussion and some offline
> > > > discussions with Dan, one future of DAG serialization that I like
> > > > would look similar to this:
> > > >
> > > > https://imgur.com/ncqqQgc
> > >
> > > Something I've thought about before for other things was to embed an
> > > API server _into_ the scheduler - this would be useful for k8s
> > > healthchecks and native Prometheus metrics without needing a statsd
> > > bridge, and it could have endpoints to get information such as this
> > > directly.
> > >
> > > I was thinking it would be _in_ the scheduler process using either
> > > threads (ick. Python's still got a GIL, doesn't it?) or using
> > > async/twisted etc. (not a side-car process like we have with the
> > > logs webserver for `airflow worker`).
> > >
> > > (This is possibly an unrelated discussion, but might be worth
> > > talking about?)
> > >
> > > > We can still discuss/vote on which approach we want to take, but I
> > > > don't want the door to the above design to be shut right now, or
> > > > we will have to spend a lot of effort switching paths later.
> > > >
> > > > Bas and Peter, I'm very sorry to extend the discussion, but I do
> > > > think this is tightly related to the AIP and the PRs behind it.
> > > > And my sincere apology for bringing this up so late (I only pull
> > > > the open PR list occasionally; if there's a way to subscribe to
> > > > new PR events I'd love to know how).
> > >
> > > It's noisy, but you can subscribe to comm...@airflow.apache.org (but
> > > be warned, this also includes all Jira tickets, edits of every
> > > comment on github etc.).
> > >
> > >
> > > >
> > > > Cheers,
> > > > Kevin Y
> > > >
> > > > On Thu, Feb 28, 2019 at 1:36 PM Peter van t Hof <pjrvant...@gmail.com
> > > > <mailto:pjrvant...@gmail.com>> wrote:
> > > > Hi all,
> > > >
> > > > Just some comments on the points Bolke did give in relation to my
> > > > PR.
> > > >
> > > > At first, the main focus is: making the webserver stateless.
> > > >
> > > > > 1) Make the webserver stateless: needs the graph of the
> > > > > *current* dag
> > > >
> > > > This is the main goal, but for this a lot more PRs will be coming
> > > > once my current one is merged. For edges and the graph view this
> > > > is covered in my PR already.
> > > >
> > > > > 2) Version dags: for consistency mainly and not requiring
> > > > > parsing of the dag on every loop
> > > >
> > > > In my PR the historical graphs will be stored for each DagRun.
> > > > This means that you can see if an older DagRun had the same graph
> > > > structure, even if some tasks do not exist anymore in the current
> > > > graph. Especially for dynamic DAGs this is very useful.
> > > >
> > > > > 3) Make the scheduler not require DAG files. This could be done
> > > > > if the edges contain all information on when to trigger the next
> > > > > task. We can then have event driven dag parsing outside of the
> > > > > scheduler loop, i.e. by the cli. Storage can also be somewhere
> > > > > else (git, artifactory, filesystem, whatever).
> > > >
> > > > The scheduler is almost untouched in this PR. The only thing that
> > > > is added is that these edges are saved to the database, but the
> > > > scheduling itself didn't change. The scheduler still depends on
> > > > the DAG object.
> > > >
> > > > > 4) Fully serialise the dag so it becomes transferable to workers
> > > >
> > > > It's nice to see that people have a lot of ideas about this. But
> > > > as Fokko already mentioned, this is out of scope for the issue we
> > > > are trying to solve. I also have some ideas about this, but I'd
> > > > like to limit this PR/AIP to the webserver.
> > > >
> > > > For now my PR does solve 1 and 2, and the rest of the behaviour
> > > > (like scheduling) is untouched.
> > > >
> > > > Gr,
> > > > Peter
> > > >
> > >
> > >
> >
>
