john-bodley opened a new issue, #25048: URL: https://github.com/apache/superset/issues/25048
*Please make sure you are familiar with the SIP process documented* (here)[https://github.com/apache/superset/issues/5602]. The SIP will be numbered by a committer upon acceptance. ## [SIP] Proposal for managing SQLAlchemy sessions ### Motivation Whilst working on refactoring the DAOs for [[SIP-92] Proposal for restructuring the Python code base](https://github.com/apache/superset/issues/20630) it became apparent that our grasp on: - [SQLAlchemy sessions](https://docs.sqlalchemy.org/en/20/orm/session.html), and - [The unit of work](https://medium.com/craftsmenltd/sqlalchemy-and-the-unit-of-work-pattern-dfa91a098023)—a central part of SQLAlchemy's Object Relational Mapper (ORM) was likely incorrect or missoncruded resulting in overly complex DAOs, models, event listeners, unit/integration tests, etc. ### Primer This SIP proposes a number of changes to address these concerns. I sense there's merit is starting off with a short primer on how SQLAlchemy sessions are defined which will serve as a precursor to discussing the "unit of work" topic. #### SQLAlchemy sessions There are [five states](https://docs.sqlalchemy.org/en/20/orm/session_state_management.html) which an instance can have within a session: - Transient - Pending - Persisted* - Deleted - Detached \* It's worth noting that the persisted state occurs either during [flushing](https://docs.sqlalchemy.org/en/20/orm/session_basics.html#flushing) (which is invoked by `Session.merge()`) or querying. Additionally the flush—which is typically done transparently—is when the various constraints, foreign key checks, etc. are invoked. The flush communicates a series of operations to the database (`INSERT`, `UPDATE`, `DELETE`). The database maintains them as pending operations in a transaction. The changes aren't persisted permanently to disk, or visible to other transactions until the database receives a `COMMIT`. Here's an example of which hopefully illustrates these various states: ```python >>> from sqlalchemy import inspect >>> >>> from superset import db >>> from superset.models.slice import Slice >>> slc = Slice(slice_name="foo", datasource_type="table", datasource_id=1) >>> slc in db.session False >>> inspect(slc).transient >>> True >>> db.session.add(slc) >>> slc in db.session True >>> inspect(slc).pending True >>> slc is db.session.query(Slice).filter(Slice.slice_name == "bar").one() True >>> db.session.flush() # INSERT INTO slices ... >>> inspect(slc).persistent True >>> slc = db.session.query(Slice).get(2) >>> slc in db.session True >>> inspect(slc).persistent True >>> db.session.dirty IdentitySet([]) >>> slc.slice_name = "foo" >>> db.session.dirty IdentitySet([foo]) >>> db.session.flush() # UPDATE slices SET ... >>> slc = db.session.query(Slice).get(2) >>> inspect(slc).persistent True >>> db.session.delete(slc) >>> db.session.flush() # DELETE FROM slices WHERE slices.id = ? >>> inspect(slc).deleted True >>> db.session.commit() >>> inspect(slc).detached True ``` Note it's not atypically to see the following patterns in the code base: ```python from superset import db from superset.models.slice import Slice slc = db.session.query(Slice).get(2) slc.slice_name = "foo" db.session.merge(slc) db.session.commit() ``` The `merge` step is unnecessary given that the `slc` object is already persisted. The SQLAlchemy session is aware that the session is dirty and upon flushing will invoke the SQL statement within the session's [single virtual transaction](https://docs.sqlalchemy.org/en/20/orm/session_transaction.html#managing-transactions). #### Unit of work This provides a good segue into the concept of "unit of work" and our tendency to over commit—which is a typical SQLAlchemy mistake. Per the [SQLAlchemy documentation](https://www.sqlalchemy.org/features.html): > The Unit Of Work system, a central part of SQLAlchemy's Object Relational Mapper (ORM), organizes pending `INSERT`/`UPDATE`/`DELETE` operations into queues and flushes them all in one batch. This means that we should organize our various database operations into a single unit (business transaction) which is either: - committed (if successful, or - rolled back (if unsuccessful) If one has a tendency to over commit (be that via SQLAlchemy event listeners, unit/integration tests, etc.) it prevents the ability to rollback to the previous state prior to the unit of work—hence why tests typically unnecessarily commit objects (whereas they could have happily remained persisted in either the session or transaction for testing purposes—and then rolled back) which then need to be explicitly deleted. In the context of Flask a typical "unit of work" is a request, however Superset has a CLI and thus not all operations occurs via Flask request. Per [[SIP-35] Proposal for Improving Superset’s Python Code Organization](https://github.com/apache/superset/issues/9077) the DAO has been the de facto "unit of work" which in actuality is only a sub-unit of work, i.e., ```python # superset/daos.base.py class BaseDAO: def update(cls, item: Model, attributes: dict[str, Any], commit: bool = True) -> T: for key, value in attrbutes.items(): setattr(item key, value) try: db.session.merge(item) if commit: db.session.commit() except SQLAlchemyError as ex: db.session.rollback() raise DAOUpdateFailedError(exception=ex) from ex ``` To circumvent this problem many DAOs provide a `commit = True | False` option to combine multiple steps—such as creating a dataset then populating the columns via fetching the associated metadata from the data warehouse—into a single commit, however it becomes problematic when trying to build/maintain a mental model of the system with complex dependencies. Additionally we often leverage various SQLAlchemy events—`after_insert`, `before_delete`, etc.—which tend to have their own session/transaction resulting in a more fragmented unit. Finally there are [numerous suggestions](https://stackoverflow.com/questions/773393/dao-pattern-where-do-transactions-fit-in) that one should > ... decouple your business layer from any DAO technologies. or > ... keep the control of the transaction in the business layer because transaction is a business problem. thus DAOs really are not the right construct to commit (permanently save) the changes. ### Proposed Change As mentioned previously Superset has both an RESTful HTTP API and a CLI and thus the the commands (as opposed to the DAOs) best serves as our unit of work. Not all API calls invoke a command and thus the API should serve as the de facto unit of work if no corresponding command exist. Using a command (as opposed to a DAO) as the unit of work does not mitigate the over commit problem completely as it could be conservable where a business transaction needs to invoke multiple commands. To address this—without over burdening the developer to mentally track the session state we recommend to leverage SQLAlchemy's [nested session](https://docs.sqlalchemy.org/en/20/orm/session_transaction.html#using-savepoint) given that MySQL, PostreSQL, and SQLite all support the `SAVEPOINT` construct. The premise is (by leveraging a context manager) these sessions can be nested where the outermost session constructs a transaction which is then inherited by the inner nested sessions. A session can be rolled back to the`SAVEPOINT` whilst preserving the previous state. Typically this isn't a behavior we'll likely adopt given that we would prefer any uncaught `SQLAlchemyError` exception to percolate back to the outermost nested session which will then be rolled back. The TL;DR is by nesting nested sessions we are able to ensure: 1. There are only two states: the entire unit of work is i) committed, or ii) rolled back. 2. We can be completely agnostic when authoring the code at what layer the nested session is. In some contexts the same nested session could be the outermost whereas in others it is the innermost. All we have to ensure is that we never explicitly commit or rollback the session—both of which are handled by the context manager. Here's how this pattern will materialize: ```python # superset/daos/base.py class BaseDAO: def update(cls, item: Model, attributes: dict[str, Any]) -> T: for key, value in attrbutes.items(): setattr(item key, value) db.session.merge(item) ``` and ```python # superset/dashboards/commands/update.py class UpdateDashboardCommand: def run(self) -> Dashboard: try: with db.session.begin_nested(): dashboard = DashboardDAO.update(self._model, self._properties) ChartDAO.update_owners(dashboard) except SQLAlchemyError: raise DashboardUpdateFailedError() from ex ``` The TL;DR is: - A command should be thought as of the canonical "unit of work". - A DAO (and their sub-components—including event listeners) should not define a new session or transaction, or commit/rollback an transaction. - `Session.merge()` should only be invoked sparingly. - Tests should leverage the session and transaction state—flushing and rolling back as opposed to committing. ### New or Changed Public Interfaces None. ### New dependencies None. ### Migration Plan and Compatibility N/A. ### Rejected Alternatives None. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
