john-bodley opened a new issue, #25107: URL: https://github.com/apache/superset/issues/25107
## [SIP-98A] Primer on managing SQLAlchemy sessions This SIP is part of the [[SIP-98] Proposal for correctly handling business logic](https://github.com/apache/superset/issues/25048) series. Specifically, it serves as a primer—reiterating best practices—for working with SQLAlchemy sessions as these were typically mismanaged. The topics outlined here are highly coupled with [[SIP-98B] Proposal for (re)defining a “unit of work”](...) as they both relate to managing database transactions. ### Session Management The [SQLAlchemy session](https://docs.sqlalchemy.org/en/20/orm/session_basics.html)—which establishes all conversations with the database—is managed by [Flask-SQLAlchemy](https://github.com/pallets-eco/flask-sqlalchemy) which also handles cleaning up connections and sessions after each [request](https://github.com/pallets-eco/flask-sqlalchemy/blob/564a96f2ee07819aac2dcca7d9eb58dcfeb1765f/src/flask_sqlalchemy/extension.py#L379-L386). The preconfigured scoped session (called `db.session`) should be used for the lifetime of a request to ensure there is only one “virtual” transaction—guaranteeing that the “unit of work” is indeed atomic ([SIP-98B](...)). The SQLAlchemy [Managing Transactions](https://docs.sqlalchemy.org/en/20/orm/session_transaction.html#managing-transactions) documentation states: > The Session tracks the state of a single “virtual" transaction at a time. This “virtual" transaction is created automatically when needed, or can alternatively be started using the `Session.begin()` method. `Session.commit()` ends the transaction. The transaction is typically framed as follows, ```python try: db.session.add(Dashboard(slug="foo")) db.session.commit() except SQLAlchemyError: db.session.rollback() raise ``` though, via leveraging a context manager, the `Session.begin()` method serves as a demarcated transaction and provides the same sequence of operations: ```python with db.session.begin(): db.session.add(Dashboard(slug="foo")) ``` #### Session States There are [five states](https://docs.sqlalchemy.org/en/20/orm/session_state_management.html) which an object can have within a session: - Transient - Pending - Persisted - Deleted - Detached The session tracks objects that are either new, dirty, or deleted. The [`Session.flush()`](https://docs.sqlalchemy.org/en/20/orm/session_api.html#sqlalchemy.orm.Session.flush) method communicates a series of operations to the database (`INSERT`, `UPDATE`, `DELETE`). The database maintains them as pending operations in a transaction. The changes are not persisted permanently to disk, or visible to other transactions until the database receives a `COMMIT` statement via the [`Session.commit()`](https://docs.sqlalchemy.org/en/20/orm/session_api.html#sqlalchemy.orm.Session.commit) method—thus ending the transaction. If `Session.commit()` is called, then `Session.flush()` is called automatically. The [`Session.merge()`](https://docs.sqlalchemy.org/en/20/orm/session_api.html#sqlalchemy.orm.Session.merge) convenience method copies the state of a given instance into a corresponding instance within the session. > `Session.merge()` examines the primary key attributes of the source instance, and attempts to reconcile it with an instance of the same primary key in the session. If not found locally, it attempts to load the object from the database based on the primary key, and if none can be located, creates a new instance. The state of each attribute on the source instance is then copied to the target instance. Without a primary key `Session.merge()` is akin to `Session.add()`, ```python db.session.merge(Dashboard(slug="foo")) db.session.flush() # INSERT INTO dashboards ... ``` ```python db.session.merge(Dashboard(id=1, slug="foo")) # SELECT * FROM dashboards WHERE id = ? db.session.flush() # UPDATE dashboards SET ... ``` Note the use of the `Session.merge()` method is often misinterpreted, i.e., a persistent object does not need to be (re)merged. If the object is dirty the pending changes will be persisted during a subsequent flush. #### Transient An object that is not in the session. ```python >>> dashboard = Dashboard(slug="foo") >>> inspect(dashboard).transient True >>> db.session.new IdentitySet([]) >>> db.session.add(dashboard) >>> db.session.new IdentitySet([Dashboard<foo>]) >>> inspect(dashboard).transient False >>> db.session.remove() ``` #### Pending An object which has been added to the session but has not yet been flushed to the database. ```python >>> dashboard = Dashboard(slug="foo") >>> db.session.add(dashboard) >>> inspect(dashboard).pending True >>> db.session.new IdentitySet([Dashboard<foo>]) >>> db.session.flush() # INSERT INTO dashboards ... >>> db.session.new IdentitySet([]) >>> inspect(dashboard).pending False >>> db.session.remove() ``` Note that constraints will only be checked when the object is flushed. ```python >>> db.session.add_all([Dashboard(slug="bar"), Dashboard(slug="bar")]) >>> db.session.flush() # INSERT INTO dashboards ... sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: dashboards.slug >>> db.session.remove() ``` #### Persistent An object which is present in the session and has a corresponding record in the database. The persisted state occurs either during [flushing](https://docs.sqlalchemy.org/en/20/orm/session_basics.html#flushing) (which is invoked by `Session.merge()`) or querying. ```python >>> dashboard = Dashboard(slug="foo") >>> db.session.add(dashboard) >>> db.session.flush() # INSERT INTO dashboards ... >>> inspect(dashboard).persistent True >>> db.session.dirty IdentitySet([]) >>> dashboard.slug = "bar" >>> db.session.dirty IdentitySet([Dashboard<1>]) >>> db.session.flush() # UPDATE dashboards SET ... >>> db.session.dirty IdentitySet([]) >>> db.session.remove() ``` It is worth noting that objects within the session, in either pending* or persisted state, can be queried albeit not having been committed to the database. We have a tendency to over commit (be that in the code or tests) resulting in fractured/partial atomic units. Typically working with objects in a persisted state is sufficient. ```python >>> dashboard = Dashboard(slug="foo") >>> db.session.add(dashboard) >>> inspect(dashboard).pending True >>> dashboard is db.session.query(Dashboard).filter(Dashboard.slug == "foo").one() >>> True >>> inspect(dashboard).persistent True >>> db.session.remove() ``` \* The `Session.query()` method invoked a `Session.flush()` which meant that the pending object became persisted. #### Deleted An object which has been deleted within a flush, but the transaction has not yet completed. ```python >>> dashboard = Dashboard(slug="foo") >>> db.session.add(dashboard) >>> db.session.flush() # INSERT INTO dashboards ... >>> inspect(dashboard).deleted False >>> db.session.deleted IdentitySet([]) >>> db.session.delete(dashboard) >>> db.session.deleted IdentitySet([Dashboard<foo>]) >>> db.session.flush() # DELETE FROM dashboards WHERE id = ? >>> inspect(dashboard).deleted True >>> db.session.remove() ``` #### Detached An object that corresponds (or previously corresponded) to a record in the database, but is not currently in any session. ```python >>> dashboard = Dashboard(slug="foo") >>> db.session.add(dashboard) >>> db.session.flush() # INSERT INTO dashboards ... >>> inspect(dashboard).detached >>> False >>> db.session.remove() >>> inspect(dashboard).detached True ``` #### Examples The following examples illustrate where the session was mismanaged: - The [`TableSchemaView.expanded()`](https://github.com/apache/superset/blob/3630d6844c0f4668f7196beadd744e582c9219bd/superset/views/sql_lab/views.py#L217-L228) method unnecessarily commits even though the function only fetches, i.e., there are no new, dirty, or deleted objects in the session. - The [`UpdateKeyValueCommand.update()`](https://github.com/apache/superset/blob/f584c8462b6d72f902a59e5f72a0f4603de4da1a/superset/key_value/commands/update.py#L77-L92) method unnecessarily merges the dirty `event` object even though it was already persisted—per the `Session.query()` call—in the session. - The [test_get_expired_entry](https://github.com/apache/superset/blob/f584c8462b6d72f902a59e5f72a0f4603de4da1a/tests/integration_tests/key_value/commands/get_test.py#L66-L82) test unnecessarily commits the `entry` object to the database given that the [`GetKeyValueCommand.run()`](https://github.com/apache/superset/blob/f584c8462b6d72f902a59e5f72a0f4603de4da1a/superset/key_value/commands/get.py#L58-L62) method leverages the same Flask-SQLAlchemy session. This requires additional work to then delete the object from the database to ensure that the test remains idempotent. ### Proposed Change None beyond adhering to best practices when managing SQLAlchemy sessions—via the Flask-SQLAlchemy singleton scoped session. ### New or Changed Public Interfaces None. ### New Dependencies None. ### Migration Plan and Compatibility - A wiki page will be created to specify the rules which should be adhered to. - A series of PRs to address existing rule violations. - New PRs (where possible) should adhere to the rules. None. ### Rejected Alternatives None. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
