john-bodley opened a new issue, #25048:
URL: https://github.com/apache/superset/issues/25048

   *Please make sure you are familiar with the SIP process documented*
   (here)[https://github.com/apache/superset/issues/5602]. The SIP will be 
numbered by a committer upon acceptance.
   
   ## [SIP] Proposal for managing SQLAlchemy sessions
   
   ### Motivation
   
   Whilst working on refactoring the DAOs for [[SIP-92] Proposal for 
restructuring the Python code 
base](https://github.com/apache/superset/issues/20630) it became apparent that 
our grasp on: 
   
   - [SQLAlchemy sessions](https://docs.sqlalchemy.org/en/20/orm/session.html), 
and
   - [The unit of 
work](https://medium.com/craftsmenltd/sqlalchemy-and-the-unit-of-work-pattern-dfa91a098023)—a
 central part of SQLAlchemy's Object Relational Mapper (ORM)
   
   was likely incorrect or missoncruded resulting in overly complex DAOs, 
models, event listeners, unit/integration tests, etc.
   
   ### Primer 
   
   This SIP proposes a number of changes to address these concerns. I sense 
there's merit is starting off with a short primer on how SQLAlchemy sessions 
are defined which will serve as a precursor  to discussing the "unit of work" 
topic.
   
   #### SQLAlchemy sessions
   
   There are [five 
states](https://docs.sqlalchemy.org/en/20/orm/session_state_management.html) 
which an instance can have within a session:
   
   - Transient
   - Pending
   - Persisted*
   - Deleted
   - Detached
   
   \* It's worth noting that the persisted state occurs either during 
[flushing](https://docs.sqlalchemy.org/en/20/orm/session_basics.html#flushing) 
(which is invoked by `Session.merge()`) or querying. Additionally the 
flush—which is typically done transparently—is when the various constraints, 
foreign key checks, etc. are invoked.
   
   The flush communicates a series of operations to the database (`INSERT`, 
`UPDATE`, `DELETE`). The database maintains them as pending operations in a 
transaction. The changes aren't persisted permanently to disk, or visible to 
other transactions until the database receives a `COMMIT`.
   
   Here's an example of which hopefully illustrates these various states:
   
   ```python
   >>> from sqlalchemy import inspect
   >>>
   >>> from superset import db
   >>> from superset.models.slice import Slice
   
   >>> slc = Slice(slice_name="foo", datasource_type="table", datasource_id=1)
   >>> slc in db.session
   False
   >>> inspect(slc).transient
   >>> True
   >>> db.session.add(slc)
   >>> slc in db.session
   True
   >>> inspect(slc).pending
   True
   >>> slc is db.session.query(Slice).filter(Slice.slice_name == "bar").one()
   True
   >>> db.session.flush()  # INSERT INTO slices ...
   >>> inspect(slc).persistent
   True
   
   >>> slc = db.session.query(Slice).get(2)
   >>> slc in db.session
   True
   >>> inspect(slc).persistent
   True
   >>> db.session.dirty
   IdentitySet([])
   >>> slc.slice_name = "foo"  
   >>> db.session.dirty
   IdentitySet([foo])
   >>> db.session.flush()  # UPDATE slices SET ...
   
   >>> slc = db.session.query(Slice).get(2)
   >>> inspect(slc).persistent
   True
   >>> db.session.delete(slc)
   >>> db.session.flush()  # DELETE FROM slices WHERE slices.id = ?
   >>> inspect(slc).deleted
   True
   >>> db.session.commit()
   >>> inspect(slc).detached
   True
   ```
   
   Note it's not atypically to see the following patterns in the code base:
   
   ```python
   from superset import db
   from superset.models.slice import Slice
   
   slc = db.session.query(Slice).get(2)
   slc.slice_name = "foo"
   db.session.merge(slc)
   db.session.commit()
   ```
   
   The `merge` step is unnecessary given that the `slc` object is already 
persisted. The SQLAlchemy session is aware that the session is dirty and upon 
flushing will invoke the SQL statement within the session's [single virtual 
transaction](https://docs.sqlalchemy.org/en/20/orm/session_transaction.html#managing-transactions).
 
   
   #### Unit of work
   
   This provides a good segue into the concept of "unit of work" and our 
tendency to over commit—which is a typical SQLAlchemy mistake. Per the 
[SQLAlchemy documentation](https://www.sqlalchemy.org/features.html):
   
   > The Unit Of Work system, a central part of SQLAlchemy's Object Relational 
Mapper (ORM), organizes pending `INSERT`/`UPDATE`/`DELETE` operations into 
queues and flushes them all in one batch.
   
   This means that we should organize our various database operations into a 
single unit (business transaction) which is either:
   
   - committed (if successful, or 
   - rolled back (if unsuccessful)
   
   If one has a tendency to over commit (be that via SQLAlchemy event 
listeners, unit/integration tests, etc.) it prevents the ability to rollback to 
the previous state prior to the unit of work—hence why tests typically 
unnecessarily commit objects (whereas they could have happily remained 
persisted in either the session or transaction for testing purposes—and then 
rolled back) which then need to be explicitly deleted.
   
   In the context of Flask a typical "unit of work" is a request, however 
Superset has a CLI and thus not all operations occurs via Flask request. Per 
[[SIP-35] Proposal for Improving Superset’s Python Code 
Organization](https://github.com/apache/superset/issues/9077) the DAO has been 
the de facto "unit of work" which in actuality is only a sub-unit of work, 
i.e., 
   
   ```python
   # superset/daos.base.py
   class BaseDAO:
       def update(cls, item: Model, attributes: dict[str, Any], commit: bool = 
True) -> T:    
           for key, value in attrbutes.items():
               setattr(item key, value)
   
           try:
               db.session.merge(item)
   
               if commit:
                    db.session.commit()
           except SQLAlchemyError as ex:
               db.session.rollback()
               raise DAOUpdateFailedError(exception=ex) from ex
   ``` 
   
   To circumvent this problem many DAOs provide a `commit = True | False` 
option to combine multiple steps—such as creating a dataset then populating the 
columns via fetching the associated metadata from the data warehouse—into a 
single commit, however it becomes problematic when trying to build/maintain a 
mental model of the system with complex dependencies. 
   
   Additionally we often leverage various SQLAlchemy events—`after_insert`, 
`before_delete`, etc.—which tend to have their own session/transaction 
resulting in a more fragmented unit.
   
   Finally there are [numerous 
suggestions](https://stackoverflow.com/questions/773393/dao-pattern-where-do-transactions-fit-in)
 that one should 
   
   > ... decouple your business layer from any DAO technologies.
   
   or 
   
   >   ... keep the control of the transaction in the business layer because 
transaction is a business problem.
   
   thus DAOs really are not the right construct to commit (permanently save) 
the changes. 
   
   ### Proposed Change
   
   As mentioned previously Superset has both an RESTful HTTP API and a CLI and 
thus the the commands (as opposed to the DAOs) best serves as our unit of work. 
Not all API calls invoke a command and thus the API should serve as the de 
facto unit of work if no corresponding command exist.
   
   Using a command (as opposed to a DAO) as the unit of work does not mitigate 
the over commit problem completely as it could be conservable where a business 
transaction needs to invoke multiple commands. To address this—without over 
burdening the developer to mentally track the session state we recommend to 
leverage SQLAlchemy's [nested 
session](https://docs.sqlalchemy.org/en/20/orm/session_transaction.html#using-savepoint)
 given that MySQL, PostreSQL, and SQLite all support the `SAVEPOINT` construct. 
   
   The premise is (by leveraging a context manager) these sessions can be 
nested where the outermost session constructs a transaction which is then 
inherited by the inner nested sessions. A session can be rolled back to 
the`SAVEPOINT` whilst preserving the previous state. Typically this isn't a 
behavior we'll likely adopt given that we would prefer any uncaught 
`SQLAlchemyError` exception to percolate back to the outermost nested session 
which will then be rolled back. The TL;DR is by nesting nested sessions we are 
able to ensure:
   
   1. There are only two states: the entire unit of work is i) committed, or 
ii) rolled back.
   2. We can be completely agnostic when authoring the code at what layer the 
nested session is. In some contexts the same nested session could be the 
outermost whereas in others it is the innermost. All we have to ensure is that 
we never explicitly commit or rollback the session—both of which are handled by 
the context manager. 
   
   Here's how this pattern will materialize:
   
   ```python
   # superset/daos/base.py
   class BaseDAO:
       def update(cls, item: Model, attributes: dict[str, Any]) -> T:    
           for key, value in attrbutes.items():
               setattr(item key, value)
       
           db.session.merge(item)
   ```
   
   and
   
   ```python
   # superset/dashboards/commands/update.py
   class UpdateDashboardCommand:
       def run(self) -> Dashboard:
           try:
               with db.session.begin_nested():
                   dashboard = DashboardDAO.update(self._model, 
self._properties)
                   ChartDAO.update_owners(dashboard)
           except SQLAlchemyError:
               raise DashboardUpdateFailedError() from ex 
   ```
     
   The TL;DR is:
   
   - A command should be thought as of the canonical "unit of work".
   - A DAO (and their sub-components—including event listeners) should not 
define a new session or transaction, or commit/rollback an transaction.
   - `Session.merge()` should only be invoked sparingly.
   - Tests should leverage the session and transaction state—flushing and 
rolling back as opposed to committing. 
   
   ### New or Changed Public Interfaces
   
   None.
   
   ### New dependencies
   
   None.
   
   ### Migration Plan and Compatibility
   
   N/A.
   
   ### Rejected Alternatives
   
   None.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to