This is an automated email from the ASF dual-hosted git repository.

potiuk pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/airflow.git


The following commit(s) were added to refs/heads/main by this push:
     new 3bb5978e63 Add Architecture Decision Record for common.sql 
introduction (#36015)
3bb5978e63 is described below

commit 3bb5978e63f3be21a5bb7ae89e7e3ce9d06a4ab8
Author: Jarek Potiuk <[email protected]>
AuthorDate: Wed Dec 6 21:36:51 2023 +0100

    Add Architecture Decision Record for common.sql introduction (#36015)
    
    Fixes: #35874
---
 .../doc/adr/0001-record-architecture-decisions.md  |  48 +++++++
 ...-data-structure-from-dbapihook-derived-hooks.md | 139 +++++++++++++++++++++
 ...it_check_providers_subpackages_all_have_init.py |  18 +--
 3 files changed, 197 insertions(+), 8 deletions(-)

diff --git 
a/airflow/providers/common/sql/doc/adr/0001-record-architecture-decisions.md 
b/airflow/providers/common/sql/doc/adr/0001-record-architecture-decisions.md
new file mode 100644
index 0000000000..479f9d066d
--- /dev/null
+++ b/airflow/providers/common/sql/doc/adr/0001-record-architecture-decisions.md
@@ -0,0 +1,48 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+ -->
+
+<!-- START doctoc generated TOC please keep comment here to allow auto update 
-->
+<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
+**Table of Contents**  *generated with 
[DocToc](https://github.com/thlorenz/doctoc)*
+
+- [1. Record architecture decisions](#1-record-architecture-decisions)
+  - [Status](#status)
+  - [Context](#context)
+  - [Decision](#decision)
+  - [Consequences](#consequences)
+
+# 1. Record architecture decisions
+
+Date: 2023-12-01
+
+## Status
+
+Accepted
+
+## Context
+
+We need to record the architectural decisions made on this project.
+
+## Decision
+
+We will use Architecture Decision Records, as [described by Michael 
Nygard](http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions).
+
+## Consequences
+
+See Michael Nygard's article, linked above. For a lightweight ADR toolset, see 
Nat Pryce's [adr-tools](https://github.com/npryce/adr-tools).
diff --git 
a/airflow/providers/common/sql/doc/adr/0002-return-common-data-structure-from-dbapihook-derived-hooks.md
 
b/airflow/providers/common/sql/doc/adr/0002-return-common-data-structure-from-dbapihook-derived-hooks.md
new file mode 100644
index 0000000000..e6f8834395
--- /dev/null
+++ 
b/airflow/providers/common/sql/doc/adr/0002-return-common-data-structure-from-dbapihook-derived-hooks.md
@@ -0,0 +1,139 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+ -->
+
+# 2. Return common data structure from DBApiHook derived hooks
+
+Date: 2023-12-01
+
+## Status
+
+Accepted
+
+## Context
+
+Note: This ADR describes a decision made (but not recorded) when the common.sql
+provider was introduced in July 2022; it is being recorded in December 2023 so
+that the decision is documented.
+
+Before the common.sql provider, we had a number of DBAPI-derived Hooks that
+interfaced with Python DBAPI-compliant databases, and the format of the data
+returned by these hooks differed depending on the implementation of the DBAPI
+interface as well as on the implementation of the Hooks and operators. We also
+had many very similar operators performing similar tasks - querying, sensing
+the database, column checks, etc. This led to a lot of code duplication, and we
+decided that we needed a common set of operators usable across all
+DBAPI-compliant databases.
+
+Unfortunately, there is no common standard for the data returned by the Python
+DBAPI interface. It is partially standardized by
+[PEP-0249](https://peps.python.org/pep-0249/), but the specification is not
+very strict and allows a lot of flexibility and interpretation. Consequently,
+the data returned by the DBAPI interface can contain tuples, named tuples,
+lists, dictionaries, or even custom objects, and there is no guarantee that the
+data returned by the DBAPI interface is directly serializable.
+
+Not having a common standard format made it difficult to implement the planned
+open-lineage column-level integration with the database Operators and, at a
+later stage, Hooks.
+
+## Decision
+
+We decided to introduce a common.sql provider containing a set of operators
+that can be used across all DBAPI-compliant databases. We also decided that the
+data structure returned by the operators should be consistent across all of
+them. For simplicity of transition, we chose the format returned by the `run`
+method of the ``DBApiHook`` class, which was very close to what most
+DBAPI-compliant Hooks already returned, even if it was not the optimal format.
+In this case, backwards compatibility trumped optimality.
+
+We decided that more optimal formats (possibly some form of DataFrames) might
+be introduced in the future if we find a need for them. However, this is likely
+not even necessary, because Pandas and similar libraries already have excellent
+support for converting many of the formats returned by the DBAPI interface to
+DataFrames, and we already leverage that by exposing Pandas functionality via
+the `get_pandas_df` method of the ``DBApiHook`` class.
+
+The goal of the change was to standardize the format of the returned data so
+that it could be used directly through the existing DBAPI Airflow operators,
+with minimal backwards-compatibility problems, while deprecating all the
+existing DBAPI operators and redirecting them to the new operators defined in
+the common.sql provider.
+
+Therefore, the format of the data returned by the Hook can be one of:
+
+* the base return value: a list of tuples, where each tuple is a row returned
+  by the query. The tuples are not named tuples; they should be directly
+  serializable and should not contain column metadata such as column names.
+* a single tuple (with the same properties as above) - when the query returns
+  a single row, depending on the parameters passed (e.g. `return_last=True`)
+* a list of lists of tuples - when the operator receives multiple queries to
+  execute, where each list of tuples is the result of a single query. This is
+  also the default format when the ``split_statements`` parameter is set to
+  ``True``, indicating that the query passed contains multiple statements and
+  it is up to the hook implementation to split the statements and execute them
+  separately.
+* None - when the `run` method executes a query that does not return any rows
+  (e.g. `INSERT`)
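The four return shapes above can be sketched in plain Python. This is an illustrative classifier only (not the actual Airflow implementation, and `describe_shape` is a hypothetical helper name):

```python
# Hypothetical sketch: classify a hook result against the four return formats
# the ADR enumerates. Rows are plain tuples without column metadata.
from typing import Any


def describe_shape(result: Any) -> str:
    if result is None:
        return "no rows (e.g. INSERT)"
    if isinstance(result, tuple):
        return "single row"
    if isinstance(result, list) and result and isinstance(result[0], list):
        return "list of lists of rows (multiple statements)"
    return "list of rows (single statement)"


print(describe_shape([("alice", 1), ("bob", 2)]))  # list of rows (single statement)
print(describe_shape(("alice", 1)))                # single row
print(describe_shape([[("a",)], [("b",)]]))        # list of lists of rows (multiple statements)
print(describe_shape(None))                        # no rows (e.g. INSERT)
```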
+
+Additionally, it has been agreed that the description returned by the `run`
+method should be stored in the ``descriptions`` field of the Hook, separately
+from the returned data. This is because the description is not always available
+in the DBAPI interface, and it is not always available in the same format. The
+``descriptions`` field is always an array of the 7-element tuples described in
+PEP-0249, where each element of the array corresponds to a single query passed
+to the hook. There is also a separate ``last_description`` property that
+contains the description of the last query passed to the hook - in particular,
+the only query if there was just one.
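A hedged sketch of that layout, with made-up column values for illustration (the 7-element tuple shape itself comes from PEP-0249: name, type_code, display_size, internal_size, precision, scale, null_ok):

```python
# Illustrative data only: one PEP-0249 7-element tuple per column,
# one list of such tuples per query passed to the hook.
descriptions = [
    # description of query 1: e.g. SELECT id, name FROM users
    [
        ("id", "INTEGER", None, None, None, None, False),
        ("name", "VARCHAR", None, None, None, None, True),
    ],
]

# description of the last (here: only) query
last_description = descriptions[-1]
column_names = [col[0] for col in last_description]
print(column_names)  # ['id', 'name']
```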
+
+The DBApiHook implements the base `run` method that returns the list of tuples
+directly from the DBAPI result. The `run` method is extendable - it can receive
+handlers that convert the data returned by the direct DBAPI calls into the
+common format.
+
+For backwards compatibility, Hooks can implement their own parameters that
+allow returning the "old" format of the data. For example, in SnowflakeHook we
+have the `return_dict` parameter that allows returning the data as a list of
+dictionaries instead of a list of tuples. This made it easy to keep backwards
+compatibility with the operators our users already had, and allowed for an
+easy, even gradual, transition to the new common format if the user saw a
+benefit in doing so (for example, to support column-based open-lineage, or to
+switch from the old, deprecated specialized operators to the new ones).
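Such an "old format" switch can be approximated by rebuilding dictionaries from the common list-of-tuples plus the cursor description. The helper below is hypothetical (not the SnowflakeHook code), shown only to make the idea concrete:

```python
# Hypothetical backwards-compatibility helper: reconstruct the list-of-dicts
# format from the common list-of-tuples and the PEP-0249 description.
def rows_as_dicts(rows, description):
    names = [col[0] for col in description]
    return [dict(zip(names, row)) for row in rows]


description = [
    ("id", None, None, None, None, None, False),
    ("name", None, None, None, None, None, True),
]
rows = [(1, "alice"), (2, "bob")]
print(rows_as_dicts(rows, description))
# [{'id': 1, 'name': 'alice'}, {'id': 2, 'name': 'bob'}]
```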
+
+We also decided that the common.sql provider would implement operators that can
+instantiate the Hooks based on the connection type/URI and execute commands
+using that hook, while deprecating the use of the specialized ones.
+
+The new operators are:
+
+* ``SQLExecuteQueryOperator`` - to execute SQL queries
+* ``SQLColumnCheckOperator`` - to perform various checks (declarative) on 
columns in the database and
+  return matching rows
+* ``SQLTableCheckOperator`` - to perform various checks (declarative) on 
tables in the database and
+  return calculated results (usually aggregated calculations)
+* ``SQLCheckOperator`` - to perform various checks (declarative) on the 
database and return success/failure
+  based on the checks performed
+* ``SQLValueCheckOperator`` - to perform various checks (declarative) on 
values in the database and
+  return matching rows
+* ``SQLIntervalCheckOperator`` - to perform various checks (declarative) on 
intervals in the database and
+  return matching rows
+* ``BranchSQLOperator`` - branch operator where branching is based on the SQL
+  and parameters provided
+* ``SQLSensor`` - sensor that waits for a certain condition to be met in the 
database
+
+
+## Consequences
+
+The consequence of this decision is that the data returned by the operators in
+the common.sql provider is consistent across all the operators, making it easy
+to use with various databases in a similar way and easy to integrate with other
+components such as open-lineage. The returned data is not the optimal format
+for every database, but it is a good compromise that allows for an easy
+transition and backwards compatibility with the existing operators.
+
+Column-lineage integration is possible both with the data returned by the
+operators in the common.sql provider and with the data returned directly by the
+hook, which can be stored in XCom or serialized in other ways (for example, as
+CSV records).
+
+The number of SQL/DBAPI operators implemented is therefore much smaller, and
+the code to maintain lives only in the common.sql provider. The code is also
+much more consistent and easier to maintain and extend.
diff --git 
a/scripts/ci/pre_commit/pre_commit_check_providers_subpackages_all_have_init.py 
b/scripts/ci/pre_commit/pre_commit_check_providers_subpackages_all_have_init.py
index f06425a92b..2407bffe9a 100755
--- 
a/scripts/ci/pre_commit/pre_commit_check_providers_subpackages_all_have_init.py
+++ 
b/scripts/ci/pre_commit/pre_commit_check_providers_subpackages_all_have_init.py
@@ -20,17 +20,20 @@ from __future__ import annotations
 import os
 import sys
 from glob import glob
+from pathlib import Path
 
 ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir, 
os.pardir, os.pardir))
 
 
 def check_dir_init_file(provider_files: list[str]) -> None:
-    missing_init_dirs = []
-    for path in provider_files:
-        if path.endswith("/__pycache__"):
+    missing_init_dirs: list[Path] = []
+    for candidate_path in provider_files:
+        if candidate_path.endswith("/__pycache__"):
             continue
-        if os.path.isdir(path) and not os.path.exists(os.path.join(path, 
"__init__.py")):
-            missing_init_dirs.append(path)
+        path = Path(candidate_path)
+        if path.is_dir() and not (path / "__init__.py").exists():
+            if path.name != "adr" and path.name != "doc":
+                missing_init_dirs.append(path)
 
     if missing_init_dirs:
         with open(os.path.join(ROOT_DIR, 
"scripts/ci/license-templates/LICENSE.txt")) as license:
@@ -38,11 +41,10 @@ def check_dir_init_file(provider_files: list[str]) -> None:
         prefixed_licensed_txt = [f"# {line}" if line != "\n" else "#\n" for 
line in license_txt]
 
         for missing_init_dir in missing_init_dirs:
-            with open(os.path.join(missing_init_dir, "__init__.py"), "w") as 
init_file:
-                init_file.write("".join(prefixed_licensed_txt))
+            (missing_init_dir / 
"__init__.py").write_text("".join(prefixed_licensed_txt))
 
         print("No __init__.py file was found in the following provider 
directories:")
-        print("\n".join(missing_init_dirs))
+        print("\n".join([missing_init_dir.as_posix() for missing_init_dir in 
missing_init_dirs]))
         print("\nThe missing __init__.py files have been created. Please add 
these new files to a commit.")
         sys.exit(1)
 
