[GitHub] [airflow] denimalpaca commented on a diff in pull request #25164: Common SQLCheckOperators Various Functionality Update

GitBox Thu, 25 Aug 2022 07:07:10 -0700


denimalpaca commented on code in PR #25164:
URL: https://github.com/apache/airflow/pull/25164#discussion_r955014928



##########
airflow/providers/common/sql/operators/sql.py:
##########
@@ -273,38 +303,38 @@ def __init__(
 
         self.table = table
         self.checks = checks
+        self.partition_clause = partition_clause
         # OpenLineage needs a valid SQL query with the input/output table(s) to parse
         self.sql = f"SELECT * FROM {self.table};"
 
     def execute(self, context=None):
         hook = self.get_db_hook()
-
-        check_names = [*self.checks]
-        check_mins_sql = ",".join(
-            self.sql_min_template.replace("check_name", check_name) for check_name in check_names
-        )
-        checks_sql = ",".join(
+        checks_sql = " UNION ALL ".join(
             [
-                self.sql_check_template.replace("check_statement", value["check_statement"]).replace(
-                    "check_name", check_name
-                )
+                self.sql_check_template.replace("check_statement", value["check_statement"])
+                .replace("_check_name", check_name)
+                .replace("table", self.table)
                 for check_name, value in self.checks.items()
             ]
         )
+        partition_clause_statement = f"WHERE {self.partition_clause}" if self.partition_clause else ""
+        self.sql = f"SELECT check_name, check_result FROM ({checks_sql}) "
+        f"AS check_table {partition_clause_statement};"
 
-        self.sql = f"SELECT {check_mins_sql} FROM (SELECT {checks_sql} FROM {self.table});"
-        records = hook.get_first(self.sql)
+        records = hook.get_pandas_df(self.sql)

Review Comment:
   This was changed because of how the SQL was written for `hook.get_first`: 
that query returned *only* fully aggregated checks. Returning the other checks 
means changing the query syntax, and the new query returns multiple rows, so 
it needs either a `fetch_all` or a `get_pandas_df` call. Using pandas here 
seemed much easier and possibly more efficient, but if a `fetch_all` seems 
more reasonable, this can be changed. Happy to explain more about the specific 
issue if you're curious.
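
To illustrate the row-count point, here is a minimal sketch that uses the standard-library `sqlite3` module as a stand-in for the database hook. The table `orders`, the two check names, and the simplified per-check `SELECT` are assumptions made for this example and are not the operator's actual templates. It only shows that a `UNION ALL` query yields one row per check, so a first-row fetch sees a single check while a fetch-all sees every check.

```python
import sqlite3

# Hypothetical table and checks, used only for illustration.
TABLE = "orders"
CHECKS = {
    "row_count_check": "COUNT(*) >= 3",
    "amount_positive_check": "MIN(amount) > 0",
}


def build_checks_sql(table, checks):
    """Build one SELECT per check and join them with UNION ALL."""
    return " UNION ALL ".join(
        f"SELECT '{name}' AS check_name, "
        f"CASE WHEN {statement} THEN 1 ELSE 0 END AS check_result "
        f"FROM {table}"
        for name, statement in checks.items()
    )


def run_demo():
    """Run the combined query and return (first row, all rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {TABLE} (id INTEGER, amount REAL)")
    conn.executemany(
        f"INSERT INTO {TABLE} VALUES (?, ?)",
        [(1, 10.0), (2, 5.5), (3, -1.0)],
    )
    sql = (
        "SELECT check_name, check_result "
        f"FROM ({build_checks_sql(TABLE, CHECKS)}) AS check_table"
    )
    first_row = conn.execute(sql).fetchone()  # one check only
    all_rows = conn.execute(sql).fetchall()   # one row per check
    conn.close()
    return first_row, all_rows


if __name__ == "__main__":
    first, rows = run_demo()
    print("first row:", first)
    print("all rows:", rows)
```

With a single-row fetch, any check after the first would be silently ignored, which is why the multi-row query needs a call that returns the full result set.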


