rahul-madaan commented on code in PR #66844:
URL: https://github.com/apache/airflow/pull/66844#discussion_r3240176523


##########
providers/amazon/src/airflow/providers/amazon/aws/hooks/athena_sql.py:
##########
@@ -177,6 +177,36 @@ def _get_conn_params(self) -> dict[str, str | None]:
             aws_domain=self.conn.extra_dejson.get("aws_domain", "amazonaws.com"),
         )
 
+    def get_openlineage_database_info(self, connection):
+        """Return Amazon Athena specific information for OpenLineage."""
+        from airflow.providers.openlineage.sqlparser import DatabaseInfo
+
+        region_name = connection.extra_dejson.get("region_name") or self.region_name
+        authority = f"athena.{region_name}.amazonaws.com" if region_name else "athena.amazonaws.com"
+
+        return DatabaseInfo(
+            scheme="awsathena",
+            authority=authority,
+            information_schema_columns=[
+                "table_schema",
+                "table_name",
+                "column_name",
+                "ordinal_position",
+                "data_type",
+                "table_catalog",
+            ],
+            database=connection.extra_dejson.get("catalog", "AwsDataCatalog"),
+            is_information_schema_cross_db=True,

Review Comment:
   Hi @kacpermuda, yes — validated against a real Athena instance before 
opening the PR.
   
   ## Full evidence
   
   ### Engine version is Trino (v3) — confirms dialect choice
   
   ```bash
   $ aws athena list-work-groups --region us-east-1
   {
     "WorkGroups": [
       {
         "Name": "primary",
         "EngineVersion": { "EffectiveEngineVersion": "Athena engine version 3" }
       }
     ]
   }
   ```
   
   ### `AwsDataCatalog` is the live default catalog
   
   ```bash
   $ aws athena list-data-catalogs --region us-east-1
   {
     "DataCatalogsSummary": [
       {
         "CatalogName": "AwsDataCatalog",
         "Type": "GLUE",
         "Status": "CREATE_COMPLETE"
       }
     ]
   }
   ```
   
   ### Real query against `information_schema.columns` with the exact 6 columns declared by the hook, confirming `information_schema_columns` is correct
   
   ```bash
   $ aws athena start-query-execution \
       --query-string "
         SELECT table_schema, table_name, column_name, ordinal_position, data_type, table_catalog
         FROM information_schema.columns
         WHERE table_schema='information_schema'
         LIMIT 3
       " \
       --query-execution-context "Database=information_schema,Catalog=AwsDataCatalog" ...
   ```
   
   **Result:** `State=SUCCEEDED`, `DataScanned=3401 bytes`, 
`EngineVersion="Athena engine version 3"`
   
   | table_schema       | table_name       | column_name  | ordinal_position | data_type | table_catalog  |
   | ------------------ | ---------------- | ------------ | ---------------- | --------- | -------------- |
   | information_schema | applicable_roles | grantee      | 1                | varchar   | awsdatacatalog |
   | information_schema | applicable_roles | grantee_type | 2                | varchar   | awsdatacatalog |
   | information_schema | applicable_roles | role_name    | 3                | varchar   | awsdatacatalog |
   
   All six column names project correctly with the expected types — same as 
`TrinoHook`.
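
For anyone who wants to reproduce the check from Python instead of the CLI, here is a minimal sketch. It assumes `client` behaves like `boto3.client("athena")` and that the workgroup already has a query result location configured. The helper name `run_athena_query` and its defaults are made up for illustration and are not part of this PR.

```python
import time


def run_athena_query(client, sql, database="information_schema",
                     catalog="AwsDataCatalog", workgroup="primary",
                     poll_seconds=1.0, max_polls=60):
    """Start a query, poll until it reaches a terminal state, return (state, rows).

    Hypothetical helper: `client` is assumed to expose the same methods as
    boto3.client("athena"); nothing here comes from the Airflow provider.
    """
    query_id = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database, "Catalog": catalog},
        WorkGroup=workgroup,
    )["QueryExecutionId"]

    state = "QUEUED"
    for _ in range(max_polls):
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        return state, []

    result = client.get_query_results(QueryExecutionId=query_id)
    # Each row is a list of cells; a cell without a value has no "VarCharValue" key.
    rows = [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
    return state, rows
```

For a `SELECT`, the first returned row normally holds the column headers, so comparing it with the hook's `information_schema_columns` list is a quick sanity check.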
   
   ### Real cross-DB query against `information_schema.tables` succeeds, confirming `is_information_schema_cross_db=True`
   
   ```bash
   $ aws athena start-query-execution \
       --query-string "
         SELECT table_catalog, table_schema, table_name
         FROM information_schema.tables
         WHERE table_schema='information_schema'
         LIMIT 3
       " ...
   ```
   
   **Result:** `State=SUCCEEDED`, `DataScanned=452 bytes`
   
   ## On `use_flat_cross_db_query`
   
   Good catch on the Redshift comparison. I deliberately left it as the default 
`False` because Athena and Redshift have fundamentally different metadata 
models:
   
   - **Redshift** uses `SVV_REDSHIFT_COLUMNS`, a single global system view 
spanning all databases. That's why it needs `use_flat_cross_db_query=True` — to 
query the one view with `WHERE`-clause database filters.
   - **Athena/Trino** use the standard per-catalog `information_schema` (the result above shows `table_catalog = awsdatacatalog` populated correctly). There is no single global view; cross-DB queries work natively through Trino's 3-part naming (`catalog.schema.table`). That is the shape `use_flat_cross_db_query=False` combined with `is_information_schema_cross_db=True` generates: one query per database, combined with `UNION ALL`.
   
   This matches `TrinoHook.get_openlineage_database_info()` 1:1 — `TrinoHook` 
also doesn't set `use_flat_cross_db_query`, and its 
`information_schema_columns` list is identical. Athena engine v3 is Trino under 
the hood, so the same OL parameters apply.
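
To make the difference between the two query shapes concrete, here is a rough sketch. It is not the OpenLineage provider's actual SQL generation: `build_per_catalog_query` and `build_flat_query` are hypothetical helpers, `other_catalog` is a made-up catalog name, and the string formatting is for illustration only (no escaping, not safe for untrusted input).

```python
INFO_SCHEMA_COLUMNS = [
    "table_schema",
    "table_name",
    "column_name",
    "ordinal_position",
    "data_type",
    "table_catalog",
]


def build_per_catalog_query(tables_by_catalog):
    """Athena/Trino shape: one SELECT per catalog, combined with UNION ALL."""
    column_list = ", ".join(INFO_SCHEMA_COLUMNS)
    selects = []
    for catalog, tables in tables_by_catalog.items():
        table_list = ", ".join(f"'{name}'" for name in tables)
        selects.append(
            f"SELECT {column_list} "
            f"FROM {catalog}.information_schema.columns "  # 3-part naming
            f"WHERE table_name IN ({table_list})"
        )
    return "\nUNION ALL\n".join(selects)


def build_flat_query(view, database_column, tables_by_database):
    """Redshift shape: one global view, filtered per database in the WHERE clause."""
    predicates = []
    for database, tables in tables_by_database.items():
        table_list = ", ".join(f"'{name}'" for name in tables)
        predicates.append(
            f"({database_column} = '{database}' AND table_name IN ({table_list}))"
        )
    return f"SELECT * FROM {view} WHERE " + " OR ".join(predicates)


# Illustrative inputs only.
print(build_per_catalog_query({"awsdatacatalog": ["orders"], "other_catalog": ["events"]}))
print(build_flat_query("SVV_REDSHIFT_COLUMNS", "database_name", {"dev": ["orders"]}))
```

The first helper touches each catalog's own `information_schema`, which is why no global view is needed. The second depends on a single view that already spans every database.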



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
