Ilya Kisil created AIRFLOW-5060:
-----------------------------------

             Summary: Add support of CatalogId to AwsGlueCatalogHook
                 Key: AIRFLOW-5060
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5060
             Project: Apache Airflow
          Issue Type: New Feature
          Components: hooks
    Affects Versions: 1.10.3
            Reporter: Ilya Kisil
            Assignee: Ilya Kisil


h2. Use Case

Imagine that you stream data into an S3 bucket of *account A* and update the AWS 
Glue Data Catalog on a daily basis, so that you can query new data with AWS 
Athena. Now let's assume that you provide access to this S3 bucket to an 
external *account B*, which wants to use its own AWS Athena to query your data 
in exactly the same way. Unfortunately, *account B* would need to have 
exactly the same table definitions in its own AWS Glue Data Catalog, because AWS 
Athena cannot run against an external Glue Data Catalog. However, the AWS Glue 
service supports [cross-account datacatalog 
access|https://docs.aws.amazon.com/glue/latest/dg/cross-account-access.html], 
which means that *account B* can simply copy/sync meta information about 
databases, tables, partitions, etc. from the Glue Data Catalog of *account A*, 
provided the additional permissions have been granted. Thus, all methods in 
*AwsGlueCatalogHook* should accept a "CatalogId", i.e. the ID of the Data 
Catalog from which to retrieve/create/delete.
h2. How it fits into Airflow

Assume that you have an AWSAthenaOperator which queries data once a day; the 
result is then retrieved, visualised locally and uploaded to some 
server/website. Before this happens, you simply need an operator 
(even a PythonOperator would do) which has two hooks, one to the source catalog 
and another to the destination catalog. At run time, it would use the source 
hook to retrieve information from *account A*, for example 
[get_partitions()|https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_partitions],
 then parse the response, remove unnecessary keys, and finally use the 
destination hook to update the *account B* data catalog with 
[batch_create_partition()|https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.batch_create_partition]
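The "parse the response and remove unnecessary keys" step could look roughly like the helper below. This is only a sketch: the function name is made up, but the key lists follow the boto3 Glue documentation, where each entry of the *PartitionInputList* passed to *batch_create_partition* accepts only a subset of the fields that *get_partitions* returns.

```python
# Keys that glue batch_create_partition accepts in each PartitionInput
# entry (per the boto3 Glue documentation). Everything else returned by
# get_partitions (DatabaseName, TableName, CreationTime, ...) must be
# stripped before the response can be replayed into another catalog.
PARTITION_INPUT_KEYS = {
    "Values",
    "LastAccessTime",
    "StorageDescriptor",
    "Parameters",
    "LastAnalyzedTime",
}


def to_partition_inputs(get_partitions_response):
    """Trim a get_partitions response down to a valid PartitionInputList."""
    return [
        {k: v for k, v in partition.items() if k in PARTITION_INPUT_KEYS}
        for partition in get_partitions_response.get("Partitions", [])
    ]
```

A hypothetical sync operator would then pass the result straight to the destination hook's client, e.g. `client.batch_create_partition(CatalogId=..., DatabaseName=..., TableName=..., PartitionInputList=to_partition_inputs(response))`.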

 
h2. Proposal
 * Add a parameter *catalog_id* to AwsGlueCatalogHook, which will then be used 
in all of its methods, regardless of whether the hook is associated with the 
source or the destination data catalog.
 * In order not to break the existing implementation, *catalog_id* defaults to 
*None*. We also add a method *fallback_catalog_id()*, which uses AWS STS to 
infer the Catalog ID associated with the configured *aws_conn_id*. The inferred 
value would be used only if *catalog_id* hasn't been provided during hook 
creation.
 * Extend the available methods of *AwsGlueCatalogHook* in a way similar to the 
already existing ones, for the convenience of the workflow described above. 
Note: all new methods should strictly adhere to the AWS Glue client request 
syntax, and do so in a transparent manner. This means that input information 
shouldn't be modified within a method. When such actions are required, they 
should be performed outside of the AwsGlueCatalogHook.
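The fallback could be sketched as below, assuming the hook can obtain a boto3 STS client for the same *aws_conn_id* (the account ID returned by *get_caller_identity* is exactly the default Catalog ID of that account). The function names and the client-injection style are illustrative, not a final API:

```python
def fallback_catalog_id(sts_client):
    """Infer the Catalog ID for the current credentials via AWS STS.

    An AWS account's default Glue Data Catalog ID is simply the account
    ID, which STS reports for whatever credentials the hook's
    aws_conn_id resolves to.
    """
    return sts_client.get_caller_identity()["Account"]


def effective_catalog_id(catalog_id, sts_client):
    """Use the explicit catalog_id when given, else fall back to STS."""
    if catalog_id is not None:
        return catalog_id
    return fallback_catalog_id(sts_client)
```

With this, `AwsGlueCatalogHook(catalog_id=None)` keeps today's behaviour (operate on the caller's own catalog), while passing an explicit ID targets a foreign catalog.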

h2. Implementation
 * I am happy to contribute to Airflow if this feature request gets approved.

h2. Other considerations
 * At the moment the existing method *get_partitions* doesn't provide 
all the meta information about partitions that is available from the Glue 
client, whereas *get_table* does. I don't know the best way around this, but 
IMHO it should be renamed to *get_partitions_values* or something like that. 
That way, we would stay in line with the boto3 Glue client.
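To illustrate the naming concern: the current hook reduces the Glue response to bare tuples of partition values, discarding *StorageDescriptor* and the rest of the per-partition metadata. A sketch of that reduction (the function name matches the proposed rename; the response shape follows the boto3 Glue documentation):

```python
def get_partitions_values(get_partitions_response):
    """What the current AwsGlueCatalogHook.get_partitions effectively
    returns: only the set of partition-value tuples, with
    StorageDescriptor, Parameters and all other per-partition
    metadata dropped."""
    return {
        tuple(partition["Values"])
        for partition in get_partitions_response.get("Partitions", [])
    }
```

A method with this behaviour named *get_partitions* is misleading next to boto3's *get_partitions*, which returns the full partition objects; hence the proposed rename.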

 

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
