Ilya Kisil created AIRFLOW-5060:
-----------------------------------

             Summary: Add support of CatalogId to AwsGlueCatalogHook
                 Key: AIRFLOW-5060
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5060
             Project: Apache Airflow
          Issue Type: New Feature
          Components: hooks
    Affects Versions: 1.10.3
            Reporter: Ilya Kisil
            Assignee: Ilya Kisil


h2. Use Case

Imagine that you stream data into an S3 bucket of *account A* and update the AWS 
Glue Data Catalog on a daily basis, so that you can query new data with AWS 
Athena. Now let's assume that you provide access to this S3 bucket to an 
external *account B*, which wants to use its own AWS Athena to query your data 
in exactly the same way. Unfortunately, *account B* would need to have 
exactly the same table definitions in its own AWS Glue Data Catalog, because AWS 
Athena cannot run against an external Glue Data Catalog. However, the AWS Glue 
service supports [cross-account datacatalog 
access|https://docs.aws.amazon.com/glue/latest/dg/cross-account-access.html], 
which means that *account B* can simply copy/sync meta information about 
databases, tables, partitions, etc. from the Glue Data Catalog of *account A*, 
provided the additional permissions have been granted. Thus, all methods in 
*AwsGlueCatalogHook* should accept a "CatalogId", i.e. the ID of the Data 
Catalog from which to retrieve/create/delete.
h2. How it fits into Airflow

Assume that you have an AWSAthenaOperator which queries data once a day; the 
result is then retrieved, visualised locally and uploaded to some 
server/website. Before this happens, you simply need an operator 
(even a PythonOperator would do) which has two hooks, one to the source catalog 
and another to the destination catalog. At run time, it would use the source 
hook to retrieve information from *account A*, for example 
[get_partitions()|https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_partitions],
 then parse the response, remove unnecessary keys, and finally use the 
destination hook to update the *account B* data catalog with 
[batch_create_partition()|https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.batch_create_partition]
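The "parse the response and remove unnecessary keys" step could look roughly like the helper below. This is only a sketch: the function name is made up, but the key lists follow the boto3 Glue documentation, where each entry of the *PartitionInputList* passed to *batch_create_partition* accepts only a subset of the fields that *get_partitions* returns.

```python
# Keys that glue batch_create_partition accepts in each PartitionInput
# entry (per the boto3 Glue documentation). Everything else returned by
# get_partitions (DatabaseName, TableName, CreationTime, ...) must be
# stripped before the response can be replayed into another catalog.
PARTITION_INPUT_KEYS = {
    "Values",
    "LastAccessTime",
    "StorageDescriptor",
    "Parameters",
    "LastAnalyzedTime",
}


def to_partition_inputs(get_partitions_response):
    """Trim a get_partitions response down to a valid PartitionInputList."""
    return [
        {k: v for k, v in partition.items() if k in PARTITION_INPUT_KEYS}
        for partition in get_partitions_response.get("Partitions", [])
    ]
```

A hypothetical sync operator would then pass the result straight to the destination hook's client, e.g. `client.batch_create_partition(CatalogId=..., DatabaseName=..., TableName=..., PartitionInputList=to_partition_inputs(response))`.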

 
h2. Proposal
 * Add a parameter *catalog_id* to AwsGlueCatalogHook, which will then be used 
in all of its methods, regardless of whether the hook is associated with the 
source or the destination data catalog.
 * In order not to break the existing implementation, *catalog_id* defaults to 
*None*. We also add a method *fallback_catalog_id()*, which uses AWS STS to 
infer the Catalog ID associated with the configured *aws_conn_id*. The inferred 
value would be used only if *catalog_id* hasn't been provided during hook 
creation.
 * Extend the available methods of *AwsGlueCatalogHook* in a way similar to the 
already existing ones, for the convenience of the workflow described above. 
Note: all new methods should strictly adhere to the AWS Glue client request 
syntax, and do so in a transparent manner. This means that input information 
shouldn't be modified within a method. When such actions are required, they 
should be performed outside of the AwsGlueCatalogHook.
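The fallback could be sketched as below, assuming the hook can obtain a boto3 STS client for the same *aws_conn_id* (the account ID returned by *get_caller_identity* is exactly the default Catalog ID of that account). The function names and the client-injection style are illustrative, not a final API:

```python
def fallback_catalog_id(sts_client):
    """Infer the Catalog ID for the current credentials via AWS STS.

    An AWS account's default Glue Data Catalog ID is simply the account
    ID, which STS reports for whatever credentials the hook's
    aws_conn_id resolves to.
    """
    return sts_client.get_caller_identity()["Account"]


def effective_catalog_id(catalog_id, sts_client):
    """Use the explicit catalog_id when given, else fall back to STS."""
    if catalog_id is not None:
        return catalog_id
    return fallback_catalog_id(sts_client)
```

With this, `AwsGlueCatalogHook(catalog_id=None)` keeps today's behaviour (operate on the caller's own catalog), while passing an explicit ID targets a foreign catalog.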

h2. Implementation
 * I am happy to contribute to Airflow if this feature request gets approved.

h2. Other considerations
 * At the moment the existing method *get_partitions* doesn't provide 
all the meta information about partitions that is available from the Glue 
client, whereas *get_table* does. I don't know the best way around this, but 
IMHO it should be renamed to *get_partitions_values* or something like that. 
That way, we would stay in line with the boto3 Glue client.
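To illustrate the naming concern: the current hook reduces the Glue response to bare tuples of partition values, discarding *StorageDescriptor* and the rest of the per-partition metadata. A sketch of that reduction (the function name matches the proposed rename; the response shape follows the boto3 Glue documentation):

```python
def get_partitions_values(get_partitions_response):
    """What the current AwsGlueCatalogHook.get_partitions effectively
    returns: only the set of partition-value tuples, with
    StorageDescriptor, Parameters and all other per-partition
    metadata dropped."""
    return {
        tuple(partition["Values"])
        for partition in get_partitions_response.get("Partitions", [])
    }
```

A method with this behaviour named *get_partitions* is misleading next to boto3's *get_partitions*, which returns the full partition objects; hence the proposed rename.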

 

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
