[
https://issues.apache.org/jira/browse/AIRFLOW-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978309#comment-16978309
]
Ilya Kisil commented on AIRFLOW-5060:
-------------------------------------
[~jackjack10] another thing to consider when extending the Glue hook with
other available methods: unit tests. Not sure how straightforward they would be.
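
As a rough idea of what such a test could look like, here is a minimal sketch. It assumes the new hook methods are thin wrappers that forward CatalogId to the boto3 Glue client. The helper get_database_from_catalog is hypothetical and only stands in for a hook method; the client is replaced with a unittest.mock.MagicMock, so no AWS access is needed.

```python
import unittest
from unittest import mock


def get_database_from_catalog(client, catalog_id, name):
    """Hypothetical stand-in for a hook method; forwards CatalogId unchanged."""
    response = client.get_database(CatalogId=catalog_id, Name=name)
    return response["Database"]


class TestGetDatabaseFromCatalog(unittest.TestCase):
    def test_catalog_id_is_forwarded(self):
        # The mocked client records calls instead of talking to AWS.
        client = mock.MagicMock()
        client.get_database.return_value = {"Database": {"Name": "events"}}

        result = get_database_from_catalog(client, "111111111111", "events")

        self.assertEqual(result, {"Name": "events"})
        client.get_database.assert_called_once_with(
            CatalogId="111111111111", Name="events"
        )


def run_tests():
    """Run the test case programmatically and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        TestGetDatabaseFromCatalog
    )
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

If every new method stays this thin, each test reduces to checking the arguments passed to the client and the part of the response that is returned.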
> Add support of CatalogId to AwsGlueCatalogHook
> ----------------------------------------------
>
> Key: AIRFLOW-5060
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5060
> Project: Apache Airflow
> Issue Type: New Feature
> Components: hooks
> Affects Versions: 1.10.3
> Reporter: Ilya Kisil
> Assignee: Ilya Kisil
> Priority: Minor
>
> h2. Use Case
> Imagine that you stream data into an S3 bucket of *account A* and update its
> AWS Glue datacatalog on a daily basis, so that you can query new data with AWS
> Athena. Now let's assume that you provided access to this S3 bucket to an
> external *account B*, which wants to use its own AWS Athena to query your data
> in exactly the same way. Unfortunately, *account B* would need to have
> exactly the same table definitions in its AWS Glue datacatalog, because AWS
> Athena cannot run against an external Glue datacatalog. However, the AWS Glue
> service supports [cross-account datacatalog
> access|https://docs.aws.amazon.com/glue/latest/dg/cross-account-access.html],
> which means that *account B* can simply copy/sync meta information about
> databases, tables, partitions etc. from the Glue datacatalog of *account A*,
> provided additional permissions have been granted. Thus, all methods in
> *AwsGlueCatalogHook* should use "CatalogId", i.e. the ID of the Data Catalog
> from which to retrieve/create/delete.
> h2. How it fits into Airflow
> Assume that you have an AWSAthenaOperator, which queries data once a day;
> the result is then retrieved, visualised locally and uploaded to some
> server/website. Before this happens, you simply need to create an
> operator (even a PythonOperator would do) which has two hooks, one for the
> source catalog and another for the destination catalog. At run time, it would
> use the source hook to retrieve information from *account A*, for example with
> [get_partitions()|https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_partitions],
> then parse the response and remove unnecessary keys, and finally use the
> destination hook to update the *account B* datacatalog with
> [batch_create_partition()|https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.batch_create_partition].
>
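
The copy step described above could look roughly like this. It is a minimal sketch, not an agreed design: the function names (to_partition_input, fetch_partitions, copy_partitions) are hypothetical, and both clients are assumed to behave like boto3 Glue clients, e.g. the objects the two hooks would return. The key filter follows the PartitionInput structure accepted by batch_create_partition, and the batch size of 100 is the documented per-call limit.

```python
# Keys accepted by Glue's PartitionInput; the other keys returned by
# get_partitions (DatabaseName, TableName, CreationTime, CatalogId)
# are rejected on create and have to be dropped.
PARTITION_INPUT_KEYS = (
    "Values",
    "LastAccessTime",
    "StorageDescriptor",
    "Parameters",
    "LastAnalyzedTime",
)


def to_partition_input(partition):
    """Keep only the keys that batch_create_partition accepts."""
    return {k: partition[k] for k in PARTITION_INPUT_KEYS if k in partition}


def fetch_partitions(client, catalog_id, database, table):
    """Yield every partition of a table, following NextToken pagination."""
    kwargs = {"CatalogId": catalog_id, "DatabaseName": database, "TableName": table}
    while True:
        response = client.get_partitions(**kwargs)
        yield from response.get("Partitions", [])
        token = response.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token


def copy_partitions(source_client, destination_client, source_catalog_id,
                    destination_catalog_id, database, table, batch_size=100):
    """Copy partition metadata between catalogs; return the number submitted."""
    inputs = [
        to_partition_input(p)
        for p in fetch_partitions(source_client, source_catalog_id, database, table)
    ]
    # batch_create_partition takes at most 100 partitions per call.
    for start in range(0, len(inputs), batch_size):
        destination_client.batch_create_partition(
            CatalogId=destination_catalog_id,
            DatabaseName=database,
            TableName=table,
            PartitionInputList=inputs[start:start + batch_size],
        )
    return len(inputs)
```

Note that the parsing happens outside of any hook method here, so the hook calls themselves keep the Glue request syntax untouched. The sketch ignores the Errors list that batch_create_partition returns; a real operator would probably want to inspect it.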
> h2. Proposal
> * Add a parameter *catalog_id* to AwsGlueCatalogHook, which will then be
> used in all its methods, regardless of whether this hook is associated with
> the source or the destination datacatalog.
> * In order not to break the existing implementation, we set *catalog_id=None*.
> But we add a method *fallback_catalog_id()*, which uses AWS STS to infer the
> Catalog ID associated with the *aws_conn_id* in use. The obtained value would
> be used if *catalog_id* hasn't been provided during hook creation.
> * Extend the available methods of *AwsGlueCatalogHook* in a similar way to
> the already existing ones, for convenience of the workflow described above.
> Note: all new methods should strictly adhere to the AWS Glue Client Request
> Syntax and do so in a transparent manner. This means that input information
> shouldn't be modified within a method. When such actions are required, they
> should be performed outside of the AwsGlueCatalogHook.
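
To make the first two points concrete, here is a minimal sketch of the fallback logic. The class name CatalogAwareGlueHook is hypothetical and this is not the actual AwsGlueCatalogHook: the Glue and STS clients are injected directly instead of being built from *aws_conn_id*, which is an assumption made to keep the example self-contained. It relies on the default Data Catalog ID being the AWS account ID, which STS get_caller_identity() reports under the "Account" key.

```python
class CatalogAwareGlueHook:
    """Hypothetical sketch of a hook with an optional catalog_id."""

    def __init__(self, glue_client, sts_client, catalog_id=None):
        self.glue_client = glue_client
        self.sts_client = sts_client
        self._catalog_id = catalog_id

    def fallback_catalog_id(self):
        # The default Data Catalog ID is the account ID of the caller.
        return self.sts_client.get_caller_identity()["Account"]

    @property
    def catalog_id(self):
        # Resolve lazily and cache, so STS is queried at most once and
        # only when no catalog_id was given at creation time.
        if self._catalog_id is None:
            self._catalog_id = self.fallback_catalog_id()
        return self._catalog_id

    def get_table(self, database_name, table_name):
        # Inputs are passed through unchanged, following the Glue
        # client request syntax.
        response = self.glue_client.get_table(
            CatalogId=self.catalog_id,
            DatabaseName=database_name,
            Name=table_name,
        )
        return response["Table"]
```

With this shape, a hook created without *catalog_id* keeps working against the caller's own catalog, while a hook created with the ID of *account A* reads from the external one.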
> h2. Implementation
> * I am happy to contribute to airflow if this feature request gets approved.
> h2. Other considerations
> * At the moment the existing method *get_partitions* doesn't provide you
> with all the meta information about partitions available from the Glue client,
> whereas *get_table* does. I don't know the best way around it, but imho it
> should be refactored to *get_partitions_values* or something like that. In
> this way, we would be able to stay in line with the boto3 Glue client.
>
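
One possible shape for that refactoring, as a minimal sketch only. It assumes the current behaviour is to return just the partition values, and it uses module-level functions with an injected boto3-style Glue client rather than real hook methods, so the signatures here are hypothetical. get_partitions returns the full partition structures from the client, and get_partitions_values derives the value tuples from them.

```python
def get_partitions(client, catalog_id, database_name, table_name, expression=None):
    """Return the full partition dicts exactly as the Glue client reports them."""
    kwargs = {
        "CatalogId": catalog_id,
        "DatabaseName": database_name,
        "TableName": table_name,
    }
    if expression:
        # Only send a filter expression when one was actually given.
        kwargs["Expression"] = expression
    paginator = client.get_paginator("get_partitions")
    return [
        partition
        for page in paginator.paginate(**kwargs)
        for partition in page["Partitions"]
    ]


def get_partitions_values(client, catalog_id, database_name, table_name,
                          expression=None):
    """Return only the partition values, one tuple per partition."""
    partitions = get_partitions(
        client, catalog_id, database_name, table_name, expression
    )
    return {tuple(partition["Values"]) for partition in partitions}
```

Callers that only need the values would switch to get_partitions_values, so renaming the existing method this way would be a breaking change that needs a deprecation note.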
--
This message was sent by Atlassian Jira
(v8.3.4#803005)