[jira] [Commented] (AIRFLOW-5060) Add support of CatalogId to AwsGlueCatalogHook

Ilya Kisil (Jira) Wed, 20 Nov 2019 02:56:25 -0800


    [ 
https://issues.apache.org/jira/browse/AIRFLOW-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978309#comment-16978309
 ]


Ilya Kisil commented on AIRFLOW-5060:
-------------------------------------

[~jackjack10] another thing to consider when extending the Glue hook with 
other available methods: unit tests. I'm not sure how straightforward they 
would be.
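
As a rough idea of what such tests might look like, here is a minimal sketch that stubs the boto3 Glue client with unittest.mock, so no AWS access is needed. The helper describe_table() is a hypothetical stand-in for a hook method (it is not existing Airflow code), and the account ID and names are made up; only the get_table() call and its CatalogId/DatabaseName/Name parameters come from the boto3 Glue client.

```python
import unittest
from unittest import mock


def describe_table(glue_client, catalog_id, database, table):
    """Hypothetical hook-style helper: fetch one table definition."""
    response = glue_client.get_table(
        CatalogId=catalog_id, DatabaseName=database, Name=table
    )
    return response["Table"]


class DescribeTableTest(unittest.TestCase):
    def test_catalog_id_is_forwarded(self):
        # The mock stands in for boto3.client("glue"); nothing touches AWS.
        glue_client = mock.Mock()
        glue_client.get_table.return_value = {"Table": {"Name": "events"}}

        table = describe_table(glue_client, "111111111111", "analytics", "events")

        self.assertEqual(table, {"Name": "events"})
        glue_client.get_table.assert_called_once_with(
            CatalogId="111111111111", DatabaseName="analytics", Name="events"
        )
```

Saved in a test module, this would run with "python -m unittest". The same pattern (return a canned response, then assert on the keyword arguments) should carry over to the other client calls.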

> Add support of CatalogId to AwsGlueCatalogHook
> ----------------------------------------------
>
>                 Key: AIRFLOW-5060
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5060
>             Project: Apache Airflow
>          Issue Type: New Feature
>          Components: hooks
>    Affects Versions: 1.10.3
>            Reporter: Ilya Kisil
>            Assignee: Ilya Kisil
>            Priority: Minor
>
> h2. Use Case
> Imagine that you stream data into an S3 bucket of *account A* and update the 
> AWS Glue Data Catalog on a daily basis, so that you can query new data with 
> AWS Athena. Now let's assume that you have granted access to this S3 bucket to 
> an external *account B*, which wants to use its own AWS Athena to query your 
> data in exactly the same way. Unfortunately, *account B* would need to have 
> exactly the same table definitions in its AWS Glue Data Catalog, because AWS 
> Athena cannot run against an external Glue Data Catalog. However, the AWS Glue 
> service supports [cross-account datacatalog 
> access|https://docs.aws.amazon.com/glue/latest/dg/cross-account-access.html], 
> which means that *account B* can simply copy/sync meta information about 
> databases, tables, partitions etc. from the Glue Data Catalog of *account A*, 
> provided additional permissions have been granted. Thus, all methods in 
> *AwsGlueCatalogHook* should use "CatalogId", i.e. the ID of the Data Catalog 
> from which to retrieve/create/delete.
> h2. How it fits into Airflow
> Assume that you have an AWSAthenaOperator which queries data once a day; the 
> result is then retrieved, visualised locally and uploaded to some 
> server/website. Before this happens, you simply need to run an operator 
> (even a PythonOperator would do) which has two hooks, one for the source 
> catalog and another for the destination catalog. At run time, it would use 
> the source hook to retrieve information from *account A*, for example with 
> [get_partitions()|https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_partitions],
>  then parse the response and remove unnecessary keys, and finally use the 
> destination hook to update the *account B* Data Catalog with 
> [batch_create_partition()|https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.batch_create_partition].
>  
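
The retrieve/strip/create flow described above could be sketched roughly as follows. This is an illustration under assumptions, not the proposed hook code: copy_partitions() and to_partition_input() are hypothetical names, and the two clients are assumed to be boto3 Glue clients created with the credentials of the destination account. The key list reflects the PartitionInput fields documented for the Glue API; the batch size of 100 is the documented per-call limit for batch_create_partition.

```python
from typing import Any, Dict, List

# Fields accepted by PartitionInput. Other keys returned by get_partitions
# (e.g. DatabaseName, TableName, CreationTime, CatalogId) must be dropped.
PARTITION_INPUT_KEYS = (
    "Values",
    "LastAccessTime",
    "StorageDescriptor",
    "Parameters",
    "LastAnalyzedTime",
)


def to_partition_input(partition: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the keys that batch_create_partition accepts."""
    return {k: partition[k] for k in PARTITION_INPUT_KEYS if k in partition}


def copy_partitions(source_client, dest_client, database, table,
                    source_catalog_id, dest_catalog_id,
                    batch_size: int = 100) -> List[Dict[str, Any]]:
    """Copy partition metadata between catalogs; return reported errors."""
    paginator = source_client.get_paginator("get_partitions")
    pages = paginator.paginate(
        CatalogId=source_catalog_id, DatabaseName=database, TableName=table
    )
    inputs = [to_partition_input(p) for page in pages for p in page["Partitions"]]

    errors: List[Dict[str, Any]] = []
    # Send the partitions in chunks of at most batch_size entries per call.
    for start in range(0, len(inputs), batch_size):
        response = dest_client.batch_create_partition(
            CatalogId=dest_catalog_id,
            DatabaseName=database,
            TableName=table,
            PartitionInputList=inputs[start:start + batch_size],
        )
        errors.extend(response.get("Errors", []))
    return errors
```

Note that the key stripping happens outside of any hook method, so a hook wrapping these client calls could pass its arguments through unmodified.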
> h2. Proposal
>  * Add a parameter *catalog_id* to AwsGlueCatalogHook, which will then be 
> used in all its methods, regardless of whether the hook is associated with 
> the source or the destination Data Catalog. 
>  * In order not to break the existing implementation, we set 
> *catalog_id=None*. But we add a method *fallback_catalog_id()*, which uses 
> AWS STS to infer the Catalog ID associated with the *aws_conn_id* in use. 
> The obtained value would be used if *catalog_id* hasn't been provided during 
> hook creation.
>  * Extend the available methods of *AwsGlueCatalogHook* in a similar way to 
> the already existing ones, for the convenience of the workflow described 
> above. Note: all new methods should strictly adhere to the AWS Glue Client 
> Request Syntax and do so in a transparent manner. This means that input 
> information shouldn't be modified within a method. When such actions are 
> required, they should be performed outside of the AwsGlueCatalogHook.
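
One possible shape for the *catalog_id* / *fallback_catalog_id()* pair is sketched below. The class name CatalogIdResolver is hypothetical and the STS client is passed in explicitly to keep the example self-contained; in the hook it would presumably be built from *aws_conn_id*. The sketch relies on the documented Glue behaviour that the default Data Catalog ID is the AWS account ID, which STS returns from get_caller_identity().

```python
from typing import Optional


class CatalogIdResolver:
    """Hypothetical helper mirroring the proposed hook behaviour."""

    def __init__(self, sts_client, catalog_id: Optional[str] = None):
        # sts_client is assumed to behave like boto3.client("sts").
        self._sts_client = sts_client
        self._catalog_id = catalog_id

    def fallback_catalog_id(self) -> str:
        # The account ID of the credentials in use doubles as the ID of
        # that account's own Glue Data Catalog.
        return self._sts_client.get_caller_identity()["Account"]

    @property
    def catalog_id(self) -> str:
        # Query STS lazily, and only when no explicit ID was supplied.
        if self._catalog_id is None:
            self._catalog_id = self.fallback_catalog_id()
        return self._catalog_id
```

Because the default stays None and the lookup is lazy, existing callers that never pass *catalog_id* would keep working against their own catalog, at the cost of one STS call.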
> h2. Implementation
>  * I am happy to contribute to airflow if this feature request gets approved.
> h2. Other considerations
>  * At the moment, the existing method *get_partitions* does not provide you 
> with all the meta information about partitions available from the Glue 
> client, whereas *get_table* does. I don't know the best way around it, but 
> imho it should be refactored to *get_partitions_values* or something like 
> that. In this way, we would be able to stay in line with the boto3 Glue client.
>  
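
To illustrate the split suggested in the last point, here is a hedged sketch with plain functions instead of hook methods. The signatures are assumptions made for the example: get_partitions() returns the full partition dictionaries as the Glue client reports them, while get_partitions_values() reduces them to tuples of partition values.

```python
from typing import Any, Dict, List, Set, Tuple


def get_partitions(glue_client, catalog_id: str, database: str, table: str,
                   expression: str = "") -> List[Dict[str, Any]]:
    """Return the complete partition metadata, unmodified."""
    kwargs = {"CatalogId": catalog_id, "DatabaseName": database, "TableName": table}
    if expression:
        # Only send a filter expression when one was actually given.
        kwargs["Expression"] = expression
    paginator = glue_client.get_paginator("get_partitions")
    return [p for page in paginator.paginate(**kwargs) for p in page["Partitions"]]


def get_partitions_values(glue_client, catalog_id: str, database: str, table: str,
                          expression: str = "") -> Set[Tuple[str, ...]]:
    """Return only the partition values, one tuple per partition."""
    partitions = get_partitions(glue_client, catalog_id, database, table, expression)
    return {tuple(p["Values"]) for p in partitions}
```

Building the values-only variant on top of the full one keeps a single place where the client is called.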



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
