[ 
https://issues.apache.org/jira/browse/AIRFLOW-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Kisil updated AIRFLOW-5060:
--------------------------------
    Description: 
h2. Use Case

Imagine that you stream data into an S3 bucket of *account A* and update the AWS 
Glue Data Catalog on a daily basis, so that you can query new data with AWS 
Athena. Now let's assume that you provided access to this S3 bucket to an 
external *account B*, which wants to use its own AWS Athena to query your data 
in exactly the same way. Unfortunately, *account B* would need to have 
exactly the same table definitions in its AWS Glue Data Catalog, because AWS 
Athena cannot run against an external Glue Data Catalog. However, the AWS Glue 
service supports [cross-account datacatalog 
access|https://docs.aws.amazon.com/glue/latest/dg/cross-account-access.html], 
which means that *account B* can simply copy/sync meta information about 
databases, tables, partitions etc. from the Glue Data Catalog of *account A*, 
provided additional permissions have been granted. Thus, all methods in 
*AwsGlueCatalogHook* should use "CatalogId", i.e. the ID of the Data Catalog 
from which to retrieve/create/delete.
h2. How it fits into Airflow

Assume that you have an AWSAthenaOperator which queries data once a day; the 
result is then retrieved, visualised locally and uploaded to some 
server/website. Before this happens, you simply need to create an operator 
(even a PythonOperator would do) which has two hooks, one to the source catalog 
and another to the destination catalog. At run time, it would use the source 
hook to retrieve information from *account A*, for example with 
[get_partitions()|https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_partitions],
 then parse the response, remove unnecessary keys and finally use the 
destination hook to update the *account B* data catalog with 
[batch_create_partition()|https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.batch_create_partition].
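As a rough sketch (the function names below are illustrative, the *glue* argument stands for a boto3 Glue client, e.g. from the hook's *get_conn()*, and the key filtering assumes the Glue PartitionInput request syntax), the callable of such an operator could look like this:

```python
# Keys from a GetPartitions response that are also valid in a PartitionInput
# for batch_create_partition; response-only keys (DatabaseName, TableName,
# CreationTime, CatalogId, ...) must be dropped before re-creating.
PARTITION_INPUT_KEYS = {"Values", "LastAccessTime", "StorageDescriptor", "Parameters"}


def to_partition_input(partition):
    """Strip response-only keys so the dict matches the request syntax."""
    return {k: v for k, v in partition.items() if k in PARTITION_INPUT_KEYS}


def copy_partitions(glue, database, table, source_catalog_id, dest_catalog_id):
    """Sync partition definitions from account A's catalog to account B's.

    With the proposed change, each side would instead be an AwsGlueCatalogHook
    created with its own catalog_id.
    """
    paginator = glue.get_paginator("get_partitions")
    for page in paginator.paginate(
        CatalogId=source_catalog_id, DatabaseName=database, TableName=table
    ):
        inputs = [to_partition_input(p) for p in page["Partitions"]]
        if inputs:
            glue.batch_create_partition(
                CatalogId=dest_catalog_id,
                DatabaseName=database,
                TableName=table,
                PartitionInputList=inputs,
            )
```

Note that the response parsing lives outside the hook, which is exactly why the hook's methods should pass requests through transparently.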

 
h2. Proposal
 * Add a *catalog_id* parameter to AwsGlueCatalogHook, which will then be used 
in all its methods, regardless of whether the hook is associated with the 
source or the destination data catalog.
 * In order not to break the existing implementation, we default to 
*catalog_id=None*, but add a method *fallback_catalog_id()* which uses AWS STS 
to infer the Catalog ID associated with the used *aws_conn_id*. The obtained 
value would be used if *catalog_id* hasn't been provided during hook creation.
 * Extend the available methods of *AwsGlueCatalogHook* in a similar way to the 
already existing ones, for the convenience of the workflow described above. 
Note: all new methods should strictly adhere to the AWS Glue client request 
syntax and do so in a transparent manner. This means that input information 
shouldn't be modified within a method; when such modifications are required, 
they should be performed outside of the AwsGlueCatalogHook.

h2. Implementation
 * I am happy to contribute to airflow if this feature request gets approved.

h2. Other considerations
 * At the moment the existing method *get_partitions* doesn't provide you 
with all the meta information about partitions available from the Glue client, 
whereas *get_table* does. I don't know the best way around it, but imho it 
should be refactored to *get_partitions_values* or something like that. In this 
way, we would be able to stay in line with the boto3 Glue client.
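For example (the response shape follows the boto3 Glue client's get_partitions; the flattening below mirrors what the hook's *get_partitions* returns today, i.e. only the partition values):

```python
def partition_values(partitions):
    """Flatten full partition metadata to the set of value tuples that
    AwsGlueCatalogHook.get_partitions currently returns."""
    return {tuple(p["Values"]) for p in partitions}


# Full metadata as returned by the Glue client's get_partitions ...
glue_response = [
    {"Values": ["2019", "07"], "StorageDescriptor": {"Location": "s3://bucket/2019/07"}},
    {"Values": ["2019", "08"], "StorageDescriptor": {"Location": "s3://bucket/2019/08"}},
]

# ... versus what the hook exposes today: only the values.
assert partition_values(glue_response) == {("2019", "07"), ("2019", "08")}
```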

 

> Add support of CatalogId to AwsGlueCatalogHook
> ----------------------------------------------
>
>                 Key: AIRFLOW-5060
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5060
>             Project: Apache Airflow
>          Issue Type: New Feature
>          Components: hooks
>    Affects Versions: 1.10.3
>            Reporter: Ilya Kisil
>            Assignee: Ilya Kisil
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
