[
https://issues.apache.org/jira/browse/AIRFLOW-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ilya Kisil updated AIRFLOW-5060:
--------------------------------
Description:
h2. Use Case
Imagine that you stream data into an S3 bucket owned by *account A* and update
its AWS Glue Data Catalog daily, so that new data can be queried with AWS
Athena. Now assume that you grant access to this S3 bucket to an external
*account B*, which wants to use its own AWS Athena to query your data in
exactly the same way. Unfortunately, *account B* would need exactly the same
table definitions in its own AWS Glue Data Catalog, because AWS Athena cannot
run against an external Glue Data Catalog. However, the AWS Glue service
supports [cross-account data catalog
access|https://docs.aws.amazon.com/glue/latest/dg/cross-account-access.html],
which means that *account B* can simply copy/sync metadata about databases,
tables, partitions, etc. from the Glue Data Catalog of *account A*, provided
that additional permissions have been granted. Thus, all methods in
*AwsGlueCatalogHook* should accept a "CatalogId", i.e. the ID of the Data
Catalog to retrieve from, create in, or delete from.
h2. How it fits into Airflow
Assume that you have an AWSAthenaOperator that queries data once a day; the
result is then retrieved, visualised locally and uploaded to some
server/website. Before this happens, you simply need an operator (even a
PythonOperator would do) that has two hooks, one for the source catalog and
another for the destination catalog. At run time, it would use the source hook
to retrieve information from *account A*, for example with
[get_partitions()|https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_partitions],
then parse the response and remove unnecessary keys, and finally use the
destination hook to update the *account B* data catalog with
[batch_create_partition()|https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.batch_create_partition].
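The copy step described above could look roughly like the sketch below. It is only an illustration under assumptions: the function names (to_partition_input, chunked, sync_partitions) and the batch size of 100 are hypothetical and not part of Airflow, the two clients are assumed to be boto3 Glue clients for the respective accounts, and the "Errors" list returned by batch_create_partition is not inspected.

```python
from typing import Any, Dict, Iterator, List

# Keys that Glue's PartitionInput structure accepts. Other keys returned by
# get_partitions (DatabaseName, TableName, CreationTime, CatalogId, ...) are
# read-only metadata and are dropped before writing to the destination.
PARTITION_INPUT_KEYS = (
    "Values",
    "LastAccessTime",
    "StorageDescriptor",
    "Parameters",
    "LastAnalyzedTime",
)


def to_partition_input(partition: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the keys that may be sent back as a PartitionInput."""
    return {k: partition[k] for k in PARTITION_INPUT_KEYS if k in partition}


def chunked(items: List[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def sync_partitions(source_client, dest_client, source_catalog_id: str,
                    dest_catalog_id: str, database: str, table: str,
                    batch_size: int = 100) -> int:
    """Copy partition metadata from the source to the destination catalog.

    Both clients are assumed to behave like boto3 Glue clients.
    Returns the number of partitions submitted.
    """
    paginator = source_client.get_paginator("get_partitions")
    pages = paginator.paginate(
        CatalogId=source_catalog_id, DatabaseName=database, TableName=table
    )
    inputs = [
        to_partition_input(p) for page in pages for p in page["Partitions"]
    ]
    for batch in chunked(inputs, batch_size):
        dest_client.batch_create_partition(
            CatalogId=dest_catalog_id,
            DatabaseName=database,
            TableName=table,
            PartitionInputList=batch,
        )
    return len(inputs)
```

Keeping the key filtering in a separate function, outside of any hook method, matches the idea that hook methods pass their input through unmodified.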
h2. Proposal
* Add a parameter *catalog_id* to AwsGlueCatalogHook, which will then be used
in all its methods, regardless of whether the hook is associated with the
source or the destination data catalog.
* In order not to break the existing implementation, we set *catalog_id=None*.
But we add a method *fallback_catalog_id()*, which uses AWS STS to infer the
Catalog ID associated with the *aws_conn_id* in use. The obtained value would
be used if *catalog_id* hasn't been provided during hook creation.
* Extend the available methods of *AwsGlueCatalogHook* in a similar way to the
already existing ones, for convenience of the workflow described above. Note:
all new methods should strictly adhere to the AWS Glue Client Request Syntax
and do so in a transparent manner. This means that input information shouldn't
be modified within a method. When such modifications are required, they should
be performed outside of the AwsGlueCatalogHook.
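A minimal sketch of the proposed fallback, assuming the Catalog ID defaults to the AWS account ID of the caller. The class name CatalogIdResolver and the resolve() method are hypothetical and stand in for whatever the hook would actually do; sts_client is assumed to behave like a boto3 STS client built from the hook's connection.

```python
from typing import Optional


class CatalogIdResolver:
    """Hypothetical stand-in for the catalog_id handling of the hook."""

    def __init__(self, sts_client, catalog_id: Optional[str] = None):
        # sts_client is assumed to expose get_caller_identity(), as a boto3
        # STS client does.
        self._sts_client = sts_client
        self.catalog_id = catalog_id

    def fallback_catalog_id(self) -> str:
        """Infer the Catalog ID from the account of the current credentials."""
        return self._sts_client.get_caller_identity()["Account"]

    def resolve(self) -> str:
        """Return the explicit catalog_id if given, else the STS fallback."""
        if self.catalog_id is not None:
            return self.catalog_id
        return self.fallback_catalog_id()
```

With this shape, existing callers that never pass catalog_id keep working against their own account's catalog, while a cross-account caller passes the ID of *account A* explicitly.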
h2. Implementation
* I am happy to contribute to Airflow if this feature request gets approved.
h2. Other considerations
* At the moment the existing method *get_partitions* does not provide all the
meta information about partitions that is available from the Glue client,
whereas *get_table* does. I don't know the best way around it, but imho it
should be refactored into *get_partitions_values* or something like that. This
way, we would be able to stay in line with the boto3 Glue client.
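As an illustration of the suggested split, a values-only helper could be derived from the full client response along these lines. This is a sketch, not the current hook code: the function takes the name get_partitions_values suggested above, and its input is assumed to be the pages produced by the boto3 Glue get_partitions paginator.

```python
from typing import Any, Dict, Iterable, Set, Tuple


def get_partitions_values(pages: Iterable[Dict[str, Any]]) -> Set[Tuple[str, ...]]:
    """Reduce full get_partitions pages to the set of partition value tuples."""
    return {
        tuple(partition["Values"])
        for page in pages
        for partition in page["Partitions"]
    }
```

A method that returns the unmodified pages could then keep the name *get_partitions*, mirroring the boto3 call.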
> Add support of CatalogId to AwsGlueCatalogHook
> ----------------------------------------------
>
> Key: AIRFLOW-5060
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5060
> Project: Apache Airflow
> Issue Type: New Feature
> Components: hooks
> Affects Versions: 1.10.3
> Reporter: Ilya Kisil
> Assignee: Ilya Kisil
> Priority: Minor
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)