bryan824 opened a new pull request, #23141:
URL: https://github.com/apache/airflow/pull/23141

   Recently, during a migration from 1.10.14 to 2.2.3, I noticed an issue in the 
`BigQueryDeleteTableOperator`. For context, there are two ways to specify a 
table in GCP BigQuery: one with the project_id, like 
`my-project.mydataset.mytable`, and one without, like `mydataset.mytable`.
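
   As a quick illustration, here is a minimal sketch of how the two formats decompose (`parse_table_id` is a hypothetical helper for illustration only, not the actual google-cloud-bigquery parser):

   ```python
   def parse_table_id(table_id, default_project=None):
       """Split a BigQuery table ID into (project, dataset, table).

       Hypothetical helper for illustration; the real parsing lives in
       google-cloud-bigquery's TableReference.from_string.
       """
       parts = table_id.split(".")
       if len(parts) == 3:
           # Fully qualified form: "my-project.mydataset.mytable"
           return tuple(parts)
       if len(parts) == 2:
           # Short form: "mydataset.mytable" needs a default project
           if default_project is None:
               raise ValueError(f"{table_id!r} has no project and no default was given")
           return (default_project, parts[0], parts[1])
       raise ValueError(f"invalid table ID: {table_id!r}")
   ```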
   
   In 1.10.14, I was using the version without project_id, because 
`BigQueryHook` could resolve the table by using `bigquery_conn_id` to fetch the 
`project_id` from the connection configuration.
   
   The path to pass this info is: 
[gcp_api_base_hook#L131](https://github.com/apache/airflow/blob/c743b95a02ba1ec04013635a56ad042ce98823d2/airflow/contrib/hooks/gcp_api_base_hook.py#L131)
 ->  
[gcp_api_base_hook#L200](https://github.com/apache/airflow/blob/c743b95a02ba1ec04013635a56ad042ce98823d2/airflow/contrib/hooks/gcp_api_base_hook.py#L200)
 -> 
[bigquery_hook#L71](https://github.com/apache/airflow/blob/c743b95a02ba1ec04013635a56ad042ce98823d2/airflow/contrib/hooks/bigquery_hook.py#L71)
 -> 
[bigquery_hook#L1498](https://github.com/apache/airflow/blob/c743b95a02ba1ec04013635a56ad042ce98823d2/airflow/contrib/hooks/bigquery_hook.py#L1498).
   
   But after upgrading to 2.2.3, a full `table_id` is required. This is 
unexpected: `bigquery_conn_id`/`gcp_conn_id` is still a valid parameter, so 
`BigQueryDeleteTableOperator` should still be able to get the `project_id` 
automatically from the connection configuration. The line at 
[bigquery#L1195](https://github.com/apache/airflow/blob/eb26510d3a1ccfaa9e4f8e1e0c91b5c74ae7393e/airflow/providers/google/cloud/hooks/bigquery.py#L1195)
 forces users to pass a full `table_id` in order to create a `Table` instance, 
which is the **_root cause_**.
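
   To make the root cause concrete, here is a simplified stand-in (not the real library code) for what `Table.from_string` does at that line: the ID is parsed with no default project supplied, so the short two-part form has nowhere to get a project from and fails.

   ```python
   def table_from_string(table_id):
       """Simplified stand-in for google.cloud.bigquery.Table.from_string.

       No default project is supplied at this call site, so a two-part ID
       like "mydataset.mytable" cannot be resolved and raises.
       """
       parts = table_id.split(".")
       if len(parts) != 3:
           raise ValueError(
               f"{table_id!r} is not a fully-qualified 'project.dataset.table' ID"
           )
       return tuple(parts)
   ```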
   
   The `delete_table` method accepts four table types (`Table`, 
`TableReference`, `TableListItem`, and `str`), as shown in 
[client#L1754](https://github.com/googleapis/python-bigquery/blob/c1d3e3089de1c267f8fb013283289b7d42172c76/google/cloud/bigquery/client.py#L1754).
 Then in 
[client#L1784](https://github.com/googleapis/python-bigquery/blob/c1d3e3089de1c267f8fb013283289b7d42172c76/google/cloud/bigquery/client.py#L1784),
 it converts all four types to a single one, `TableReference`, as shown in 
[table#L2689](https://github.com/googleapis/python-bigquery/blob/c1d3e3089de1c267f8fb013283289b7d42172c76/google/cloud/bigquery/table.py#L2689).
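
   That normalization step can be sketched roughly like this (simplified stand-ins for the real classes; the `Table`/`TableListItem` branches, which just return the argument's `.reference`, are omitted):

   ```python
   class TableReference:
       """Minimal stand-in for google.cloud.bigquery.TableReference."""
       def __init__(self, project, dataset, table):
           self.project, self.dataset, self.table = project, dataset, table

       @classmethod
       def from_string(cls, table_id, default_project=None):
           parts = table_id.split(".")
           if len(parts) == 2 and default_project is not None:
               # A two-part ID is completed with the default project
               parts = [default_project, *parts]
           if len(parts) != 3:
               raise ValueError(f"cannot resolve {table_id!r} to a table reference")
           return cls(*parts)


   def table_arg_to_table_ref(value, default_project=None):
       """Sketch of the conversion: whatever type comes in, a
       TableReference comes out."""
       if isinstance(value, str):
           return TableReference.from_string(value, default_project)
       if isinstance(value, TableReference):
           return value
       raise TypeError(f"unsupported table argument: {type(value).__name__}")
   ```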
   
   So, back to the possible improvement: I wonder if migration would be 
smoother if, instead of using `Table.from_string` to build a `Table`, the `str` 
parameter were passed through directly. That `str` could then be just 
`mydataset.mytable`, with the `project_id` supplied by the `Client` as shown in 
[bigquery#L1194](https://github.com/apache/airflow/blob/8dedd2ac13a6cdc0c363446985f492e0f702f639/airflow/providers/google/cloud/hooks/bigquery.py#L1194).
 Given the 
[GCP version-support plan](https://cloud.google.com/composer/docs/composer-2/composer-versioning-overview#version-support-for-composer-1),
 companies are gradually migrating to Airflow 2.0 for better support. This 
improvement would spare them from adding the `project_id` to the `table_id` in 
hundreds of DAGs, since it is already included in the connection configuration.
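
   The proposal can be sketched as follows (a fake client stands in for `google.cloud.bigquery.Client`; this is a paraphrase of the idea, not a verbatim diff of the hook):

   ```python
   class FakeClient:
       """Minimal stand-in for google.cloud.bigquery.Client, used only to
       illustrate the proposal; not the real API surface."""
       def __init__(self, project):
           self.project = project
           self.deleted = []

       def delete_table(self, table, not_found_ok=False):
           # Like the real client, a two-part str ID is completed with the
           # client's default project before deletion.
           if isinstance(table, str) and table.count(".") == 1:
               table = f"{self.project}.{table}"
           self.deleted.append(table)


   def delete_table_proposed(client, table_id, not_found_ok=True):
       # Proposed hook behavior: hand the str straight to the client
       # instead of wrapping it in Table.from_string(table_id).
       client.delete_table(table_id, not_found_ok=not_found_ok)
   ```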
   
   Below are two scenarios based on the two formats of specifying a BigQuery 
table:
   
   1. A `table_id` like `mydataset.mytable` is passed in 
[bigquery#L1797](https://github.com/apache/airflow/blob/8dedd2ac13a6cdc0c363446985f492e0f702f639/airflow/providers/google/cloud/operators/bigquery.py#L1797)
 and the corresponding `project_id` comes from the connection. This works as 
expected; if no `project_id` is found, the error is raised in 
[_helpers#L825](https://github.com/googleapis/python-bigquery/blob/c1d3e3089de1c267f8fb013283289b7d42172c76/google/cloud/bigquery/_helpers.py#L825).
   2. A `table_id` like `my-project.mydataset.mytable` is passed. In this case, 
regardless of whether a `project_id` is configured in the connection, the 
`project_id` embedded in the `table_id` takes precedence, as shown in 
[_helpers#L836](https://github.com/googleapis/python-bigquery/blob/c1d3e3089de1c267f8fb013283289b7d42172c76/google/cloud/bigquery/_helpers.py#L836).
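
   The precedence in these two scenarios can be sketched as (hypothetical helper, not library code):

   ```python
   def resolve_project(table_id, configured_project=None):
       """Sketch of the precedence above: a project embedded in the table
       ID always wins over the connection-configured one."""
       parts = table_id.split(".")
       if len(parts) == 3:
           return parts[0]            # scenario 2: embedded project wins
       if configured_project is None:
           raise ValueError("no project configured and none in the table ID")
       return configured_project      # scenario 1: fall back to the connection
   ```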
   
   This is my first attempt at submitting a PR to an open-source repo. Please 
let me know how I can improve. It is also fine if this change is not worth 
merging. I enjoyed looking into this.
   
   @kaxil @eladkal @potiuk 

