mlauter opened a new pull request, #60876: URL: https://github.com/apache/airflow/pull/60876
## Description

Add support for `parquetOptions` in GCSToBigQueryOperator `src_fmt_configs`.

My team has many workflows that load Parquet files from GCS into BigQuery using Airflow. By default (without [parquetOptions.enable_list_inference](https://docs.cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.format_options.ParquetOptions#google_cloud_bigquery_format_options_ParquetOptions_enable_list_inference)), Parquet lists are loaded into BigQuery as `STRUCT<list ARRAY<STRUCT<element $TYPE>>>`. This nested struct is cumbersome to query and analyze. With the `enable_list_inference` flag set, the same Parquet list is loaded as `ARRAY<$TYPE>`, which is much easier to work with.

This PR adds support for passing `enableListInference` as one of the options in `src_fmt_configs` when the source format is `PARQUET`. It works for both the external table code path and the BigQuery load code path.

## Testing

- [x] Added unit tests covering new behavior
- [x] Ran all unit tests for the operator
- [ ] Ran GCP system tests
- [x] Created a custom DAG to exercise the behavior and ran it in the breeze environment (see results below)

Without `enableListInference`:

```
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

gcs_to_bq_task = GCSToBigQueryOperator(
    task_id="gcs_to_bigquery_parquet_no_options",
    bucket="<my_bucket>",
    source_objects=["<my_parquet_file_with_list>"],
    destination_project_dataset_table="<my_table>",
    source_format="PARQUET",
    write_disposition="WRITE_TRUNCATE",
)
```

Produces:

<img width="712" height="128" alt="image" src="https://github.com/user-attachments/assets/641597fb-00c3-4082-9fc1-a3db4fbbf151" />

With `enableListInference`:

```
gcs_to_bq_task = GCSToBigQueryOperator(
    task_id="gcs_to_bigquery_parquet_no_options",
    bucket="<my_bucket>",
    source_objects=["<my_parquet_file_with_list>"],
    destination_project_dataset_table="<my_table>",
    source_format="PARQUET",
    write_disposition="WRITE_TRUNCATE",
    # Load Parquet lists as ARRAY<$TYPE> instead of the nested struct.
    src_fmt_configs={"enableListInference": True},
)
```

Produces:

<img width="712" height="46" alt="image" src="https://github.com/user-attachments/assets/78f3b6f3-5ce3-4024-97ee-aa6949b6ef4e" />

The external table case, which I also tested, behaves the same way.

---

##### Was generative AI tooling used to co-author this PR?

- [x] Yes (please specify the tool below)

Generated-by: Claude Sonnet 4.5 following [the guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)

GenAI tooling was used only for code review and discussion; no lines of code in the PR were written by or directly copied from Claude.
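
For reviewers unfamiliar with the BigQuery job configuration, the mapping described in this PR can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the helper name `with_parquet_options` and the `PARQUET_OPTION_KEYS` allow-list are hypothetical, and the sketch assumes the Parquet-specific keys from `src_fmt_configs` end up nested under the `parquetOptions` field of the load configuration.

```python
from copy import deepcopy
from typing import Any

# Hypothetical allow-list for illustration; the operator may accept other keys.
PARQUET_OPTION_KEYS = {"enableListInference", "enumAsString"}


def with_parquet_options(
    load_config: dict[str, Any], src_fmt_configs: dict[str, Any]
) -> dict[str, Any]:
    """Return a copy of load_config with Parquet keys nested under parquetOptions.

    Hypothetical helper: it only illustrates the shape of the resulting
    configuration and is not the operator's actual implementation.
    """
    result = deepcopy(load_config)
    # Parquet options are only meaningful for PARQUET sources.
    if result.get("sourceFormat") != "PARQUET":
        return result
    parquet_options = {
        key: value
        for key, value in src_fmt_configs.items()
        if key in PARQUET_OPTION_KEYS
    }
    if parquet_options:
        result.setdefault("parquetOptions", {}).update(parquet_options)
    return result


load_config = {"sourceFormat": "PARQUET", "writeDisposition": "WRITE_TRUNCATE"}
configured = with_parquet_options(load_config, {"enableListInference": True})
print(configured)
```

Under these assumptions, the resulting configuration carries `"parquetOptions": {"enableListInference": True}` alongside the existing load settings, and a non-Parquet source format is left unchanged.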
