mlauter opened a new pull request, #60876:
URL: https://github.com/apache/airflow/pull/60876

   ## Description
   
   Add support for `parquetOptions` in GCSToBigQueryOperator `src_fmt_configs`.
   
   My team has many workflows that load Parquet files from GCS into 
BigQuery using Airflow. By default (without 
[parquetOptions.enable_list_inference](https://docs.cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.format_options.ParquetOptions#google_cloud_bigquery_format_options_ParquetOptions_enable_list_inference)),
 Parquet lists are loaded into BigQuery as 
`STRUCT<list ARRAY<STRUCT<element $TYPE>>>`. This nested struct is cumbersome to query and analyze.
   
   With the `enable_list_inference` flag, the same Parquet list is loaded 
simply as `ARRAY<$TYPE>`, which is much easier to work with.
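
To make the difference concrete, here is a minimal sketch in plain Python (no BigQuery client involved) of the two row shapes described above. The column name `tags`, the sample values, and the helper `flatten_parquet_list` are hypothetical and chosen only for illustration; the wrapper keys `list` and `element` follow the `STRUCT<list ARRAY<STRUCT<element $TYPE>>>` shape.

```python
from typing import Any


def flatten_parquet_list(wrapped: dict[str, Any]) -> list[Any]:
    """Unwrap {"list": [{"element": v}, ...]} into [v, ...] (hypothetical helper)."""
    return [item["element"] for item in wrapped["list"]]


# Row shape without list inference: STRUCT<list ARRAY<STRUCT<element STRING>>>
row_without_inference = {"tags": {"list": [{"element": "a"}, {"element": "b"}]}}

# Row shape with list inference: ARRAY<STRING>
row_with_inference = {"tags": ["a", "b"]}

# Unwrapping the nested form by hand yields the same values as the inferred form.
flattened = flatten_parquet_list(row_without_inference["tags"])
print(flattened == row_with_inference["tags"])
```

Without the flag, every consumer of the table has to perform this kind of unwrapping in each query.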
   
   This PR adds support for passing `enableListInference` as one of the options 
in `src_fmt_configs` when the source format is `PARQUET`. This works for both 
the external table code path and the BigQuery load job code path.
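
The option handling can be pictured with the sketch below. It is not the operator's actual implementation: `PARQUET_OPTION_KEYS` and `build_parquet_options` are hypothetical names. The only assumption about BigQuery is that its load and external table configurations accept a `parquetOptions` object with `enableListInference` and `enumAsString` fields.

```python
from typing import Any

# Assumed subset of the fields of BigQuery's ParquetOptions object.
PARQUET_OPTION_KEYS = {"enableListInference", "enumAsString"}


def build_parquet_options(
    source_format: str, src_fmt_configs: dict[str, Any]
) -> dict[str, Any]:
    """Nest Parquet-specific keys under "parquetOptions" (hypothetical helper)."""
    if source_format.upper() != "PARQUET":
        return {}
    options = {k: v for k, v in src_fmt_configs.items() if k in PARQUET_OPTION_KEYS}
    return {"parquetOptions": options} if options else {}


# The same fragment could be merged into either a load job configuration
# or an external table configuration.
load_config = {"sourceFormat": "PARQUET", "writeDisposition": "WRITE_TRUNCATE"}
load_config.update(build_parquet_options("PARQUET", {"enableListInference": True}))
print(load_config)
```

Keeping the user-facing key flat in `src_fmt_configs` and nesting it only when the configuration is built matches how the option is passed in the examples in this description.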
   
   ## Testing
   
   - [x] Added unit tests covering new behavior
   - [x] Ran all unit tests for the operator
   - [ ] Ran GCP system tests 
   - [x] Created a custom dag to exercise the behavior and ran it in the breeze 
environment (see results below)
   
   Without `enableListInference`:
   
   ```
       gcs_to_bq_task = GCSToBigQueryOperator(
           task_id="gcs_to_bigquery_parquet_no_options",
           bucket="<my_bucket>",
           source_objects=["<my_parquet_file_with_list>"],
           destination_project_dataset_table="<my_table>",
           source_format="PARQUET",
           write_disposition="WRITE_TRUNCATE",
       )
   ```
   
   Produces:
   <img width="712" height="128" alt="image" 
src="https://github.com/user-attachments/assets/641597fb-00c3-4082-9fc1-a3db4fbbf151"
 />
   
   
   With `enableListInference`:
   
   ```
       gcs_to_bq_task = GCSToBigQueryOperator(
           task_id="gcs_to_bigquery_parquet_no_options",
           bucket="<my_bucket>",
           source_objects=["<my_parquet_file_with_list>"],
           destination_project_dataset_table="<my_table>",
           source_format="PARQUET",
           write_disposition="WRITE_TRUNCATE",
           src_fmt_configs={"enableListInference": True},
       )
   ```
   
   Produces:
   
   <img width="712" height="46" alt="image" 
src="https://github.com/user-attachments/assets/78f3b6f3-5ce3-4024-97ee-aa6949b6ef4e"
 />
   
   The external table case, which I also tested, behaves the same way.
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [x] Yes (please specify the tool below)
   
   Generated-by: Claude Sonnet 4.5 following [the 
guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)
   
   GenAI tooling was used only for code review and discussion; no lines of code 
in the PR were written by or directly copied from Claude.
   
   ---
   
   * Read the **[Pull Request 
Guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#pull-request-guidelines)**
 for more information. Note: commit author/co-author name and email in commits 
become permanently public when merged.
   * For fundamental code changes, an Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals))
 is needed.
   * When adding dependency, check compliance with the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   * For significant user-facing changes create newsfragment: 
`{pr_number}.significant.rst` or `{issue_number}.significant.rst`, in 
[airflow-core/newsfragments](https://github.com/apache/airflow/tree/main/airflow-core/newsfragments).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
