[GitHub] [hudi] nsivabalan commented on a diff in pull request #9710: [HUDI-6857] Update BigQuerySyncTool docs

via GitHub Wed, 13 Sep 2023 10:24:08 -0700


nsivabalan commented on code in PR #9710:
URL: https://github.com/apache/hudi/pull/9710#discussion_r1324838303



##########
website/docs/gcp_bigquery.md:
##########
@@ -5,7 +5,24 @@ summary: Introduce BigQuery integration in Hudi.
 ---
 
 Hudi tables can be queried from [Google Cloud 
BigQuery](https://cloud.google.com/bigquery) as external tables. As of
-now, the Hudi-BigQuery integration only works for hive-style partitioned 
Copy-On-Write tables.
+now, the Hudi-BigQuery integration only works for hive-style partitioned 
Copy-On-Write and Read-Optimized Merge-On-Read tables.
+
+## Sync Modes
+### Manifest File
+As of version 0.14.0, the `BigQuerySyncTool` supports syncing table to 
BigQuery using 
[manifests](https://cloud.google.com/blog/products/data-analytics/bigquery-manifest-file-support-for-open-table-format-queries).
 On the first run, the tool will create a manifest file representing the 
current base files in the table and a table in BigQuery based on the provided 
configurations. The tool produces a new manifest file on each subsequent run 
and will update the schema of the table in BigQuery if the schema changes in 
your Hudi table.
+#### Benefits of using the new manifest approach:
+<ol>
+       <li>Only the files in the manifest can be scanned leading to less cost 
and better performance for your queries</li>
+       <li>The schema is now synced from the Hudi commit metadata allowing for 
proper schema evolution</li>
+       <li>Lists no longer have unnecessary nesting when querying in BigQuery 
as list inference is enabled by default</li>
+       <li>Partition column no longer needs to be dropped from the files due 
to new schema handling improvements</li>
+</ol>
+
+### View Over Files (Legacy)
+After run, the sync tool will create 2 tables and 1 view in the target dataset 
in BigQuery. The tables and the view

Review Comment:
   can you add a line, this is what was supposed pre 0.14.0. and may be out of 
the box behavior w/ 0.14.0 (depending on if manifest file based is enabled by 
default or not)



##########
website/docs/gcp_bigquery.md:
##########
@@ -5,7 +5,24 @@ summary: Introduce BigQuery integration in Hudi.
 ---
 
 Hudi tables can be queried from [Google Cloud 
BigQuery](https://cloud.google.com/bigquery) as external tables. As of
-now, the Hudi-BigQuery integration only works for hive-style partitioned 
Copy-On-Write tables.
+now, the Hudi-BigQuery integration only works for hive-style partitioned 
Copy-On-Write and Read-Optimized Merge-On-Read tables.
+
+## Sync Modes
+### Manifest File
+As of version 0.14.0, the `BigQuerySyncTool` supports syncing table to 
BigQuery using 
[manifests](https://cloud.google.com/blog/products/data-analytics/bigquery-manifest-file-support-for-open-table-format-queries).
 On the first run, the tool will create a manifest file representing the 
current base files in the table and a table in BigQuery based on the provided 
configurations. The tool produces a new manifest file on each subsequent run 
and will update the schema of the table in BigQuery if the schema changes in 
your Hudi table.

Review Comment:
   can you call out if this is the default behavior. or what config needs to be 
enabled to go this path



##########
website/docs/gcp_bigquery.md:
##########
@@ -17,18 +34,22 @@ setting sync tool class. A few BigQuery-specific 
configurations are required.
 | `hoodie.gcp.bigquery.sync.project_id`        | The target Google Cloud 
project                                                                         
        |
 | `hoodie.gcp.bigquery.sync.dataset_name`      | BigQuery dataset name; create 
before running the sync tool                                                    
  |
 | `hoodie.gcp.bigquery.sync.dataset_location`  | Region info of the dataset; 
same as the GCS bucket that stores the Hudi table                               
    |
-| `hoodie.gcp.bigquery.sync.source_uri`        | A wildcard path pattern 
pointing to the first level partition; make sure to include the partition key.  
        |
+| `hoodie.gcp.bigquery.sync.source_uri`        | A wildcard path pattern 
pointing to the first level partition; partition key can be specified or 
auto-inferred. Only required for partitioned tables |
 | `hoodie.gcp.bigquery.sync.source_uri_prefix` | The common prefix of the 
`source_uri`, usually it's the path to the Hudi table, trailing slash does not 
matter. |
 | `hoodie.gcp.bigquery.sync.base_path`         | The usual basepath config for 
Hudi table.                                                                     
  |
+| `hoodie.gcp.bigquery.sync.use_bq_manifest_file` | Set to true to enable the 
manifest based sync                                                             
   |
 
 Refer to `org.apache.hudi.gcp.bigquery.BigQuerySyncConfig` for the complete 
configuration list.
-
-In addition to the BigQuery-specific configs, set the following Hudi configs 
to write the Hudi table in the desired way.
+### Partition Handling
+In addition to the BigQuery-specific configs, you will need to use hive style 
partitioning for partition pruning in BigQuery. On top of that, the value in 
partition path will be the value returned for that field in your query. For 
example if you partition on a time-millis field, `time`, with an output format 
of `time=yyyy-MM-dd`, the query will return `time` values with day level 
granularity instead of the original milliseconds so keep this in mind while 
setting up your tables.
 
 ```
 hoodie.datasource.write.hive_style_partitioning = 'true'
+```
+
+For the view based sync you must also specify the following for partitioned 
tables to drop the value from your base file:
+```
 hoodie.datasource.write.drop.partition.columns  = 'true'
-hoodie.partition.metafile.use.base.format       = 'true'

Review Comment:
   yeah. each hudi partition has a metadata file called 
.hoodie_partition_metadata. this has to be in parquet format just for BQ sync 
purpose. for other query engines or metastores, we did not have this 
requirement. 
   So, this meta file will be just a text file in general. but if user prefers 
to sync to BQ, the recommendation is to enable this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] nsivabalan commented on a diff in pull request #9710: [HUDI-6857] Update BigQuerySyncTool docs

Reply via email to