This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 4057f918f4 [HUDI-3927][DOCS] Add bigquery integration docs (#5444)
4057f918f4 is described below
commit 4057f918f43942ab3917865ba4b439cb5fb19a5c
Author: Raymond Xu <[email protected]>
AuthorDate: Wed Apr 27 04:42:58 2022 -0700
[HUDI-3927][DOCS] Add bigquery integration docs (#5444)
---
website/docs/gcp_bigquery.md | 80 ++++++++++++++++++++++++++++++++++++++++++++
website/sidebars.js | 1 +
2 files changed, 81 insertions(+)
diff --git a/website/docs/gcp_bigquery.md b/website/docs/gcp_bigquery.md
new file mode 100644
index 0000000000..93e4505f76
--- /dev/null
+++ b/website/docs/gcp_bigquery.md
@@ -0,0 +1,80 @@
+---
+title: Google Cloud BigQuery
+keywords: [ hudi, gcp, bigquery ]
+summary: Introduce BigQuery integration in Hudi.
+---
+
+Hudi tables can be queried from [Google Cloud BigQuery](https://cloud.google.com/bigquery) as external tables. As of
+now, the Hudi-BigQuery integration only works for hive-style partitioned Copy-On-Write tables.
+
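+A hive-style partitioned table lays out partition directories as `key=value` segments in the path. A minimal sketch of
+such a layout on GCS (the bucket and partition values below are hypothetical):
+
+```
+gs://my-hoodie-table/path/year=2022/month=04/day=27/<data files>
+gs://my-hoodie-table/path/year=2022/month=04/day=28/<data files>
+```
+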
+## Configurations
+
+Hudi uses `org.apache.hudi.gcp.bigquery.BigQuerySyncTool` to sync tables. It works with `HoodieDeltaStreamer` by
+setting the sync tool class. A few BigQuery-specific configurations are required.
+
+| Config                                        | Notes                                                                                                       |
+|:----------------------------------------------|:------------------------------------------------------------------------------------------------------------|
+| `hoodie.gcp.bigquery.sync.project_id`         | The target Google Cloud project.                                                                             |
+| `hoodie.gcp.bigquery.sync.dataset_name`       | Name of the BigQuery dataset; create it before running the sync tool.                                        |
+| `hoodie.gcp.bigquery.sync.dataset_location`   | Region of the dataset; must match the region of the GCS bucket that stores the Hudi table.                   |
+| `hoodie.gcp.bigquery.sync.source_uri`         | A wildcard path pattern pointing to the first-level partition; make sure to include the partition key.       |
+| `hoodie.gcp.bigquery.sync.source_uri_prefix`  | The common prefix of the `source_uri`, usually the path to the Hudi table; a trailing slash does not matter. |
+| `hoodie.gcp.bigquery.sync.base_path`          | The usual base path config for the Hudi table.                                                               |
+
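+For instance, for a table stored at `gs://my-hoodie-table/path` and partitioned by `year`, the three path-related
+configs would relate to each other as follows (values are hypothetical):
+
+```
+hoodie.gcp.bigquery.sync.base_path = gs://my-hoodie-table/path
+hoodie.gcp.bigquery.sync.source_uri = gs://my-hoodie-table/path/year=*
+hoodie.gcp.bigquery.sync.source_uri_prefix = gs://my-hoodie-table/path/
+```
+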
+Refer to `org.apache.hudi.gcp.bigquery.BigQuerySyncConfig` for the complete configuration list.
+
+In addition to the BigQuery-specific configs, set the following Hudi configs so that the table is written in a
+BigQuery-compatible layout.
+
+```
+hoodie.datasource.write.hive_style_partitioning = 'true'
+hoodie.datasource.write.drop.partition.columns = 'true'
+hoodie.partition.metafile.use.base.format = 'true'
+```
+
+## Example
+
+Below is an example of running `BigQuerySyncTool` with `HoodieDeltaStreamer`.
+
+```shell
+spark-submit --master yarn \
+--packages com.google.cloud:google-cloud-bigquery:2.10.4 \
+--jars /opt/hudi-gcp-bundle-0.11.0.jar \
+--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
+/opt/hudi-utilities-bundle_2.12-0.11.0.jar \
+--target-base-path gs://my-hoodie-table/path \
+--target-table mytable \
+--table-type COPY_ON_WRITE \
+--base-file-format PARQUET \
+# ... other deltastreamer options
+--enable-sync \
+--sync-tool-classes org.apache.hudi.gcp.bigquery.BigQuerySyncTool \
+--hoodie-conf hoodie.deltastreamer.source.dfs.root=gs://my-source-data/path \
+--hoodie-conf hoodie.gcp.bigquery.sync.project_id=hudi-bq \
+--hoodie-conf hoodie.gcp.bigquery.sync.dataset_name=rxusandbox \
+--hoodie-conf hoodie.gcp.bigquery.sync.dataset_location=asia-southeast1 \
+--hoodie-conf hoodie.gcp.bigquery.sync.table_name=mytable \
+--hoodie-conf hoodie.gcp.bigquery.sync.base_path=gs://my-hoodie-table/path \
+--hoodie-conf hoodie.gcp.bigquery.sync.partition_fields=year,month,day \
+--hoodie-conf hoodie.gcp.bigquery.sync.source_uri=gs://my-hoodie-table/path/year=* \
+--hoodie-conf hoodie.gcp.bigquery.sync.source_uri_prefix=gs://my-hoodie-table/path/ \
+--hoodie-conf hoodie.gcp.bigquery.sync.use_file_listing_from_metadata=true \
+--hoodie-conf hoodie.gcp.bigquery.sync.assume_date_partitioning=false \
+--hoodie-conf hoodie.datasource.hive_sync.mode=jdbc \
+--hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:10000 \
+--hoodie-conf hoodie.datasource.hive_sync.skip_ro_suffix=true \
+--hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=false \
+--hoodie-conf hoodie.datasource.hive_sync.database=mydataset \
+--hoodie-conf hoodie.datasource.hive_sync.table=mytable \
+--hoodie-conf hoodie.datasource.write.recordkey.field=mykey \
+--hoodie-conf hoodie.datasource.write.partitionpath.field=year,month,day \
+--hoodie-conf hoodie.datasource.write.precombine.field=ts \
+--hoodie-conf hoodie.datasource.write.keygenerator.type=COMPLEX \
+--hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
+--hoodie-conf hoodie.datasource.write.drop.partition.columns=true \
+--hoodie-conf hoodie.partition.metafile.use.base.format=true \
+--hoodie-conf hoodie.metadata.enable=true
+```
+
+After a successful run, the sync tool creates 2 tables and 1 view in the target BigQuery dataset. The tables and the
+view share the same name prefix, which is taken from the Hudi table name. Query the view to get the same results as
+querying the Copy-on-Write Hudi table.
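+
+To quickly verify the sync from the command line, query the view with the `bq` CLI. This is a sketch: the dataset and
+table names follow the example above, and the view is assumed to be named after the Hudi table.
+
+```shell
+# Query the synced view with standard SQL; it reflects the latest snapshot of the Hudi table.
+bq query --use_legacy_sql=false \
+  'SELECT * FROM rxusandbox.mytable LIMIT 10'
+```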
diff --git a/website/sidebars.js b/website/sidebars.js
index a7be23c8b0..c970c1ac5c 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -48,6 +48,7 @@ module.exports = {
'writing_data',
'hoodie_deltastreamer',
'querying_data',
+ 'gcp_bigquery',
'flink_configuration',
{
type: 'category',