[
https://issues.apache.org/jira/browse/HUDI-6333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Danny Chen closed HUDI-6333.
----------------------------
Fix Version/s: 0.14.0
Resolution: Fixed
Fixed via master branch: 5fd9263e21bc15dcf6ed97722975876ad36dd08a
> allow using the manifest file with absolute path to directly create one
> bigquery external table over the Hudi table
> -------------------------------------------------------------------------------------------------------------------
>
> Key: HUDI-6333
> URL: https://issues.apache.org/jira/browse/HUDI-6333
> Project: Apache Hudi
> Issue Type: Improvement
> Components: meta-sync
> Reporter: Jinpeng Zhou
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.14.0
>
>
> To query a Hudi table from BigQuery, the current BigQuerySyncTool creates two
> BigQuery external tables: one over the data files and the other over a
> manifest file that lists the data file names. On top of these two tables, it
> creates a view that reflects the latest version of the data using the
> following query: "SELECT * FROM data_table WHERE _hoodie_file_name IN (
> SELECT filename FROM manifest_file_table)".
> The direct reason for this workaround is that BigQuery did not support
> manifest files. However, BigQuery is rolling out manifest file support,
> allowing users to specify a manifest file as the source URIs. Right now the
> feature[1] roll-out seems to cover only non-partitioned external tables
> (using hive partitioning returns the error "file_set_spec_type option is not
> supported for hive partition"); support for partitioned external tables is
> expected to follow soon.
> Given this new BigQuery feature, it would be better to update
> BigQuerySyncTool accordingly:
> * Allow creating a BigQuery-compatible manifest file, which expects the
> absolute paths of the data files. This has been done in HUDI-6254.
> * Allow using the new manifest file to create the external table directly.
> This can be done by issuing a single "CREATE EXTERNAL TABLE" query to
> BigQuery.
> * Avoid breaking existing user workflows. Since some users may rely on the
> view-based workaround, it probably makes sense to keep the workaround alive,
> at least for now. That would require maintaining two versions of the
> manifest file.
> * Provide a temporary workaround for using BigQuery manifest file support
> until the feature extends to partitioned tables. Since it currently does not
> support hive partitioning, "CREATE EXTERNAL TABLE" can only create a table
> over all the parquet data files, without recognizing the partition columns.
> To keep the partition columns, a possible workaround is to set
> "hoodie.datasource.write.drop.partition.columns" to false and allow users to
> leave "hoodie.gcp.bigquery.sync.source_uri_prefix" unspecified, so that the
> partition columns are written into the parquet files and the
> BigQuerySyncTool does not try to create a hive-partitioned external table.
> [1] https://cloud.google.com/bigquery/docs/information-schema-table-options#options_table
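
For illustration only, the manifest-based approach described above could be
sketched roughly as follows. This is not the BigQuerySyncTool implementation:
the function names, bucket, and table identifiers are hypothetical, and the
sketch assumes the manifest is a newline-delimited list of absolute data file
URIs that BigQuery reads when the table is created with
file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST'.

```python
from typing import Iterable


def build_manifest(base_path: str, relative_files: Iterable[str]) -> str:
    """Return manifest content: one absolute data file URI per line.

    base_path and relative_files are hypothetical inputs; a real sync tool
    would take them from the latest snapshot of the Hudi table.
    """
    base = base_path.rstrip("/")
    return "\n".join(f"{base}/{name.lstrip('/')}" for name in relative_files)


def build_create_external_table_sql(table: str, manifest_uri: str) -> str:
    """Build a single CREATE EXTERNAL TABLE statement over the manifest file.

    Assumption: the manifest is passed as the only entry in `uris`, and
    file_set_spec_type tells BigQuery to treat it as a manifest rather than
    as a data file. No hive partitioning options are set, matching the
    non-partitioned case described in the issue.
    """
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        "OPTIONS (\n"
        "  format = 'PARQUET',\n"
        f"  uris = ['{manifest_uri}'],\n"
        "  file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST'\n"
        ")"
    )


if __name__ == "__main__":
    # Hypothetical bucket, table path, and file names.
    print(build_manifest("gs://my-bucket/hudi/trips",
                         ["dt=2023-06-01/a.parquet", "dt=2023-06-02/b.parquet"]))
    print(build_create_external_table_sql(
        "my_project.my_dataset.trips",
        "gs://my-bucket/hudi/trips/.hoodie/manifest/latest-snapshot.csv"))
```

The sketch only builds strings; submitting the statement to BigQuery and
writing the manifest to storage are left out, as is the view-based path that
existing users may still depend on.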
--
This message was sent by Atlassian Jira
(v8.20.10#820010)