Jinpeng Zhou created HUDI-6333:
----------------------------------
Summary: allow using the manifest file with absolute path to
directly create one bigquery external table over the Hudi table
Key: HUDI-6333
URL: https://issues.apache.org/jira/browse/HUDI-6333
Project: Apache Hudi
Issue Type: Improvement
Components: meta-sync
Reporter: Jinpeng Zhou
To query a Hudi table from BigQuery, the current BigQuerySyncTool creates two
BigQuery external tables: one over the data files and the other over a manifest
file that lists the data file names. On top of these two tables, it creates a
view that reflects the latest version of the data using the following query:
"SELECT * FROM data_table WHERE _hoodie_file_name IN ( SELECT filename FROM
manifest_file_table)".
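
For illustration, the view-based approach described above can be sketched as a
small Python helper that assembles the view statement. This is only a sketch:
the function name and the table/view identifiers are hypothetical, and wrapping
the query in "CREATE OR REPLACE VIEW" is an assumption about how the view might
be issued as SQL, not a description of what BigQuerySyncTool actually does.

```python
def build_view_statement(view, data_table, manifest_table):
    """Assemble the view DDL for the two-table workaround.

    All three identifiers are hypothetical, fully qualified BigQuery names
    (for example "project.dataset.table").
    """
    # The inner SELECT is the query quoted in the issue description:
    # keep only rows whose source file is listed in the manifest table.
    select = (
        f"SELECT * FROM `{data_table}` "
        f"WHERE _hoodie_file_name IN "
        f"(SELECT filename FROM `{manifest_table}`)"
    )
    # Assumption: the view is created through a SQL DDL statement.
    return f"CREATE OR REPLACE VIEW `{view}` AS {select}"


if __name__ == "__main__":
    print(build_view_statement(
        "proj.ds.trips",            # hypothetical view name
        "proj.ds.trips_versions",   # hypothetical data-file table
        "proj.ds.trips_manifest",   # hypothetical manifest table
    ))
```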
The direct reason for this workaround is that BigQuery has not supported
manifest files. However, BigQuery is now rolling out manifest file support,
allowing users to specify a manifest file as the source URIs. Right now the
feature[1] roll-out seems to cover only non-partitioned external tables (using
hive partitioning returns the error "file_set_spec_type option is not supported
for hive partition"); it should cover partitioned external tables soon.
Given this new BigQuery feature, it would be better to update BigQuerySyncTool
accordingly:
* Allow creating a BigQuery-compatible manifest file, which expects absolute
paths to the data files. This has been done in HUDI-6254.
* Allow using the new manifest file to create the external table directly. This
can be done by issuing one "CREATE EXTERNAL TABLE" query to BigQuery.
* Avoid breaking existing user workflows. In case some users rely on the
view-based workaround, it probably makes sense to keep the workaround alive, at
least for now. That would require maintaining two versions of the manifest
file.
* Provide a temporary workaround for using BigQuery manifest file support until
this feature extends to partitioned tables. Since it currently does not support
hive partitioning, "CREATE EXTERNAL TABLE" can only create a table over all the
parquet data files without recognizing the partition columns. To keep the
partition columns, a possible workaround is to set
"hoodie.datasource.write.drop.partition.columns" to false and allow users to
leave "hoodie.gcp.bigquery.sync.source_uri_prefix" unspecified, so that the
partition columns are written into the parquet files and the BigQuerySyncTool
does not try to create a hive-partitioned external table.
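
As a rough sketch of the first two items above, the Python below builds a
manifest body of absolute paths and the single "CREATE EXTERNAL TABLE"
statement that points at it. The helper names, bucket, and table names are
hypothetical. The "file_set_spec_type", "uris" and "format" options are taken
from the BigQuery table options referenced in [1]; the value
'NEW_LINE_DELIMITED_MANIFEST' and the exact statement shape should be checked
against the current BigQuery documentation before use.

```python
def to_absolute_manifest(base_path, relative_files):
    """Return manifest content: one absolute data file URI per line.

    base_path is the (hypothetical) table base path, e.g. "gs://bucket/tbl".
    relative_files are data file paths relative to that base path.
    """
    base = base_path.rstrip("/")
    return "\n".join(f"{base}/{name.lstrip('/')}" for name in relative_files)


def build_external_table_ddl(table, manifest_uri):
    """Build one CREATE EXTERNAL TABLE statement over a manifest file."""
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        "OPTIONS (\n"
        "  format = 'PARQUET',\n"
        # Tells BigQuery that the URI is a manifest listing the data files,
        # not a data file itself.
        "  file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST',\n"
        f"  uris = ['{manifest_uri}']\n"
        ")"
    )


if __name__ == "__main__":
    # Hypothetical file names; partition values stay readable only if the
    # partition columns are also written into the parquet files.
    files = ["year=2023/a.parquet", "year=2024/b.parquet"]
    print(to_absolute_manifest("gs://my-bucket/trips/", files))
    print(build_external_table_ddl(
        "proj.ds.trips",
        "gs://my-bucket/trips/.hoodie/manifest/latest-snapshot.csv",  # hypothetical location
    ))
```

No hive partitioning options appear in the statement, which matches the
temporary workaround in the last item: with
"hoodie.datasource.write.drop.partition.columns" set to false, the partition
columns are read from the parquet files themselves.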
[1] https://cloud.google.com/bigquery/docs/information-schema-table-options#options_table
--
This message was sent by Atlassian Jira
(v8.20.10#820010)