Jinpeng Zhou created HUDI-6333:
----------------------------------

             Summary: allow using the manifest file with absolute path to 
directly create one bigquery external table over the Hudi table
                 Key: HUDI-6333
                 URL: https://issues.apache.org/jira/browse/HUDI-6333
             Project: Apache Hudi
          Issue Type: Improvement
          Components: meta-sync
            Reporter: Jinpeng Zhou


To query a Hudi table from BigQuery, the current BigQuerySyncTool creates two 
BigQuery external tables: one over the data files and the other over a manifest 
file that lists the data file names. On top of these two tables, it creates a 
view that reflects the latest version of the data using the following query: 
"SELECT * FROM data_table WHERE _hoodie_file_name IN ( SELECT filename FROM 
manifest_file_table)".
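
As a rough illustration of the view-based approach described above, the 
following Python sketch assembles the view query from the two external table 
names. The function name is hypothetical and is not part of BigQuerySyncTool; 
only the shape of the SELECT statement comes from the description above.

```python
def build_latest_snapshot_query(data_table: str, manifest_table: str) -> str:
    """Build the SELECT that keeps only rows from files listed in the manifest.

    Hypothetical helper for illustration; the table names are supplied by the
    caller and are not validated or escaped here.
    """
    return (
        f"SELECT * FROM {data_table} "
        f"WHERE _hoodie_file_name IN (SELECT filename FROM {manifest_table})"
    )


if __name__ == "__main__":
    # Uses the placeholder table names from the issue description.
    print(build_latest_snapshot_query("data_table", "manifest_file_table"))
```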

The direct reason for this workaround is that BigQuery did not support 
manifest files. However, BigQuery is rolling out manifest file support, 
allowing users to specify a manifest file in the source URIs. Right now the 
feature[1] roll-out seems to cover only non-partitioned external tables (using 
hive partitioning returns the error "file_set_spec_type option is not supported 
for hive partition"), but it should cover partitioned external tables soon.

Given this new BigQuery feature, it would be better to update BigQuerySyncTool 
accordingly:
 * Allow creating a BigQuery-compatible manifest file, which expects the 
absolute paths of the data files. This has been done in HUDI-6254.
 * Allow using the new manifest file to create the external table directly. 
This can be done by issuing one "CREATE EXTERNAL TABLE" query to BigQuery.
 * Avoid breaking existing user workflows. In case some users rely on the 
view-based workaround, it probably makes sense to keep the workaround alive, at 
least for now. That would require maintaining two versions of the manifest 
file.
 * Provide a temporary workaround for using BigQuery manifest file support 
until the feature extends to partitioned tables. Since it currently does not 
support hive partitioning, "CREATE EXTERNAL TABLE" can only create a table over 
all the parquet data files without recognizing the partition columns. To keep 
the partition columns, a possible workaround is to set 
"hoodie.datasource.write.drop.partition.columns" to false and allow users to 
omit "hoodie.gcp.bigquery.sync.source_uri_prefix", so that the partition 
columns are written into the parquet files and the BigQuerySyncTool does not 
try to create a hive-partitioned external table.
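
The proposal above could be sketched roughly as follows in Python. This is an 
illustration under stated assumptions, not the BigQuerySyncTool implementation: 
all function names are hypothetical, and the option value 
'NEW_LINE_DELIMITED_MANIFEST' for file_set_spec_type is what the BigQuery 
documentation appears to use and should be checked against [1]. The sketch 
turns relative data file names into absolute URIs for the manifest, builds a 
single "CREATE EXTERNAL TABLE" statement over that manifest, and falls back to 
the view-based approach whenever a source URI prefix (hive partitioning) is 
configured.

```python
from typing import Iterable, Optional


def to_absolute_manifest(base_path: str, relative_files: Iterable[str]) -> str:
    """Join each relative data file name onto the table base path.

    Returns the manifest content, one absolute URI per line.
    """
    base = base_path.rstrip("/")
    return "\n".join(f"{base}/{name.lstrip('/')}" for name in relative_files)


def build_create_external_table(table: str, manifest_uri: str) -> str:
    """Build one DDL statement that points the external table at the manifest.

    Assumption: 'NEW_LINE_DELIMITED_MANIFEST' is the file_set_spec_type value
    that tells BigQuery the URI is a manifest rather than a data file.
    """
    return (
        f"CREATE EXTERNAL TABLE `{table}` "
        "OPTIONS (format = 'PARQUET', "
        "file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST', "
        f"uris = ['{manifest_uri}'])"
    )


def can_use_manifest_table(source_uri_prefix: Optional[str]) -> bool:
    """Manifest-based tables are only attempted when no hive partition
    prefix (hoodie.gcp.bigquery.sync.source_uri_prefix) is configured."""
    return not source_uri_prefix
```

With this shape, a table written with 
"hoodie.datasource.write.drop.partition.columns" set to false and no source URI 
prefix would take the manifest path, while existing hive-partitioned setups 
would keep the two-table view.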

[1]https://cloud.google.com/bigquery/docs/information-schema-table-options#options_table



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
