[GitHub] [hudi] jp0317 opened a new pull request, #8898: [HUDI-6333] allow using manifest file directly to create a bigquery external table

via GitHub Wed, 07 Jun 2023 11:31:15 -0700


jp0317 opened a new pull request, #8898:
URL: https://github.com/apache/hudi/pull/8898


   ### Change Logs
   
   To query a Hudi table from BigQuery, the current BigQuerySyncTool creates two 
BigQuery external tables: one over the data files and one over a manifest file 
that lists the data file names. On top of these two tables, it creates a view 
that reflects the latest version of the data using the following query: "SELECT * 
FROM data_table WHERE _hoodie_file_name IN ( SELECT filename FROM 
manifest_file_table)".
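
To make the view-based workaround concrete, here is a minimal Python sketch. The function names (`build_view_query`, `latest_rows`) and the example table names are hypothetical and are not part of BigQuerySyncTool; only the query shape and the `_hoodie_file_name` / `filename` columns come from the description above.

```python
def build_view_query(data_table: str, manifest_table: str) -> str:
    """Fill the table names into the view query quoted in the description."""
    return (
        f"SELECT * FROM {data_table} "
        f"WHERE _hoodie_file_name IN (SELECT filename FROM {manifest_table})"
    )


def latest_rows(rows, manifest_filenames):
    """In-memory analogue of the view: keep only rows whose source file
    is listed in the manifest (i.e. belongs to the latest snapshot)."""
    live_files = set(manifest_filenames)
    return [row for row in rows if row["_hoodie_file_name"] in live_files]
```

Rows coming from files that are no longer listed in the manifest (for example, older file versions) are filtered out, which is what the view achieves at query time.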
   
   This workaround exists because BigQuery did not support manifest files. 
However, BigQuery is now rolling out manifest file support[1]. This feature 
allows users to use manifest files rather than data files as source URIs. Right 
now the roll-out seems to cover only non-partitioned external tables (using hive 
partitioning returns the error "not supported for hive partition"); it should 
cover partitioned external tables soon.
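
As a rough illustration of using a manifest file as the source URI, the sketch below builds a DDL statement in Python. It assumes the `file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST'` table option as described in the BigQuery documentation; the exact statement issued by this PR is not shown here, and the helper name, table name, and URI are hypothetical placeholders.

```python
def build_manifest_table_ddl(table: str, manifest_uri: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement whose source URI is a manifest.

    Assumption: BigQuery interprets the URI as a newline-delimited manifest
    when file_set_spec_type is set as below. This is a sketch, not the
    statement that BigQuerySyncTool actually generates.
    """
    return (
        f"CREATE EXTERNAL TABLE `{table}` "
        "OPTIONS (format = 'PARQUET', "
        f"uris = ['{manifest_uri}'], "
        "file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST')"
    )


# Hypothetical usage; the dataset, table, and bucket names are made up.
ddl = build_manifest_table_ddl(
    "my_dataset.my_hudi_table",
    "gs://my-bucket/my_table/manifest/latest-snapshot.csv",
)
```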
   
   Given this new BigQuery feature, BigQuerySyncTool should be updated 
accordingly:
   
   1. Allow creating a BigQuery-compatible manifest file, which expects absolute 
paths of data files. This has been done in 
[HUDI-6254](https://issues.apache.org/jira/browse/HUDI-6254).
   2. Allow using the new manifest file to create the external table directly. 
This can be done by issuing one "CREATE EXTERNAL TABLE" query to BigQuery.
   3. Avoid breaking existing user workflows. In case some users rely on the 
view-based workaround, it probably makes sense to keep the workaround alive, at 
least for now. That requires maintaining two versions of the manifest file.
   4. Provide a temporary workaround for using BigQuery manifest file support 
until the feature extends to partitioned tables. Because a hive-partitioned 
external table cannot yet be created from a manifest file, the partition columns 
cannot be recognized that way, and a non-partitioned external table only has the 
columns stored in the parquet data files. To keep the partition columns, the 
workaround is to set "hoodie.datasource.write.drop.partition.columns" to false 
and allow users to leave "hoodie.gcp.bigquery.sync.source_uri_prefix" 
unspecified, so that the partition columns are written into the parquet files 
and BigQuerySyncTool creates a non-partitioned external table. Querying this 
external table produces the same results as querying the aforementioned view.
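
Items 1 and 4 can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names and the bucket paths are hypothetical, the one-absolute-URI-per-line manifest layout is an assumption about the BigQuery-compatible format, and only the two config keys are taken from the description above.

```python
from typing import Dict, Iterable

# Config keys quoted in the description above.
SOURCE_URI_PREFIX_KEY = "hoodie.gcp.bigquery.sync.source_uri_prefix"
DROP_PARTITION_COLUMNS_KEY = "hoodie.datasource.write.drop.partition.columns"


def build_manifest(base_path: str, relative_files: Iterable[str]) -> str:
    """Item 1 (sketch): list data files by absolute path, one per line.
    The one-URI-per-line layout is an assumption for illustration."""
    base = base_path.rstrip("/")
    return "\n".join(f"{base}/{name.lstrip('/')}" for name in relative_files)


def creates_partitioned_table(sync_props: Dict[str, str]) -> bool:
    """Item 4 (sketch): with no source URI prefix configured, the sync
    falls back to a non-partitioned external table."""
    return bool(sync_props.get(SOURCE_URI_PREFIX_KEY))


# Writer-side setting for the workaround: keep partition columns in parquet.
writer_options = {DROP_PARTITION_COLUMNS_KEY: "false"}
```

With `writer_options` applied on the write path and the source URI prefix left unset on the sync path, the partition columns live in the parquet files themselves, so the non-partitioned external table still exposes them.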
   
   
[1]https://cloud.google.com/bigquery/docs/information-schema-table-options#options_table
   
   ### Impact
   
   A more efficient way to query Hudi tables from BigQuery.
   
   
   ### Risk level (write none, low medium or high below)
   
   None. The existing view-based approach will still work.
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
     ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
     changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


