xushiyan commented on pull request #5125:
URL: https://github.com/apache/hudi/pull/5125#issuecomment-1086715332
## Test setup
- Launch Dataproc 2.0.34-ubuntu18
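A minimal sketch of spinning up a matching cluster with `gcloud`; the cluster name and region are placeholders, only the image version comes from this setup:
```shell
# Hypothetical cluster name and region; image version matches the one tested here
gcloud dataproc clusters create hudi-bq-sync-test \
  --region=us-central1 \
  --image-version=2.0.34-ubuntu18
```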
- From the Dataproc instance, launch spark-shell
```shell
spark-shell \
--jars gs://xxx/hudi-spark3.1-bundle_2.12-0.11.0-SNAPSHOT.jar \
--packages org.apache.spark:spark-avro_2.12:3.1.2 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
```
- Prepare a partitioned table on GS (a quick listing check follows the snippet)
```scala
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
val tableName = "hudi_cow_pt_tbl"
spark.sql(
s"""
|create table $tableName (
| id bigint,
| name string,
| ts bigint,
| dt string
|) using hudi
|tblproperties (
| type = 'cow',
| primaryKey = 'id',
| preCombineField = 'ts',
| hoodie.datasource.write.hive_style_partitioning = 'true',
| hoodie.datasource.write.drop.partition.columns = 'true',
| hoodie.metadata.enable = 'false'
| )
|partitioned by (dt)
|location 'gs://foo/bar'
|""".stripMargin)
spark.sql(
s"""
|insert into $tableName partition (dt)
|select 1 as id, 'a1' as name, 1000 as ts, '2021-12-09' as dt
|""".stripMargin)
```
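Optionally sanity-check the write with a listing; the path is the one used above, and the exact file names will differ:
```shell
# Expect a .hoodie/ metadata directory and a hive-style partition like dt=2021-12-09/
gsutil ls -r gs://foo/bar/
```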
- Build the bundle jars and the assembly jar from the `hudi-gcp-bundle` module
```shell
mvn -T 2.5C clean install -DskipTests -Dcheckstyle.skip=true -Drat.skip=true \
  -Djacoco.skip=true -Dmaven.javadoc.skip=true -Dscala12 -Dspark3.1
mvn assembly:single package -pl packaging/hudi-gcp-bundle
```
- Put the bundle jars and the GCP bundle fat jar in a GS bucket (upload sketch below)
```
gs://xxx/hudi-gcp-bundle-0.11.0-SNAPSHOT-jar-with-dependencies.jar
```
- Go to BigQuery and create a dataset `mydataset`, setting its location to match the GS bucket's (a CLI alternative is sketched below)
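If you prefer the CLI over the console, a `bq` sketch for the same step; project, dataset, and location placeholders are the same as in the sync command below:
```shell
# Create the dataset in the same location as the GS bucket
bq mk --dataset --location=<location> myproject:mydataset
```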
- From the Dataproc server, submit the sync tool job
```shell
spark-submit --master yarn \
--packages org.apache.spark:spark-avro_2.12:3.1.2 \
--class org.apache.hudi.gcp.bigquery.BigQuerySyncTool \
gs://xxx/hudi-gcp-bundle-0.11.0-SNAPSHOT-jar-with-dependencies.jar \
--project-id myproject \
--dataset-name mydataset \
--dataset-location <location> \
--table-name foobar \
--source-uri gs://foo/bar/dt=* \
--source-uri-prefix gs://foo/bar/ \
--base-path gs://foo/bar \
--partitioned-by dt
```
- See the job complete in the logs
```
22/04/02 19:42:59 INFO org.apache.hudi.gcp.bigquery.HoodieBigQuerySyncClient: Manifest External table created.
22/04/02 19:42:59 INFO org.apache.hudi.gcp.bigquery.BigQuerySyncTool: Manifest table creation complete for 20220402t081216_manifest
22/04/02 19:42:59 INFO org.apache.hudi.gcp.bigquery.HoodieBigQuerySyncClient: External table created using hivepartitioningoptions
22/04/02 19:42:59 INFO org.apache.hudi.gcp.bigquery.BigQuerySyncTool: Versions table creation complete for 20220402t081216_versions
22/04/02 19:43:01 INFO org.apache.hudi.gcp.bigquery.HoodieBigQuerySyncClient: View created successfully
22/04/02 19:43:01 INFO org.apache.hudi.gcp.bigquery.BigQuerySyncTool: Snapshot view creation complete for 20220402t081216
22/04/02 19:43:01 INFO org.apache.hudi.gcp.bigquery.BigQuerySyncTool: Sync table complete for 20220402t081216
```
- See the tables created in BigQuery (verification sketch below): there should be two tables (with the `manifest` and `versions` suffixes) and one view. Query the view for the Hudi table. Before https://issues.apache.org/jira/browse/HUDI-3290 lands, manually delete the `.hoodie_partition_metadata` file as a workaround to see query results.
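A hedged verification sketch with the `bq` CLI, plus the manual workaround; the view is assumed here to carry the `--table-name` passed to the sync tool, so adjust the names to what the sync actually created:
```shell
# List what the sync tool created in the dataset (expect two tables and one view)
bq ls myproject:mydataset

# Query the snapshot view backing the Hudi table (view name assumed to be foobar)
bq query --use_legacy_sql=false 'SELECT id, name, ts, dt FROM `myproject.mydataset.foobar` LIMIT 10'

# Workaround until HUDI-3290 lands: delete the partition metadata marker file
gsutil rm gs://foo/bar/dt=2021-12-09/.hoodie_partition_metadata
```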