xushiyan commented on pull request #5125:
URL: https://github.com/apache/hudi/pull/5125#issuecomment-1086715332
## Test setup
- Launch Dataproc 2.0.34-ubuntu18
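A minimal sketch of spinning up a matching cluster with `gcloud`; the cluster name and region are placeholders, only the image version comes from this setup:
```shell
# Hypothetical cluster name and region; image version matches the one tested here
gcloud dataproc clusters create hudi-bq-sync-test \
  --region=us-central1 \
  --image-version=2.0.34-ubuntu18
```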
- From the Dataproc instance, launch spark-shell
```shell
spark-shell \
--jars gs://xxx/hudi-spark3.1-bundle_2.12-0.11.0-SNAPSHOT.jar \
--packages org.apache.spark:spark-avro_2.12:3.1.2 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
```
- Prepare a partitioned table on GS (a quick listing check follows the snippet)
```scala
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
val tableName = "hudi_cow_pt_tbl"
spark.sql(
s"""
|create table $tableName (
| id bigint,
| name string,
| ts bigint,
| dt string
|) using hudi
|tblproperties (
| type = 'cow',
| primaryKey = 'id',
| preCombineField = 'ts',
| hoodie.datasource.write.hive_style_partitioning = 'true',
| hoodie.datasource.write.drop.partition.columns = 'true',
| hoodie.metadata.enable = 'false'
| )
|partitioned by (dt)
|location 'gs://foo/bar'
|""".stripMargin)
spark.sql(
s"""
|insert into $tableName partition (dt)
|select 1 as id, 'a1' as name, 1000 as ts, '2021-12-09' as dt
|""".stripMargin)
```
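Optionally sanity-check the write with a listing; the path is the one used above, and the exact file names will differ:
```shell
# Expect a .hoodie/ metadata directory and a hive-style partition like dt=2021-12-09/
gsutil ls -r gs://foo/bar/
```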
- Build the bundle jars and the assembly jar from the `hudi-gcp-bundle` module
```shell
mvn -T 2.5C clean install -DskipTests -Dcheckstyle.skip=true -Drat.skip=true \
  -Djacoco.skip=true -Dmaven.javadoc.skip=true -Dscala12 -Dspark3.1
mvn assembly:single package -pl packaging/hudi-gcp-bundle
```
- Put the bundle jars and the GCP bundle fat jar in a GS bucket (upload sketch below)
```
gs://xxx/hudi-gcp-bundle-0.11.0-SNAPSHOT-jar-with-dependencies.jar
```
- Go to BigQuery and create a dataset `mydataset`, setting its location to match the GS bucket's (a CLI alternative is sketched below)
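If you prefer the CLI over the console, a `bq` sketch for the same step; project, dataset, and location placeholders are the same as in the sync command below:
```shell
# Create the dataset in the same location as the GS bucket
bq mk --dataset --location=<location> myproject:mydataset
```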
- From the Dataproc server, submit the sync tool job
```shell
spark-submit --master yarn \
--packages org.apache.spark:spark-avro_2.12:3.1.2 \
--class org.apache.hudi.gcp.bigquery.BigQuerySyncTool \
gs://xxx/hudi-gcp-bundle-0.11.0-SNAPSHOT-jar-with-dependencies.jar \
--project-id myproject \
--dataset-name mydataset \
--dataset-location <location> \
--table-name foobar \
--source-uri gs://foo/bar/dt=* \
--source-uri-prefix gs://foo/bar/ \
--base-path gs://foo/bar \
--partitioned-by dt
```
- See the job complete in the logs
```
22/04/02 19:42:59 INFO org.apache.hudi.gcp.bigquery.HoodieBigQuerySyncClient: Manifest External table created.
22/04/02 19:42:59 INFO org.apache.hudi.gcp.bigquery.BigQuerySyncTool: Manifest table creation complete for 20220402t081216_manifest
22/04/02 19:42:59 INFO org.apache.hudi.gcp.bigquery.HoodieBigQuerySyncClient: External table created using hivepartitioningoptions
22/04/02 19:42:59 INFO org.apache.hudi.gcp.bigquery.BigQuerySyncTool: Versions table creation complete for 20220402t081216_versions
22/04/02 19:43:01 INFO org.apache.hudi.gcp.bigquery.HoodieBigQuerySyncClient: View created successfully
22/04/02 19:43:01 INFO org.apache.hudi.gcp.bigquery.BigQuerySyncTool: Snapshot view creation complete for 20220402t081216
22/04/02 19:43:01 INFO org.apache.hudi.gcp.bigquery.BigQuerySyncTool: Sync table complete for 20220402t081216
```
- See the tables created in BigQuery (verification sketch below): there should be two tables (with the `manifest` and `versions` suffixes) and one view. Query the view for the Hudi table. Before https://issues.apache.org/jira/browse/HUDI-3290 lands, manually delete the `.hoodie_partition_metadata` file as a workaround to see query results.
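A hedged verification sketch with the `bq` CLI, plus the manual workaround; the view is assumed here to carry the `--table-name` passed to the sync tool, so adjust the names to what the sync actually created:
```shell
# List what the sync tool created in the dataset (expect two tables and one view)
bq ls myproject:mydataset

# Query the snapshot view backing the Hudi table (view name assumed to be foobar)
bq query --use_legacy_sql=false 'SELECT id, name, ts, dt FROM `myproject.mydataset.foobar` LIMIT 10'

# Workaround until HUDI-3290 lands: delete the partition metadata marker file
gsutil rm gs://foo/bar/dt=2021-12-09/.hoodie_partition_metadata
```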