This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 41d1021f8a7 [HUDI-4967][HUDI-4834] Improve docs for hive sync and glue
sync (#11402)
41d1021f8a7 is described below
commit 41d1021f8a70f9c2f2bdc049e514510b4ea1053e
Author: Shiyan Xu <[email protected]>
AuthorDate: Wed Jun 5 20:05:18 2024 -0500
[HUDI-4967][HUDI-4834] Improve docs for hive sync and glue sync (#11402)
---
website/docs/syncing_aws_glue_data_catalog.md | 51 +++++-
website/docs/syncing_metastore.md | 235 ++++++++++++++------------
2 files changed, 176 insertions(+), 110 deletions(-)
diff --git a/website/docs/syncing_aws_glue_data_catalog.md
b/website/docs/syncing_aws_glue_data_catalog.md
index e54c6d52887..b6f6c82a6c5 100644
--- a/website/docs/syncing_aws_glue_data_catalog.md
+++ b/website/docs/syncing_aws_glue_data_catalog.md
@@ -7,22 +7,61 @@ Hudi tables can sync to AWS Glue Data Catalog directly via
AWS SDK. Piggyback on
`org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool` makes use of all the
configurations that are taken by `HiveSyncTool`
and sends them to AWS Glue.
-### Configurations
+## Configurations
-There is no additional configuration for using `AwsGlueCatalogSyncTool`; you
just need to set it as one of the sync tool
-classes for `HoodieStreamer` and everything configured as shown in [Sync to
Hive Metastore](syncing_metastore) will
-be passed along.
+Most of the configurations for `AwsGlueCatalogSyncTool` are shared with `HiveSyncTool`. The example shown in
+[Sync to Hive Metastore](syncing_metastore) can be used as-is for syncing with Glue Data Catalog, provided that the Hive metastore
+URL (either JDBC or Thrift URI) can be proxied to Glue Data Catalog, which is usually the case within an AWS EMR or Glue job environment.
+
+For `HoodieStreamer`, users can set
```shell
--sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
```
-#### Running AWS Glue Catalog Sync for Spark DataSource
+For Spark data source writers, users can set
+
+```shell
+hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
+```
+
+### Avoid creating excessive versions
+
+Tables stored in Glue Data Catalog are versioned. By default, if sync is enabled, every Hudi commit triggers a sync operation, regardless of whether there are relevant metadata changes.
+This can lead to too many versions being kept in the catalog and can eventually cause the sync operation to fail.
+
+Meta-sync can be set to conditional: only sync when there is a schema change or partition change. This avoids creating
+excessive versions in the catalog. Users can enable it by setting
+
+```
+hoodie.datasource.meta_sync.condition.sync=true
+```
+
+### Glue Data Catalog specific configs
+
+Syncing to Glue Data Catalog can be optimized with additional configs such as
+
+```
+hoodie.datasource.meta.sync.glue.all_partitions_read_parallelism
+hoodie.datasource.meta.sync.glue.changed_partitions_read_parallelism
+hoodie.datasource.meta.sync.glue.partition_change_parallelism
+```
+
+[Partition
indexes](https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html) can
also be used by setting
+
+```
+hoodie.datasource.meta.sync.glue.partition_index_fields.enable
+hoodie.datasource.meta.sync.glue.partition_index_fields
+```
+
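+For illustration, the conditional sync and partition index settings above could be combined as follows. Note that the `partitionId` field name is only an example here; use your table's actual partition field:
+
+```
+hoodie.datasource.meta_sync.condition.sync=true
+hoodie.datasource.meta.sync.glue.partition_index_fields.enable=true
+hoodie.datasource.meta.sync.glue.partition_index_fields=partitionId
+```
+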
+## Other references
+
+### Running AWS Glue Catalog Sync for Spark DataSource
To write a Hudi table to Amazon S3 and catalog it in AWS Glue Data Catalog,
you can use the options mentioned in the
[AWS
documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html#aws-glue-programming-etl-format-hudi-write)
-#### Running AWS Glue Catalog Sync from EMR
+### Running AWS Glue Catalog Sync from EMR
If you're running HiveSyncTool on an EMR cluster backed by Glue Data Catalog
as the external metastore, you can run the sync from the command line as shown below:
diff --git a/website/docs/syncing_metastore.md
b/website/docs/syncing_metastore.md
index e39c5f39337..2aada772a6a 100644
--- a/website/docs/syncing_metastore.md
+++ b/website/docs/syncing_metastore.md
@@ -10,6 +10,118 @@ Hive metastore as well. This unlocks the capability to
query Hudi tables not onl
interactive query engines such as Presto and Trino. In this document, we will
go through different ways to sync the Hudi
table to Hive metastore.
+## Spark Data Source example
+
+Prerequisites: set up the Hive metastore properly and configure the Spark
installation to point to it by placing `hive-site.xml` under
`$SPARK_HOME/conf`.
+
+Assume that
+ - hiveserver2 is running at port 10000
+ - metastore is running at port 9083
+
+Then start a spark-shell with the Hudi Spark bundle jar as a dependency (refer to
the Quickstart example).
+
+We can run the following script to create a sample Hudi table and sync it to
Hive.
+
+```scala
+// spark-shell
+import org.apache.hudi.QuickstartUtils._
+import scala.collection.JavaConversions._
+import org.apache.spark.sql.SaveMode._
+import org.apache.hudi.DataSourceReadOptions._
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.config.HoodieWriteConfig._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.Row
+
+
+val databaseName = "my_db"
+val tableName = "hudi_cow"
+val basePath = "/user/hive/warehouse/hudi_cow"
+
+val schema = StructType(Array(
+  StructField("rowId", StringType, true),
+  StructField("partitionId", StringType, true),
+  StructField("preComb", LongType, true),
+  StructField("name", StringType, true),
+  StructField("versionId", StringType, true),
+  StructField("toBeDeletedStr", StringType, true),
+  StructField("intToLong", IntegerType, true),
+  StructField("longToInt", LongType, true)
+))
+
+val data0 = Seq(Row("row_1",
"2021/01/01",0L,"bob","v_0","toBeDel0",0,1000000L),
+ Row("row_2",
"2021/01/01",0L,"john","v_0","toBeDel0",0,1000000L),
+ Row("row_3", "2021/01/02",0L,"tom","v_0","toBeDel0",0,1000000L))
+
+var dfFromData0 = spark.createDataFrame(data0,schema)
+
+dfFromData0.write.format("hudi").
+ options(getQuickstartWriteConfigs).
+ option("hoodie.datasource.write.precombine.field", "preComb").
+ option("hoodie.datasource.write.recordkey.field", "rowId").
+ option("hoodie.datasource.write.partitionpath.field", "partitionId").
+ option("hoodie.database.name", databaseName).
+ option("hoodie.table.name", tableName).
+ option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
+ option("hoodie.datasource.write.operation", "upsert").
+ option("hoodie.datasource.write.hive_style_partitioning","true").
+ option("hoodie.datasource.meta.sync.enable", "true").
+ option("hoodie.datasource.hive_sync.mode", "hms").
+ option("hoodie.datasource.hive_sync.metastore.uris",
"thrift://hive-metastore:9083").
+ mode(Overwrite).
+ save(basePath)
+```
+
+:::note
+If you prefer to use JDBC instead of HMS sync mode, omit
`hoodie.datasource.hive_sync.metastore.uris` and configure these instead:
+
+```
+hoodie.datasource.hive_sync.mode=jdbc
+hoodie.datasource.hive_sync.jdbcurl=<e.g., jdbc:hive2://hiveserver:10000>
+hoodie.datasource.hive_sync.username=<username>
+hoodie.datasource.hive_sync.password=<password>
+```
+:::
+
+### Query using HiveQL
+
+```
+beeline -u jdbc:hive2://hiveserver:10000/my_db \
+ --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
+ --hiveconf hive.stats.autogather=false
+
+Beeline version 1.2.1.spark2 by Apache Hive
+0: jdbc:hive2://hiveserver:10000> show tables;
++-----------+--+
+| tab_name |
++-----------+--+
+| hudi_cow |
++-----------+--+
+1 row selected (0.531 seconds)
+0: jdbc:hive2://hiveserver:10000> select * from hudi_cow limit 1;
++-------------------------------+--------------------------------+------------------------------+----------------------------------+----------------------------------------------------------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+--+
+| hudi_cow._hoodie_commit_time | hudi_cow._hoodie_commit_seqno |
hudi_cow._hoodie_record_key | hudi_cow._hoodie_partition_path |
hudi_cow._hoodie_file_name | hudi_cow.rowid
| hudi_cow.precomb | hudi_cow.name | hudi_cow.versionid |
hudi_cow.tobedeletedstr | hudi_cow.inttolong | hudi_cow.longtoint |
hudi_cow.partitionid |
++-------------------------------+--------------------------------+------------------------------+----------------------------------+----------------------------------------------------------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+--+
+| 20220120090023631 | 20220120090023631_1_2 | row_1
| partitionId=2021/01/01 |
0bf9b822-928f-4a57-950a-6a5450319c83-0_1-24-314_20220120090023631.parquet |
row_1 | 0 | bob | v_0 |
toBeDel0 | 0 | 1000000 |
2021/01/01 |
++-------------------------------+--------------------------------+------------------------------+----------------------------------+----------------------------------------------------------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+--+
+1 row selected (5.475 seconds)
+0: jdbc:hive2://hiveserver:10000>
+```
+
+### Use partition extractor properly
+
+When syncing to Hive metastore, partition values are extracted using the class
configured via `hoodie.datasource.hive_sync.partition_value_extractor`. Before 0.12, this was
by default set to
+`org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor`, and users
usually needed to override it manually. Since 0.12, the default value has changed to the more
+generic `org.apache.hudi.hive.MultiPartKeysValueExtractor`, which extracts
partition values using `/` as the separator.
+
+When using a key generator like `TimestampBasedKeyGenerator`, the
partition values can be in the form `yyyy/MM/dd`. It is usually undesirable to
have such
+partition values extracted as multiple parts like `[yyyy, MM, dd]`. Users can
set `org.apache.hudi.hive.SinglePartPartitionValueExtractor` to extract the
partition
+values as a single part, e.g. `yyyy-MM-dd`.
+
+When the table is not partitioned,
`org.apache.hudi.hive.NonPartitionedExtractor` should be set. This is
automatically inferred from the partition fields configs,
+so users may not need to set it manually. Similarly, if Hive-style
partitioning is used for the table, then
`org.apache.hudi.hive.HiveStylePartitionValueExtractor`
+will be inferred and set automatically.
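+
+As a sketch, for a table whose partition path values look like `yyyy/MM/dd` (for example, as produced by `TimestampBasedKeyGenerator`), the extractor could be set explicitly like so:
+
+```
+hoodie.datasource.hive_sync.partition_value_extractor=org.apache.hudi.hive.SinglePartPartitionValueExtractor
+```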
+
## Hive Sync Tool
Writing data with [DataSource](/docs/writing_data) writer or
[HoodieStreamer](/docs/hoodie_streaming_ingestion) supports syncing of the
table's latest schema to Hive metastore, such that queries can pick up new
columns and partitions.
@@ -36,7 +148,7 @@ Among them, following are the required arguments:
```java
@Parameter(names = {"--database"}, description = "name of the target database
in Hive", required = true);
@Parameter(names = {"--table"}, description = "name of the target table in
Hive", required = true);
-@Parameter(names = {"--base-path"}, description = "Basepath of hoodie table to
sync", required = true);## Sync modes
+@Parameter(names = {"--base-path"}, description = "Basepath of Hudi table to
sync", required = true);
```
Corresponding datasource options for the most commonly used hive sync configs
are as follows:
@@ -46,7 +158,7 @@ In the table below **(N/A)** means there is no default value
set.
| HiveSyncConfig | DataSourceWriteOption | Default Value | Description |
| ----------- | ----------- | ----------- | ----------- |
-| --database | hoodie.datasource.hive_sync.database | default | Name
of the target database in Hive |
+| --database | hoodie.datasource.hive_sync.database | default | Name
of the target database in Hive metastore |
| --table | hoodie.datasource.hive_sync.table | (N/A) | Name of the
target table in Hive. Inferred from the table name in Hudi table config if not
specified. |
| --user | hoodie.datasource.hive_sync.username | hive | Username for
hive metastore |
| --pass | hoodie.datasource.hive_sync.password | hive | Password for
hive metastore |
@@ -62,9 +174,11 @@ In the table below **(N/A)** means there is no default
value set.
These modes are just three different ways of executing DDL against Hive. Among
these modes, JDBC or HMS is preferable over
HIVEQL, which is mostly used for running DML rather than DDL.
-> Note: All these modes assume that hive metastore has been configured and the
corresponding properties set in
-> hive-site.xml configuration file. Additionally, if you're using
spark-shell/spark-sql to sync Hudi table to Hive then
-> the hive-site.xml file also needs to be placed under `<SPARK_HOME>/conf`
directory.
+:::note
+All these modes assume that the Hive metastore has been configured and the
corresponding properties are set in the
+`hive-site.xml` configuration file. Additionally, if you're using
spark-shell/spark-sql to sync a Hudi table to Hive, then
+the `hive-site.xml` file also needs to be placed under the `<SPARK_HOME>/conf`
directory.
+:::
#### HMS
@@ -73,23 +187,23 @@ To use this mode, pass `--sync-mode=hms` to
`run_sync_tool` and set `--use-jdbc=
Additionally, if you are using a remote metastore, then `hive.metastore.uris`
needs to be set in the hive-site.xml configuration file.
Otherwise, the tool assumes that the metastore is running locally on port 9083 by
default.
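
A minimal invocation sketch for HMS mode, based on the required arguments listed above. The database, table, and base path values are placeholders from the earlier example; substitute your own, and adjust the path to the `run_sync_tool` script for your installation:

```shell
./run_sync_tool.sh \
  --sync-mode hms \
  --use-jdbc false \
  --database my_db \
  --table hudi_cow \
  --base-path /user/hive/warehouse/hudi_cow
```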
-#### HIVEQL
-
-HQL is Hive's own SQL dialect.
-This mode simply uses the Hive QL's
[driver](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/Driver.java)
to execute DDL as HQL command.
-To use this mode, pass `--sync-mode=hiveql` to `run_sync_tool` and set
`--use-jdbc=false`.
-
#### JDBC
This mode uses the JDBC specification to connect to the hive metastore.
-To use this mode, just pass the jdbc url to the hive server (`--use-jdbc` is
true by default).
+
```java
@Parameter(names = {"--jdbc-url"}, description = "Hive jdbc connect url");
```
-### Flink Setup
+#### HIVEQL
+
+HQL is Hive's own SQL dialect.
+This mode simply uses Hive QL's
[driver](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/Driver.java)
to execute DDL as an HQL command.
+To use this mode, pass `--sync-mode=hiveql` to `run_sync_tool` and set
`--use-jdbc=false`.
+
+## Flink Setup
-#### Install
+### Install
You can git clone the Hudi master branch to test Flink hive sync. The first
step is to install Hudi to get `hudi-flink1.1x-bundle-0.x.x.jar`.
`hudi-flink-bundle` module pom.xml sets the scope related to hive as
`provided` by default. If you want to use hive sync, you need to use the
@@ -112,7 +226,7 @@ If using hive profile, you need to modify the hive version
in the profile to you
The location of this `pom.xml` is `packaging/hudi-flink-bundle/pom.xml`, and
the corresponding profile is at the bottom of this file.
:::
-#### Hive Environment
+### Hive Environment
1. Import `hudi-hadoop-mr-bundle` into hive: create an `auxlib/` folder under
the root directory of Hive, and move
`hudi-hadoop-mr-bundle-0.x.x-SNAPSHOT.jar` into `auxlib`.
`hudi-hadoop-mr-bundle-0.x.x-SNAPSHOT.jar` is at
`packaging/hudi-hadoop-mr-bundle/target`.
@@ -128,7 +242,7 @@ nohup ./bin/hive --service hiveserver2 &
# While modifying the jar package under auxlib, you need to restart the
service.
```
-#### Sync Template
+### Sync Template
Flink hive sync now supports two hive sync modes, `hms` and `jdbc`. The `hms` mode
only needs the metastore URIs to be configured. For
the `jdbc` mode, both the JDBC attributes and the metastore URIs need to be
configured. The options template is as below:
@@ -177,96 +291,9 @@ WITH (
);
```
-#### Query
+### Query
While using hive beeline query, you need to enter settings:
```bash
set hive.input.format =
org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
```
-
-### Spark datasource example
-
-Assuming the metastore is configured properly, then start the spark-shell.
-
-```
-$SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE \
- --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
-```
-
-We can run the following script to create a sample hudi table and sync it to
hive.
-
-```scala
-// spark-shell
-import org.apache.hudi.QuickstartUtils._
-import scala.collection.JavaConversions._
-import org.apache.spark.sql.SaveMode._
-import org.apache.hudi.DataSourceReadOptions._
-import org.apache.hudi.DataSourceWriteOptions._
-import org.apache.hudi.config.HoodieWriteConfig._
-import org.apache.spark.sql.types._
-import org.apache.spark.sql.Row
-
-
-val tableName = "hudi_cow"
-val basePath = "/user/hive/warehouse/hudi_cow"
-
-val schema = StructType(Array(
-StructField("rowId", StringType,true),
-StructField("partitionId", StringType,true),
-StructField("preComb", LongType,true),
-StructField("name", StringType,true),
-StructField("versionId", StringType,true),
-StructField("toBeDeletedStr", StringType,true),
-StructField("intToLong", IntegerType,true),
-StructField("longToInt", LongType,true)
-))
-
-val data0 = Seq(Row("row_1",
"2021/01/01",0L,"bob","v_0","toBeDel0",0,1000000L),
- Row("row_2",
"2021/01/01",0L,"john","v_0","toBeDel0",0,1000000L),
- Row("row_3", "2021/01/02",0L,"tom","v_0","toBeDel0",0,1000000L))
-
-var dfFromData0 = spark.createDataFrame(data0,schema)
-
-dfFromData0.write.format("hudi").
- options(getQuickstartWriteConfigs).
- option("hoodie.datasource.write.precombine.field", "preComb").
- option("hoodie.datasource.write.recordkey.field", "rowId").
- option("hoodie.datasource.write.partitionpath.field", "partitionId").
- option("hoodie.table.name", tableName).
- option("hoodie.datasource.write.table.type", 'COPY_ON_WRITE').
- option("hoodie.datasource.write.operation", "upsert").
- option("hoodie.index.type","SIMPLE").
- option("hoodie.datasource.write.hive_style_partitioning","true").
-
option("hoodie.datasource.hive_sync.jdbcurl","jdbc:hive2://hiveserver:10000/").
- option("hoodie.datasource.hive_sync.database","default").
- option("hoodie.datasource.hive_sync.table","hudi_cow").
- option("hoodie.datasource.hive_sync.partition_fields","partitionId").
- option("hoodie.datasource.hive_sync.enable","true").
- mode(Overwrite).
- save(basePath)
-```
-
-To query, connect to the hive server.
-
-```
-beeline -u jdbc:hive2://hiveserver:10000 \
- --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
- --hiveconf hive.stats.autogather=false
-
-Beeline version 1.2.1.spark2 by Apache Hive
-0: jdbc:hive2://hiveserver:10000> show tables;
-+-----------+--+
-| tab_name |
-+-----------+--+
-| hudi_cow |
-+-----------+--+
-1 row selected (0.531 seconds)
-0: jdbc:hive2://hiveserver:10000> select * from hudi_cow limit 1;
-+-------------------------------+--------------------------------+------------------------------+----------------------------------+----------------------------------------------------------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+--+
-| hudi_cow._hoodie_commit_time | hudi_cow._hoodie_commit_seqno |
hudi_cow._hoodie_record_key | hudi_cow._hoodie_partition_path |
hudi_cow._hoodie_file_name | hudi_cow.rowid
| hudi_cow.precomb | hudi_cow.name | hudi_cow.versionid |
hudi_cow.tobedeletedstr | hudi_cow.inttolong | hudi_cow.longtoint |
hudi_cow.partitionid |
-+-------------------------------+--------------------------------+------------------------------+----------------------------------+----------------------------------------------------------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+--+
-| 20220120090023631 | 20220120090023631_1_2 | row_1
| partitionId=2021/01/01 |
0bf9b822-928f-4a57-950a-6a5450319c83-0_1-24-314_20220120090023631.parquet |
row_1 | 0 | bob | v_0 |
toBeDel0 | 0 | 1000000 |
2021/01/01 |
-+-------------------------------+--------------------------------+------------------------------+----------------------------------+----------------------------------------------------------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+--+
-1 row selected (5.475 seconds)
-0: jdbc:hive2://hiveserver:10000>
-```