This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 41d1021f8a7 [HUDI-4967][HUDI-4834] Improve docs for hive sync and glue
sync (#11402)
41d1021f8a7 is described below
commit 41d1021f8a70f9c2f2bdc049e514510b4ea1053e
Author: Shiyan Xu <[email protected]>
AuthorDate: Wed Jun 5 20:05:18 2024 -0500
[HUDI-4967][HUDI-4834] Improve docs for hive sync and glue sync (#11402)
---
website/docs/syncing_aws_glue_data_catalog.md | 51 +++++-
website/docs/syncing_metastore.md | 235 ++++++++++++++------------
2 files changed, 176 insertions(+), 110 deletions(-)
diff --git a/website/docs/syncing_aws_glue_data_catalog.md
b/website/docs/syncing_aws_glue_data_catalog.md
index e54c6d52887..b6f6c82a6c5 100644
--- a/website/docs/syncing_aws_glue_data_catalog.md
+++ b/website/docs/syncing_aws_glue_data_catalog.md
@@ -7,22 +7,61 @@ Hudi tables can sync to AWS Glue Data Catalog directly via
AWS SDK. Piggyback on
`org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool` makes use of all the
configurations that are taken by `HiveSyncTool`
and sends them to AWS Glue.
-### Configurations
+## Configurations
-There is no additional configuration for using `AwsGlueCatalogSyncTool`; you
just need to set it as one of the sync tool
-classes for `HoodieStreamer` and everything configured as shown in [Sync to
Hive Metastore](syncing_metastore) will
-be passed along.
+Most of the configurations for `AwsGlueCatalogSyncTool` are shared with `HiveSyncTool`. The example shown in
+[Sync to Hive Metastore](syncing_metastore) can be used as-is for syncing with Glue Data Catalog, provided that the Hive metastore
+URL (either JDBC or Thrift URI) can be proxied to Glue Data Catalog, which is usually the case within an AWS EMR or Glue job environment.
+
+For `HoodieStreamer`, users can set
```shell
--sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
```
-#### Running AWS Glue Catalog Sync for Spark DataSource
+For Spark data source writers, users can set
+
+```shell
+hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
+```
+
+### Avoid creating excessive versions
+
+Tables stored in Glue Data Catalog are versioned. By default, if sync is enabled, every Hudi commit triggers a sync operation, regardless of whether there are relevant metadata changes.
+This can lead to too many versions being kept in the catalog and can eventually cause the sync operation to fail.
+
+Meta-sync can be set to conditional: only sync when there is a schema change or partition change. This avoids creating
+excessive versions in the catalog. Users can enable it by setting
+
+```
+hoodie.datasource.meta_sync.condition.sync=true
+```
+
+### Glue Data Catalog specific configs
+
+Syncing to Glue Data Catalog can be optimized with additional configs such as
+
+```
+hoodie.datasource.meta.sync.glue.all_partitions_read_parallelism
+hoodie.datasource.meta.sync.glue.changed_partitions_read_parallelism
+hoodie.datasource.meta.sync.glue.partition_change_parallelism
+```
+
+[Partition
indexes](https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html) can
also be used by setting
+
+```
+hoodie.datasource.meta.sync.glue.partition_index_fields.enable
+hoodie.datasource.meta.sync.glue.partition_index_fields
+```
+
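+For illustration, the conditional sync and partition index settings above could be combined as follows. Note that the `partitionId` field name is only an example here; use your table's actual partition field:
+
+```
+hoodie.datasource.meta_sync.condition.sync=true
+hoodie.datasource.meta.sync.glue.partition_index_fields.enable=true
+hoodie.datasource.meta.sync.glue.partition_index_fields=partitionId
+```
+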
+## Other references
+
+### Running AWS Glue Catalog Sync for Spark DataSource
To write a Hudi table to Amazon S3 and catalog it in AWS Glue Data Catalog,
you can use the options mentioned in the
[AWS
documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html#aws-glue-programming-etl-format-hudi-write)
-#### Running AWS Glue Catalog Sync from EMR
+### Running AWS Glue Catalog Sync from EMR
If you're running HiveSyncTool on an EMR cluster backed by Glue Data Catalog
as the external metastore, you can run the sync from the command line as shown below:
diff --git a/website/docs/syncing_metastore.md
b/website/docs/syncing_metastore.md
index e39c5f39337..2aada772a6a 100644
--- a/website/docs/syncing_metastore.md
+++ b/website/docs/syncing_metastore.md
@@ -10,6 +10,118 @@ Hive metastore as well. This unlocks the capability to
query Hudi tables not onl
interactive query engines such as Presto and Trino. In this document, we will
go through different ways to sync the Hudi
table to Hive metastore.
+## Spark Data Source example
+
+Prerequisites: set up the Hive metastore properly and configure the Spark
installation to point to it by placing `hive-site.xml` under
`$SPARK_HOME/conf`.
+
+Assume that
+ - hiveserver2 is running at port 10000
+ - metastore is running at port 9083
+
+Then start a spark-shell with the Hudi Spark bundle jar as a dependency (refer to
the Quickstart example).
+
+We can run the following script to create a sample Hudi table and sync it to
Hive.
+
+```scala
+// spark-shell
+import org.apache.hudi.QuickstartUtils._
+import scala.collection.JavaConversions._
+import org.apache.spark.sql.SaveMode._
+import org.apache.hudi.DataSourceReadOptions._
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.config.HoodieWriteConfig._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.Row
+
+
+val databaseName = "my_db"
+val tableName = "hudi_cow"
+val basePath = "/user/hive/warehouse/hudi_cow"
+
+val schema = StructType(Array(
+  StructField("rowId", StringType, true),
+  StructField("partitionId", StringType, true),
+  StructField("preComb", LongType, true),
+  StructField("name", StringType, true),
+  StructField("versionId", StringType, true),
+  StructField("toBeDeletedStr", StringType, true),
+  StructField("intToLong", IntegerType, true),
+  StructField("longToInt", LongType, true)
+))
+
+val data0 = Seq(Row("row_1",
"2021/01/01",0L,"bob","v_0","toBeDel0",0,1000000L),
+ Row("row_2",
"2021/01/01",0L,"john","v_0","toBeDel0",0,1000000L),
+ Row("row_3", "2021/01/02",0L,"tom","v_0","toBeDel0",0,1000000L))
+
+var dfFromData0 = spark.createDataFrame(data0,schema)
+
+dfFromData0.write.format("hudi").
+ options(getQuickstartWriteConfigs).
+ option("hoodie.datasource.write.precombine.field", "preComb").
+ option("hoodie.datasource.write.recordkey.field", "rowId").
+ option("hoodie.datasource.write.partitionpath.field", "partitionId").
+ option("hoodie.database.name", databaseName).
+ option("hoodie.table.name", tableName).
+ option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
+ option("hoodie.datasource.write.operation", "upsert").
+ option("hoodie.datasource.write.hive_style_partitioning","true").
+ option("hoodie.datasource.meta.sync.enable", "true").
+ option("hoodie.datasource.hive_sync.mode", "hms").
+ option("hoodie.datasource.hive_sync.metastore.uris",
"thrift://hive-metastore:9083").
+ mode(Overwrite).
+ save(basePath)
+```
+
+:::note
+If you prefer to use JDBC instead of HMS sync mode, omit
`hoodie.datasource.hive_sync.metastore.uris` and configure these instead:
+
+```
+hoodie.datasource.hive_sync.mode=jdbc
+hoodie.datasource.hive_sync.jdbcurl=<e.g., jdbc:hive2://hiveserver:10000>
+hoodie.datasource.hive_sync.username=<username>
+hoodie.datasource.hive_sync.password=<password>
+```
+:::
+
+### Query using HiveQL
+
+```
+beeline -u jdbc:hive2://hiveserver:10000/my_db \
+ --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
+ --hiveconf hive.stats.autogather=false
+
+Beeline version 1.2.1.spark2 by Apache Hive
+0: jdbc:hive2://hiveserver:10000> show tables;
++-----------+--+
+| tab_name |
++-----------+--+
+| hudi_cow |
++-----------+--+
+1 row selected (0.531 seconds)
+0: jdbc:hive2://hiveserver:10000> select * from hudi_cow limit 1;
++-------------------------------+--------------------------------+------------------------------+----------------------------------+----------------------------------------------------------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+--+
+| hudi_cow._hoodie_commit_time | hudi_cow._hoodie_commit_seqno |
hudi_cow._hoodie_record_key | hudi_cow._hoodie_partition_path |
hudi_cow._hoodie_file_name | hudi_cow.rowid
| hudi_cow.precomb | hudi_cow.name | hudi_cow.versionid |
hudi_cow.tobedeletedstr | hudi_cow.inttolong | hudi_cow.longtoint |
hudi_cow.partitionid |
++-------------------------------+--------------------------------+------------------------------+----------------------------------+----------------------------------------------------------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+--+
+| 20220120090023631 | 20220120090023631_1_2 | row_1
| partitionId=2021/01/01 |
0bf9b822-928f-4a57-950a-6a5450319c83-0_1-24-314_20220120090023631.parquet |
row_1 | 0 | bob | v_0 |
toBeDel0 | 0 | 1000000 |
2021/01/01 |
++-------------------------------+--------------------------------+------------------------------+----------------------------------+----------------------------------------------------------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+--+
+1 row selected (5.475 seconds)
+0: jdbc:hive2://hiveserver:10000>
+```
+
+### Use partition extractor properly
+
+When syncing to Hive metastore, partition values are extracted using the class
configured via `hoodie.datasource.hive_sync.partition_value_extractor`. Before 0.12, this was
by default set to
+`org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor`, and users
usually needed to override it manually. Since 0.12, the default value has changed to the more
+generic `org.apache.hudi.hive.MultiPartKeysValueExtractor`, which extracts
partition values using `/` as the separator.
+
+When using a key generator like `TimestampBasedKeyGenerator`, the
partition values can be in the form `yyyy/MM/dd`. It is usually undesirable to
have such
+partition values extracted as multiple parts like `[yyyy, MM, dd]`. Users can
set `org.apache.hudi.hive.SinglePartPartitionValueExtractor` to extract the
partition
+values as a single part, e.g. `yyyy-MM-dd`.
+
+When the table is not partitioned,
`org.apache.hudi.hive.NonPartitionedExtractor` should be set. This is
automatically inferred from the partition fields configs,
+so users may not need to set it manually. Similarly, if Hive-style
partitioning is used for the table, then
`org.apache.hudi.hive.HiveStylePartitionValueExtractor`
+will be inferred and set automatically.
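+
+As a sketch, for a table whose partition path values look like `yyyy/MM/dd` (for example, as produced by `TimestampBasedKeyGenerator`), the extractor could be set explicitly like so:
+
+```
+hoodie.datasource.hive_sync.partition_value_extractor=org.apache.hudi.hive.SinglePartPartitionValueExtractor
+```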
+
## Hive Sync Tool
Writing data with [DataSource](/docs/writing_data) writer or
[HoodieStreamer](/docs/hoodie_streaming_ingestion) supports syncing of the
table's latest schema to Hive metastore, such that queries can pick up new
columns and partitions.
@@ -36,7 +148,7 @@ Among them, following are the required arguments:
```java
@Parameter(names = {"--database"}, description = "name of the target database
in Hive", required = true);
@Parameter(names = {"--table"}, description = "name of the target table in
Hive", required = true);
-@Parameter(names = {"--base-path"}, description = "Basepath of hoodie table to
sync", required = true);## Sync modes
+@Parameter(names = {"--base-path"}, description = "Basepath of Hudi table to
sync", required = true);
```
Corresponding datasource options for the most commonly used hive sync configs
are as follows:
@@ -46,7 +158,7 @@ In the table below **(N/A)** means there is no default value
set.
| HiveSyncConfig | DataSourceWriteOption | Default Value | Description |
| ----------- | ----------- | ----------- | ----------- |
-| --database | hoodie.datasource.hive_sync.database | default | Name
of the target database in Hive |
+| --database | hoodie.datasource.hive_sync.database | default | Name
of the target database in Hive metastore |
| --table | hoodie.datasource.hive_sync.table | (N/A) | Name of the
target table in Hive. Inferred from the table name in Hudi table config if not
specified. |
| --user | hoodie.datasource.hive_sync.username | hive | Username for
hive metastore |
| --pass | hoodie.datasource.hive_sync.password | hive | Password for
hive metastore |
@@ -62,9 +174,11 @@ In the table below **(N/A)** means there is no default
value set.
These modes are just three different ways of executing DDL against Hive. Among
these modes, JDBC or HMS is preferable over
HIVEQL, which is mostly used for running DML rather than DDL.
-> Note: All these modes assume that hive metastore has been configured and the
corresponding properties set in
-> hive-site.xml configuration file. Additionally, if you're using
spark-shell/spark-sql to sync Hudi table to Hive then
-> the hive-site.xml file also needs to be placed under `<SPARK_HOME>/conf`
directory.
+:::note
+All these modes assume that the Hive metastore has been configured and the
corresponding properties are set in the
+`hive-site.xml` configuration file. Additionally, if you're using
spark-shell/spark-sql to sync a Hudi table to Hive, then
+the `hive-site.xml` file also needs to be placed under the `<SPARK_HOME>/conf`
directory.
+:::
#### HMS
@@ -73,23 +187,23 @@ To use this mode, pass `--sync-mode=hms` to
`run_sync_tool` and set `--use-jdbc=
Additionally, if you are using a remote metastore, then `hive.metastore.uris`
needs to be set in the hive-site.xml configuration file.
Otherwise, the tool assumes that the metastore is running locally on port 9083 by
default.
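
A minimal invocation sketch for HMS mode, based on the required arguments listed above. The database, table, and base path values are placeholders from the earlier example; substitute your own, and adjust the path to the `run_sync_tool` script for your installation:

```shell
./run_sync_tool.sh \
  --sync-mode hms \
  --use-jdbc false \
  --database my_db \
  --table hudi_cow \
  --base-path /user/hive/warehouse/hudi_cow
```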
-#### HIVEQL
-
-HQL is Hive's own SQL dialect.
-This mode simply uses the Hive QL's
[driver](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/Driver.java)
to execute DDL as HQL command.
-To use this mode, pass `--sync-mode=hiveql` to `run_sync_tool` and set
`--use-jdbc=false`.
-
#### JDBC
This mode uses the JDBC specification to connect to the hive metastore.
-To use this mode, just pass the jdbc url to the hive server (`--use-jdbc` is
true by default).
+
```java
@Parameter(names = {"--jdbc-url"}, description = "Hive jdbc connect url");
```
-### Flink Setup
+#### HIVEQL
+
+HQL is Hive's own SQL dialect.
+This mode simply uses Hive QL's
[driver](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/Driver.java)
to execute DDL as an HQL command.
+To use this mode, pass `--sync-mode=hiveql` to `run_sync_tool` and set
`--use-jdbc=false`.
+
+## Flink Setup
-#### Install
+### Install
You can git clone the Hudi master branch to test Flink hive sync. The first
step is to install Hudi to get `hudi-flink1.1x-bundle-0.x.x.jar`.
`hudi-flink-bundle` module pom.xml sets the scope related to hive as
`provided` by default. If you want to use hive sync, you need to use the
@@ -112,7 +226,7 @@ If using hive profile, you need to modify the hive version
in the profile to you
The location of this `pom.xml` is `packaging/hudi-flink-bundle/pom.xml`, and
the corresponding profile is at the bottom of this file.
:::
-#### Hive Environment
+### Hive Environment
1. Import `hudi-hadoop-mr-bundle` into hive: create an `auxlib/` folder under
the root directory of Hive, and move
`hudi-hadoop-mr-bundle-0.x.x-SNAPSHOT.jar` into `auxlib`.
`hudi-hadoop-mr-bundle-0.x.x-SNAPSHOT.jar` is at
`packaging/hudi-hadoop-mr-bundle/target`.
@@ -128,7 +242,7 @@ nohup ./bin/hive --service hiveserver2 &
# While modifying the jar package under auxlib, you need to restart the
service.
```
-#### Sync Template
+### Sync Template
Flink hive sync now supports two hive sync modes, `hms` and `jdbc`. The `hms` mode
only needs the metastore URIs to be configured. For
the `jdbc` mode, both the JDBC attributes and the metastore URIs need to be
configured. The options template is as below:
@@ -177,96 +291,9 @@ WITH (
);
```
-#### Query
+### Query
While using hive beeline query, you need to enter settings:
```bash
set hive.input.format =
org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
```
-
-### Spark datasource example
-
-Assuming the metastore is configured properly, then start the spark-shell.
-
-```
-$SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE \
- --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
-```
-
-We can run the following script to create a sample hudi table and sync it to
hive.
-
-```scala
-// spark-shell
-import org.apache.hudi.QuickstartUtils._
-import scala.collection.JavaConversions._
-import org.apache.spark.sql.SaveMode._
-import org.apache.hudi.DataSourceReadOptions._
-import org.apache.hudi.DataSourceWriteOptions._
-import org.apache.hudi.config.HoodieWriteConfig._
-import org.apache.spark.sql.types._
-import org.apache.spark.sql.Row
-
-
-val tableName = "hudi_cow"
-val basePath = "/user/hive/warehouse/hudi_cow"
-
-val schema = StructType(Array(
-StructField("rowId", StringType,true),
-StructField("partitionId", StringType,true),
-StructField("preComb", LongType,true),
-StructField("name", StringType,true),
-StructField("versionId", StringType,true),
-StructField("toBeDeletedStr", StringType,true),
-StructField("intToLong", IntegerType,true),
-StructField("longToInt", LongType,true)
-))
-
-val data0 = Seq(Row("row_1",
"2021/01/01",0L,"bob","v_0","toBeDel0",0,1000000L),
- Row("row_2",
"2021/01/01",0L,"john","v_0","toBeDel0",0,1000000L),
- Row("row_3", "2021/01/02",0L,"tom","v_0","toBeDel0",0,1000000L))
-
-var dfFromData0 = spark.createDataFrame(data0,schema)
-
-dfFromData0.write.format("hudi").
- options(getQuickstartWriteConfigs).
- option("hoodie.datasource.write.precombine.field", "preComb").
- option("hoodie.datasource.write.recordkey.field", "rowId").
- option("hoodie.datasource.write.partitionpath.field", "partitionId").
- option("hoodie.table.name", tableName).
- option("hoodie.datasource.write.table.type", 'COPY_ON_WRITE').
- option("hoodie.datasource.write.operation", "upsert").
- option("hoodie.index.type","SIMPLE").
- option("hoodie.datasource.write.hive_style_partitioning","true").
-
option("hoodie.datasource.hive_sync.jdbcurl","jdbc:hive2://hiveserver:10000/").
- option("hoodie.datasource.hive_sync.database","default").
- option("hoodie.datasource.hive_sync.table","hudi_cow").
- option("hoodie.datasource.hive_sync.partition_fields","partitionId").
- option("hoodie.datasource.hive_sync.enable","true").
- mode(Overwrite).
- save(basePath)
-```
-
-To query, connect to the hive server.
-
-```
-beeline -u jdbc:hive2://hiveserver:10000 \
- --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
- --hiveconf hive.stats.autogather=false
-
-Beeline version 1.2.1.spark2 by Apache Hive
-0: jdbc:hive2://hiveserver:10000> show tables;
-+-----------+--+
-| tab_name |
-+-----------+--+
-| hudi_cow |
-+-----------+--+
-1 row selected (0.531 seconds)
-0: jdbc:hive2://hiveserver:10000> select * from hudi_cow limit 1;
-+-------------------------------+--------------------------------+------------------------------+----------------------------------+----------------------------------------------------------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+--+
-| hudi_cow._hoodie_commit_time | hudi_cow._hoodie_commit_seqno |
hudi_cow._hoodie_record_key | hudi_cow._hoodie_partition_path |
hudi_cow._hoodie_file_name | hudi_cow.rowid
| hudi_cow.precomb | hudi_cow.name | hudi_cow.versionid |
hudi_cow.tobedeletedstr | hudi_cow.inttolong | hudi_cow.longtoint |
hudi_cow.partitionid |
-+-------------------------------+--------------------------------+------------------------------+----------------------------------+----------------------------------------------------------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+--+
-| 20220120090023631 | 20220120090023631_1_2 | row_1
| partitionId=2021/01/01 |
0bf9b822-928f-4a57-950a-6a5450319c83-0_1-24-314_20220120090023631.parquet |
row_1 | 0 | bob | v_0 |
toBeDel0 | 0 | 1000000 |
2021/01/01 |
-+-------------------------------+--------------------------------+------------------------------+----------------------------------+----------------------------------------------------------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+--+
-1 row selected (5.475 seconds)
-0: jdbc:hive2://hiveserver:10000>
-```