This is an automated email from the ASF dual-hosted git repository.
vinoyang pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 1148e18 [HUDI-1617] Adding a blog for KeyGenerators in Apache Hudi
(#2575)
1148e18 is described below
commit 1148e188ac761eb54f7af4c0547cdc9ba7a5e569
Author: Sivabalan Narayanan <[email protected]>
AuthorDate: Sat Feb 20 06:53:13 2021 -0500
[HUDI-1617] Adding a blog for KeyGenerators in Apache Hudi (#2575)
---
docs/_posts/2021-02-13-hudi-key-generators.md | 189 ++++++++++++++++++++++++++
1 file changed, 189 insertions(+)
diff --git a/docs/_posts/2021-02-13-hudi-key-generators.md
b/docs/_posts/2021-02-13-hudi-key-generators.md
new file mode 100644
index 0000000..7b69e70
--- /dev/null
+++ b/docs/_posts/2021-02-13-hudi-key-generators.md
@@ -0,0 +1,189 @@
+---
+title: "Apache Hudi Key Generators"
+excerpt: "Different key generators available with Apache Hudi"
+author: sivabalan
+category: blog
+---
+
+Every record in Hudi is uniquely identified by a HoodieKey, which is a pair of
the record key and the partition path
+to which the record belongs. Hudi imposes this constraint so that updates and
deletes can be applied to the right record.
+Hudi relies on the partition path field to partition your dataset, and records
within a partition have unique record keys.
+Since uniqueness is guaranteed only within a partition, records in different
partitions can share the same record key.
+Choose the partition field carefully, as it can be a determining factor for
your ingestion and
+query latency.
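+
+As a minimal sketch of the idea (plain Python, not Hudi's actual classes; the `HoodieKeySketch` type and the sample values are made up for illustration), a key can be modelled as a pair, so the same record key in two partitions yields two distinct keys:
+
```python
from typing import NamedTuple


class HoodieKeySketch(NamedTuple):
    """Illustrative stand-in for a HoodieKey: record key plus partition path."""
    record_key: str
    partition_path: str


# Same record key in two different partitions: two distinct keys.
key_a = HoodieKeySketch("id-1", "2021/02/13")
key_b = HoodieKeySketch("id-1", "2021/02/14")
keys_differ = key_a != key_b

# Same record key within one partition: the keys are equal, which is what
# lets an update or delete be matched to the existing record.
same_record = HoodieKeySketch("id-1", "2021/02/13") == key_a
```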
+
+## Key Generators
+
+Hudi exposes a number of out-of-the-box key generators that users can pick
based on their needs. Users can also plug in their
+own implementation of the KeyGenerator interface. This blog goes over the
key generators that are readily
+available to use.
+
+[Here](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenerator.java)
+is the interface for KeyGenerator in Hudi for your reference.
+
+Before diving into different types of key generators, let’s go over some of
the common configs required to be set for
+key generators.
+
+| Config | Meaning/purpose|
+| ------------- |:-------------:|
+| ```hoodie.datasource.write.recordkey.field``` | Refers to record key
field. This is a mandatory field. |
+| ```hoodie.datasource.write.partitionpath.field``` | Refers to partition
path field. This is a mandatory field. |
+| ```hoodie.datasource.write.keygenerator.class``` | Refers to Key generator
class(including full path). Could refer to any of the available ones or user
defined one. This is a mandatory field. |
+| ```hoodie.datasource.write.partitionpath.urlencode```| When set to true,
partition path will be url encoded. Default value is false. |
+| ```hoodie.datasource.write.hive_style_partitioning```| When set to true,
uses hive style partitioning. Partition field name will be prefixed to the
value. Format: “<partition_path_field_name>=<partition_path_value>”. Default
value is false.|
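+
+To make the table concrete, here is a sketch of how these configs might be collected for a Spark datasource write. The config keys come from the table above; the column names `uuid` and `region` are hypothetical, and other required write options (such as the table name) are left out.
+
```python
# Hypothetical column names; replace them with columns from your own dataframe.
hudi_options = {
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "region",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.SimpleKeyGenerator",
    "hoodie.datasource.write.partitionpath.urlencode": "false",
    "hoodie.datasource.write.hive_style_partitioning": "true",
}

# With a Spark dataframe `df`, the options would typically be passed along as:
#   df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```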
+
+There are a few more configs involved if you use
TimestampBasedKeyGenerator; those are covered in the respective section.
+
+Let's go over the different key generators available in Hudi.
+
+###
[SimpleKeyGenerator](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/SimpleKeyGenerator.java)
+
+The record key refers to one field (a column in the dataframe) by name, and the
partition path refers to one field (a single column in the dataframe)
+by name. This is one of the most commonly used key generators. Values are taken
as is from the dataframe and converted to string.
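+
+The behaviour described above can be sketched in a few lines of plain Python. This is an illustration, not Hudi's implementation; `simple_key` and the sample row are hypothetical, and the hive-style branch follows the `<partition_path_field_name>=<partition_path_value>` format from the config table.
+
```python
from typing import Any, Dict, Tuple


def simple_key(record: Dict[str, Any], record_key_field: str,
               partition_field: str, hive_style: bool = False) -> Tuple[str, str]:
    """Return (record key, partition path), each taken from a single field."""
    record_key = str(record[record_key_field])
    partition_path = str(record[partition_field])
    if hive_style:
        # Hive-style partitioning prefixes the field name to the value.
        partition_path = f"{partition_field}={partition_path}"
    return record_key, partition_path


row = {"uuid": 42, "region": "emea", "amount": 9.5}
plain = simple_key(row, "uuid", "region")
hive = simple_key(row, "uuid", "region", hive_style=True)
```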
+
+###
[ComplexKeyGenerator](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/ComplexKeyGenerator.java)
+Both the record key and the partition path comprise one or more fields,
referenced by name (a combination of multiple fields). Fields
+are expected to be comma separated in the config value. For example:
```"hoodie.datasource.write.recordkey.field" : "col1,col4"```
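+
+A rough sketch of how a comma-separated field list could be turned into a key. The output layout used here (`field:value` pairs joined by commas for the record key, values joined by `/` for the partition path) is an assumption of this sketch; check the linked source for the exact format. The function names and sample row are hypothetical.
+
```python
from typing import Any, Dict, List, Tuple


def split_fields(config_value: str) -> List[str]:
    """Split a comma-separated config value such as "col1,col4"."""
    return [name.strip() for name in config_value.split(",") if name.strip()]


def complex_key(record: Dict[str, Any], record_key_fields: str,
                partition_fields: str) -> Tuple[str, str]:
    keys = split_fields(record_key_fields)
    parts = split_fields(partition_fields)
    # Assumed layout: "field:value" pairs for the record key,
    # "/"-joined values for the partition path.
    record_key = ",".join(f"{k}:{record[k]}" for k in keys)
    partition_path = "/".join(str(record[p]) for p in parts)
    return record_key, partition_path


row = {"col1": "a", "col2": "x", "col4": 7, "country": "us", "city": "sf"}
key, path = complex_key(row, "col1,col4", "country,city")
```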
+
+###
[GlobalDeleteKeyGenerator](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java)
+Deletes through a global index do not require the partition value, so this key
generator does not use the partition value when generating the HoodieKey.
+
+###
[TimestampBasedKeyGenerator](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/TimestampBasedKeyGenerator.java)
+This key generator relies on timestamps for the partition field. The field
values are interpreted as timestamps
+and not just converted to string while generating the partition path value for
records. The record key is chosen by field name, as with the
+previous key generators. Users are expected to set a few more configs to use this
KeyGenerator.
+
+Configs to be set:
+
+| Config | Meaning/purpose |
+| ------------- | -------------|
+| ```hoodie.deltastreamer.keygen.timebased.timestamp.type``` | One of the
timestamp types supported(UNIX_TIMESTAMP, DATE_STRING, MIXED,
EPOCHMILLISECONDS, SCALAR) |
+| ```hoodie.deltastreamer.keygen.timebased.output.dateformat```| Output date
format |
+| ```hoodie.deltastreamer.keygen.timebased.timezone```| Timezone of the data
format|
+| ```hoodie.deltastreamer.keygen.timebased.input.dateformat```| Input date
format |
+
+Let's go over some example values for TimestampBasedKeyGenerator.
+
+#### Timestamp is GMT
+
+| Config field | Value |
+| ------------- | -------------|
+|```hoodie.deltastreamer.keygen.timebased.timestamp.type```|
"EPOCHMILLISECONDS"|
+|```hoodie.deltastreamer.keygen.timebased.output.dateformat``` | "yyyy-MM-dd
hh" |
+|```hoodie.deltastreamer.keygen.timebased.timezone```| "GMT+8:00" |
+
+Input Field value: “1578283932000L” <br/>
+Partition path generated from key generator: “2020-01-06 12”
+
+If input field value is null for some rows. <br/>
+Partition path generated from key generator: “1970-01-01 08”
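+
+The arithmetic behind this example can be reproduced with Python's standard library. This is only a sketch of the conversion, not Hudi code; the function name is hypothetical, and the null handling simply mirrors the example above by falling back to the epoch.
+
```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def epoch_millis_to_partition(value_ms: Optional[int], utc_offset_hours: int) -> str:
    """Format epoch milliseconds like the Java pattern "yyyy-MM-dd hh"."""
    # A null input falls back to the epoch, as in the example above.
    millis = 0 if value_ms is None else value_ms
    tz = timezone(timedelta(hours=utc_offset_hours))
    moment = datetime.fromtimestamp(millis / 1000, tz=tz)
    # Java "hh" is the 12-hour clock hour, which is "%I" in strftime.
    return moment.strftime("%Y-%m-%d %I")


from_value = epoch_millis_to_partition(1578283932000, 8)
from_null = epoch_millis_to_partition(None, 8)
```
+
+Note that `hh` is a 12-hour field, so a morning hour and the matching afternoon hour produce the same value; `HH` gives the 24-hour form.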
+
+#### Timestamp is DATE_STRING
+
+| Config field | Value |
+| ------------- | -------------|
+|```hoodie.deltastreamer.keygen.timebased.timestamp.type```| "DATE_STRING" |
+|```hoodie.deltastreamer.keygen.timebased.output.dateformat```| "yyyy-MM-dd
hh" |
+|```hoodie.deltastreamer.keygen.timebased.timezone```| "GMT+8:00" |
+|```hoodie.deltastreamer.keygen.timebased.input.dateformat```| "yyyy-MM-dd
hh:mm:ss" |
+
+Input field value: “2020-01-06 12:12:12” <br/>
+Partition path generated from key generator: “2020-01-06 12”
+
+If input field value is null for some rows. <br/>
+Partition path generated from key generator: “1970-01-01 12:00:00”
+<br/>
+
+#### Scalar examples
+
+| Config field | Value |
+| ------------- | -------------|
+|```hoodie.deltastreamer.keygen.timebased.timestamp.type```| "SCALAR"|
+|```hoodie.deltastreamer.keygen.timebased.output.dateformat```| "yyyy-MM-dd
hh" |
+|```hoodie.deltastreamer.keygen.timebased.timezone```| "GMT" |
+|```hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit```|
"days" |
+
+Input field value: “20000L” <br/>
+Partition path generated from key generator: “2024-10-04 12”
+
+If input field value is null. <br/>
+Partition path generated from key generator: “1970-01-02 12”
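+
+For the non-null case, the scalar conversion amounts to multiplying the value by the time unit and adding it to the epoch. A small sketch, assuming only the units listed in the dictionary (the names here are hypothetical, not Hudi's):
+
```python
from datetime import datetime, timedelta, timezone

# Seconds per unit; only a few units are covered in this sketch.
SECONDS_PER_UNIT = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}


def scalar_to_partition(value: int, time_unit: str) -> str:
    """Interpret `value` as a count of `time_unit` since the epoch (GMT)."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    moment = epoch + timedelta(seconds=value * SECONDS_PER_UNIT[time_unit])
    # "%I" matches the 12-hour "hh" of the output format, so midnight prints as 12.
    return moment.strftime("%Y-%m-%d %I")


partition = scalar_to_partition(20000, "days")
```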
+
+#### ISO8601WithMsZ with Single Input format
+
+| Config field | Value |
+| ------------- | -------------|
+|```hoodie.deltastreamer.keygen.timebased.timestamp.type```| "DATE_STRING"|
+|```hoodie.deltastreamer.keygen.timebased.input.dateformat```|
"yyyy-MM-dd'T'HH:mm:ss.SSSZ" |
+|```hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex```|
"" |
+|```hoodie.deltastreamer.keygen.timebased.input.timezone```| "" |
+|```hoodie.deltastreamer.keygen.timebased.output.dateformat```| "yyyyMMddHH" |
+|```hoodie.deltastreamer.keygen.timebased.output.timezone```| "GMT" |
+
+Input field value: "2020-04-01T13:01:33.428Z" <br/>
+Partition path generated from key generator: "2020040113"
+
+#### ISO8601WithMsZ with Multiple Input formats
+
+| Config field | Value |
+| ------------- | -------------|
+|```hoodie.deltastreamer.keygen.timebased.timestamp.type```| "DATE_STRING"|
+|```hoodie.deltastreamer.keygen.timebased.input.dateformat```|
"yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ" |
+|```hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex```|
"" |
+|```hoodie.deltastreamer.keygen.timebased.input.timezone```| "" |
+|```hoodie.deltastreamer.keygen.timebased.output.dateformat```| "yyyyMMddHH" |
+|```hoodie.deltastreamer.keygen.timebased.output.timezone```| "UTC" |
+
+Input field value: "2020-04-01T13:01:33.428Z" <br/>
+Partition path generated from key generator: "2020040113"
+
+#### ISO8601NoMs with offset using multiple input formats
+
+| Config field | Value |
+| ------------- | -------------|
+|```hoodie.deltastreamer.keygen.timebased.timestamp.type```| "DATE_STRING"|
+|```hoodie.deltastreamer.keygen.timebased.input.dateformat```|
"yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ" |
+|```hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex```|
"" |
+|```hoodie.deltastreamer.keygen.timebased.input.timezone```| "" |
+|```hoodie.deltastreamer.keygen.timebased.output.dateformat```| "yyyyMMddHH" |
+|```hoodie.deltastreamer.keygen.timebased.output.timezone```| "UTC" |
+
+Input field value: "2020-04-01T13:01:33-**05:00**" <br/>
+Partition path generated from key generator: "2020040118"
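+
+The multiple-format behaviour can be approximated by trying each input format in turn and converting the result to the output timezone. The sketch below uses Python format strings as stand-ins for the two Java patterns (an assumption of this sketch) and a hypothetical function name:
+
```python
from datetime import datetime, timezone

# Python stand-ins for "yyyy-MM-dd'T'HH:mm:ssZ" and "yyyy-MM-dd'T'HH:mm:ss.SSSZ".
INPUT_FORMATS = ["%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M:%S.%f%z"]


def date_string_to_partition(value: str) -> str:
    """Parse with the first matching format, then format in UTC as yyyyMMddHH."""
    for fmt in INPUT_FORMATS:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next input format
        return parsed.astimezone(timezone.utc).strftime("%Y%m%d%H")
    raise ValueError(f"no input format matches {value!r}")


with_offset = date_string_to_partition("2020-04-01T13:01:33-05:00")
with_millis = date_string_to_partition("2020-04-01T13:01:33.428Z")
```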
+
+#### Input as short date string and expect date in date format
+
+| Config field | Value |
+| ------------- | -------------|
+|```hoodie.deltastreamer.keygen.timebased.timestamp.type```| "DATE_STRING"|
+|```hoodie.deltastreamer.keygen.timebased.input.dateformat```|
"yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ,yyyyMMdd" |
+|```hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex```|
"" |
+|```hoodie.deltastreamer.keygen.timebased.input.timezone```| "UTC" |
+|```hoodie.deltastreamer.keygen.timebased.output.dateformat```| "MM/dd/yyyy" |
+|```hoodie.deltastreamer.keygen.timebased.output.timezone```| "UTC" |
+
+Input field value: "20200401" <br/>
+Partition path generated from key generator: "04/01/2020"
+
+###
[CustomKeyGenerator](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/CustomKeyGenerator.java)
+This is a generic implementation of KeyGenerator that lets users
combine the benefits of SimpleKeyGenerator,
+ComplexKeyGenerator and TimestampBasedKeyGenerator at the same time. One
can configure the record key and partition
+path as a single field or a combination of fields. This key generator is
particularly useful if you want to define
+complex partition paths involving regular fields and timestamp-based fields.
It expects the value of the property ```"hoodie.datasource.write.partitionpath.field"```
+in a specific format:
"field1:PartitionKeyType1,field2:PartitionKeyType2..."
+
+The complete partition path is created as
+```<value for field1 based on PartitionKeyType1>/<value for field2 based on
PartitionKeyType2>```
+and so on. Each partition key type can be either SIMPLE or TIMESTAMP.
+
+Example config value: ```“field_3:simple,field_5:timestamp”```
+
+The record key config value is either a single field, as with SimpleKeyGenerator, or
a comma-separated list of field names, as with ComplexKeyGenerator.
+E.g., "col1" or "col3,col4".
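+
+A sketch of how a partition path spec in the `field:PartitionKeyType` format could be parsed and applied. It is plain Python rather than Hudi code; the function names, the sample row, and the `to_day` formatter are all hypothetical, and the timestamp formatting is passed in instead of being read from the timestamp configs.
+
```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List, Tuple


def parse_partition_spec(spec: str) -> List[Tuple[str, str]]:
    """Parse "field1:TYPE1,field2:TYPE2" into [(field, TYPE), ...]."""
    pairs = []
    for item in spec.split(","):
        field, key_type = item.strip().split(":")
        key_type = key_type.upper()
        if key_type not in ("SIMPLE", "TIMESTAMP"):
            raise ValueError(f"unsupported partition key type: {key_type}")
        pairs.append((field, key_type))
    return pairs


def custom_partition_path(record: Dict[str, Any], spec: str,
                          format_timestamp: Callable[[Any], str]) -> str:
    parts = []
    for field, key_type in parse_partition_spec(spec):
        value = record[field]
        # TIMESTAMP fields go through the formatter; SIMPLE fields are stringified.
        parts.append(format_timestamp(value) if key_type == "TIMESTAMP" else str(value))
    return "/".join(parts)


def to_day(millis: int) -> str:
    """Hypothetical formatter: epoch milliseconds to a UTC day."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc).strftime("%Y-%m-%d")


row = {"field_3": "emea", "field_5": 1578283932000}
path = custom_partition_path(row, "field_3:simple,field_5:timestamp", to_day)
```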
+
+###
[NonpartitionedKeyGenerator](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java)
+If your Hudi dataset is not partitioned, you can use
NonpartitionedKeyGenerator, which returns an empty
+partition for all records. In other words, all records go to the same
partition (the empty string "").
+
+Hope this blog gave you a good understanding of the different types of key
generators available in Apache Hudi. Thanks for your continued support of
the Hudi community.
+