nsivabalan commented on a change in pull request #2575:
URL: https://github.com/apache/hudi/pull/2575#discussion_r575720309



##########
File path: docs/_posts/2021-02-13-hudi-key-generators.md
##########
@@ -0,0 +1,139 @@
+---
+title: "Apache Hudi Key Generators"
+excerpt: "Different key generators available with Apache Hudi"
+author: sivabalan, pratyaksh
+category: blog
+---
+
+Every record in Hudi is uniquely identified by a HoodieKey, which is a pair of a record key and the partition path to which the record belongs.
+Hudi imposes this constraint so that updates and deletes can be applied to the record of interest.
+Hudi relies on the partition path field to partition your dataset, and records within a partition have unique record keys.
+Since uniqueness is guaranteed only within a partition, records in different partitions can share the same record key.
+Choose the partition field wisely, as it can be a determining factor for your ingestion and query latency.
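As a minimal sketch of the idea (not Hudi's actual `HoodieKey` class), the hypothetical Python dataclass below models a key as a (record key, partition path) pair. It shows why the same record key can appear in two partitions without colliding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HoodieKeySketch:
    """Hypothetical stand-in for a HoodieKey: (record key, partition path)."""
    record_key: str
    partition_path: str


# Same record key in two different partitions: two distinct keys.
a = HoodieKeySketch("id-1", "2020/01/06")
b = HoodieKeySketch("id-1", "2020/01/07")
# Same record key in the same partition: identifies the same record.
c = HoodieKeySketch("id-1", "2020/01/06")

distinct_keys = {a, b, c}
```

Here `distinct_keys` holds two entries, because `a` and `c` refer to the same record while `b` lives in another partition.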
+
+## Key Generators
+
+Hudi exposes a number of out-of-the-box key generators that customers can use based on their needs. Alternatively, they can provide their
+own implementation of KeyGenerator. This blog goes over the different types of key generators that are readily
+available to use.
+
+[Here](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenerator.java)
+is the interface for KeyGenerator in Hudi for your reference.
+
+Before diving into different types of key generators, let’s go over some of 
the common configs required to be set for 
+key generators.
+
+| Config        | Meaning/purpose|
+| ------------- |:-------------:|
+| ```hoodie.datasource.write.recordkey.field```     | Refers to record key field. This is a mandatory field. |
+| ```hoodie.datasource.write.partitionpath.field```     | Refers to partition path field. This is a mandatory field. |
+| ```hoodie.datasource.write.keygenerator.class```| Refers to Key generator class (including full path). Could refer to any of the available ones or a user defined one. This is a mandatory field. |
+| ```hoodie.datasource.write.partitionpath.urlencode```| When set to true, partition path will be url encoded. Default value is false. |
+| ```hoodie.datasource.write.hive_style_partitioning```| When set to true, uses hive style partitioning. Partition field name will be prefixed to the value. Format: “<partition_path_field_name>=<partition_path_value>”. Default value is false.|
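The configs in the table can be collected into a plain options dictionary, as sketched below. The column names `uuid` and `region` are hypothetical placeholders, and the `hive_style` helper is only an illustration of the format described in the table, not a Hudi function.

```python
# Column names ("uuid", "region") are hypothetical placeholders.
hudi_key_options = {
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "region",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.SimpleKeyGenerator",
    "hoodie.datasource.write.partitionpath.urlencode": "false",
    "hoodie.datasource.write.hive_style_partitioning": "true",
}


def hive_style(field_name: str, value: str) -> str:
    """Illustrates the documented format: <field_name>=<value>."""
    return f"{field_name}={value}"
```

With PySpark, a dictionary like this is typically passed to a DataFrame writer via `.options(**hudi_key_options)`, alongside the other write options a Hudi table needs.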
+
+There are a few more configs involved if you are looking to use TimestampBasedKeyGenerator. Those are covered in the respective section.
+
+Let's go over the different key generators available for use with Hudi.
+
+### [SimpleKeyGenerator](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/SimpleKeyGenerator.java)
+
+Record key refers to one field (a column in the dataframe) by name, and partition path refers to one field (a single column in the dataframe)
+by name. This is one of the most commonly used key generators. Values are taken as is from the dataframe and converted to strings.
+
+### [ComplexKeyGenerator](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/ComplexKeyGenerator.java)
+Both record key and partition path comprise one or more fields by name (a combination of multiple fields). Fields
+are expected to be comma separated in the config value. For example ```"hoodie.datasource.write.recordkey.field" : "col1,col4"```
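The sketch below illustrates how a comma-separated config value could be turned into a composite key. It is not the ComplexKeyGenerator implementation: the `name:value` joining for the record key and the `/` joining for the partition path are assumptions made for illustration, and the function names are hypothetical. Check the linked source for the exact output format.

```python
from typing import Dict, List


def split_fields(config_value: str) -> List[str]:
    """Split a comma-separated config value such as "col1,col4"."""
    return [f.strip() for f in config_value.split(",") if f.strip()]


def complex_record_key(row: Dict[str, object], config_value: str) -> str:
    # Assumed format: "field:value" pairs joined by commas.
    return ",".join(f"{f}:{row[f]}" for f in split_fields(config_value))


def complex_partition_path(row: Dict[str, object], config_value: str) -> str:
    # Assumed format: field values joined by "/".
    return "/".join(str(row[f]) for f in split_fields(config_value))


row = {"col1": "a", "col4": 7, "country": "in", "city": "blr"}
record_key = complex_record_key(row, "col1,col4")
partition_path = complex_partition_path(row, "country,city")
```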
+
+### [GlobalDeleteKeyGenerator](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java)
+Global index deletes do not require a partition value, so this key generator avoids using the partition value when generating the HoodieKey.
+
+### [TimestampBasedKeyGenerator](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/TimestampBasedKeyGenerator.java)
+This key generator relies on timestamps for the partition field. The field values are interpreted as timestamps
+rather than simply converted to strings when generating the partition path value for records. The record key is chosen by
+field name, as with the previous key generators. Users are expected to set a few more configs to use this KeyGenerator.
+
+Configs to be set:
+
+| Config        | Meaning/purpose |
+| ------------- | -------------|
+| ```hoodie.deltastreamer.keygen.timebased.timestamp.type```    | One of the timestamp types supported (UNIX_TIMESTAMP, DATE_STRING, MIXED, EPOCHMILLISECONDS, SCALAR) |
+| ```hoodie.deltastreamer.keygen.timebased.output.dateformat```| Output date format |
+| ```hoodie.deltastreamer.keygen.timebased.timezone```| Timezone of the data format|
+| ```hoodie.deltastreamer.keygen.timebased.input.dateformat```| Input date format |
+
+Let's go over some example values for TimestampBasedKeyGenerator.
+
+<br/>
+// Timestamp is GMT
+
+| Config field | Value |
+| ------------- | -------------|
+|```hoodie.deltastreamer.keygen.timebased.timestamp.type```| "EPOCHMILLISECONDS"|
+|```hoodie.deltastreamer.keygen.timebased.output.dateformat``` | "yyyy-MM-dd hh" |
+|```hoodie.deltastreamer.keygen.timebased.timezone```| "GMT+8:00" |
+
+Input field value: “1578283932000L”
+Partition path generated from key generator: “2020-01-06 12”
+
+If the input field value is null for some rows:
+Partition path generated from key generator: “1970-01-01 08”
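The arithmetic behind this example can be checked with a short Python sketch. This is not the TimestampBasedKeyGenerator code: the function name is hypothetical, the `strftime` pattern `%Y-%m-%d %I` is used as an approximation of the Java pattern `yyyy-MM-dd hh` (12-hour clock), and the fallback to epoch 0 for null input is an assumption inferred from the null example above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)
GMT_PLUS_8 = timezone(timedelta(hours=8))


def partition_path_from_epoch_millis(value: Optional[int]) -> str:
    # Assumption: a null input falls back to epoch 0.
    millis = 0 if value is None else value
    instant = EPOCH_UTC + timedelta(milliseconds=millis)
    # Render in the configured timezone; %I is the 12-hour clock hour.
    return instant.astimezone(GMT_PLUS_8).strftime("%Y-%m-%d %I")


path = partition_path_from_epoch_millis(1578283932000)
null_path = partition_path_from_epoch_millis(None)
```

The input corresponds to 2020-01-06 04:12:12 GMT, which is 12:12:12 in GMT+8:00, so the hour renders as 12.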
+
+<br/>
+// Timestamp is DATE_STRING
+
+| Config field | Value |
+| ------------- | -------------|
+|```hoodie.deltastreamer.keygen.timebased.timestamp.type```|  "DATE_STRING"  |
+|```hoodie.deltastreamer.keygen.timebased.output.dateformat```|  "yyyy-MM-dd hh" |
+|```hoodie.deltastreamer.keygen.timebased.timezone```|  "GMT+8:00" |
+|```hoodie.deltastreamer.keygen.timebased.input.dateformat```|  "yyyy-MM-dd hh:mm:ss" |
+
+Input field value: “2020-01-06 12:12:12”
+Partition path generated from key generator: “2020-01-06 12”
+
+If the input field value is null for some rows:
+Partition path generated from key generator: “1970-01-01 12:00:00”
+<br/>
+<br/>
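For the non-null DATE_STRING case, a rough Python equivalent is to parse with the input format and re-render with the output format. The function name is hypothetical, and the `strptime`/`strftime` patterns are approximations of the Java patterns in the table (`hh` mapped to `%I`, the 12-hour clock). Null handling is deliberately left out of this sketch.

```python
from datetime import datetime

INPUT_FORMAT = "%Y-%m-%d %I:%M:%S"   # approximates "yyyy-MM-dd hh:mm:ss"
OUTPUT_FORMAT = "%Y-%m-%d %I"        # approximates "yyyy-MM-dd hh"


def partition_path_from_date_string(value: str) -> str:
    # Parsing and formatting happen in the same timezone here,
    # so no offset conversion is needed for this example.
    parsed = datetime.strptime(value, INPUT_FORMAT)
    return parsed.strftime(OUTPUT_FORMAT)


path = partition_path_from_date_string("2020-01-06 12:12:12")
```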
+
+// Scalar examples
+
+| Config field | Value |
+| ------------- | -------------|
+|```hoodie.deltastreamer.keygen.timebased.timestamp.type```| "SCALAR"|
+|```hoodie.deltastreamer.keygen.timebased.output.dateformat```| "yyyy-MM-dd hh" |
+|```hoodie.deltastreamer.keygen.timebased.timezone```| "GMT" |
+|```hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit```| "days" |
+
+Input field value: “20000L”
+Partition path generated from key generator: “2024-10-04 12”
+
+If the input field value is null:
+Partition path generated from key generator: “1970-01-02 12”
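The scalar example can be verified the same way: 20000 days after the epoch lands on 2024-10-04 at midnight GMT, which a 12-hour clock renders as hour 12. The sketch below uses a hypothetical function name, approximates the Java pattern `yyyy-MM-dd hh` with `%Y-%m-%d %I`, and assumes a null input falls back to 1 unit, which is inferred from the null example above rather than taken from the Hudi source.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

EPOCH_GMT = datetime(1970, 1, 1, tzinfo=timezone.utc)


def partition_path_from_scalar_days(value: Optional[int]) -> str:
    # Assumption: a null input falls back to a scalar value of 1.
    days = 1 if value is None else value
    instant = EPOCH_GMT + timedelta(days=days)
    # Midnight renders as "12" on a 12-hour clock.
    return instant.strftime("%Y-%m-%d %I")


path = partition_path_from_scalar_days(20000)
null_path = partition_path_from_scalar_days(None)
```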
+
+// More to be filled in.

Review comment:
       @pratyakshsharma : if you can fill in these details, we can open it up 
for review.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

