This is an automated email from the ASF dual-hosted git repository.
bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new c57f72b12df [DOCS] Added blog for out of box key generators in hudi
(#12645)
c57f72b12df is described below
commit c57f72b12df4fb5d5ca9fe526fc17fa68c6cddc9
Author: Aditya Goenka <[email protected]>
AuthorDate: Tue Jan 28 19:45:47 2025 +0530
[DOCS] Added blog for out of box key generators in hudi (#12645)
* Added blog for key generators
* Refractored some content
* Update 2025-01-15-outofbox-key-generators-in-hudi.mdx
---
.../2024-11-19-automated-small-file-handling.md | 4 +-
.../2025-01-15-outofbox-key-generators-in-hudi.mdx | 151 +++++++++++++++++++++
2 files changed, 153 insertions(+), 2 deletions(-)
diff --git a/website/blog/2024-11-19-automated-small-file-handling.md
b/website/blog/2024-11-19-automated-small-file-handling.md
index 0280bd332d5..c49c7064768 100644
--- a/website/blog/2024-11-19-automated-small-file-handling.md
+++ b/website/blog/2024-11-19-automated-small-file-handling.md
@@ -80,10 +80,10 @@ We can refer this blog for in-depth details of the
functionality - https://hudi
We use following configs to configure this -
- * __Hoodie.parquet.max.file.size (Default 128 MB)__
+ * __hoodie.parquet.max.file.size (Default 128 MB)__
This setting specifies the target size, in bytes, for Parquet files generated
during Hudi write phases. The writer will attempt to create files that approach
this target size. For example, if an existing file is 80 MB, the writer will
allocate only 40 MB to that particular file group.
- * __Hoodie.parquet.small.file.limit (Default 100 MB)__
+ * __hoodie.parquet.small.file.limit (Default 100 MB)__
This setting defines the maximum file size for a data file to be classified as
a small file. Files below this threshold are considered small files, prompting
the system to allocate additional records to their respective file groups in
subsequent write phases.
* __hoodie.copyonwrite.record.size.estimate (Default 1024)__
diff --git a/website/blog/2025-01-15-outofbox-key-generators-in-hudi.mdx
b/website/blog/2025-01-15-outofbox-key-generators-in-hudi.mdx
new file mode 100644
index 00000000000..2752e7722f3
--- /dev/null
+++ b/website/blog/2025-01-15-outofbox-key-generators-in-hudi.mdx
@@ -0,0 +1,151 @@
+---
+title: "Out of the box Key Generators in Apache Hudi"
+excerpt: "Explains the need for key generators and the out-of-the-box key
generators in Apache Hudi"
+author: Aditya Goenka
+category: blog
+image:
/assets/images/blog/2024-06-07-apache-hudi-a-deep-dive-with-python-code-examples.png
+tags:
+- Data Lake
+- Data Lakehouse
+- Apache Hudi
+- Key Generators
+- partition
+---
+
+## Introduction
+The goal of Apache Hudi is to bring database-like features to data lakes. This
addresses the main shortcoming of traditional data lakes: the inability to
easily perform row-level updates or deletions. By integrating database-like
management capabilities into data lakes, Hudi changes how they handle and
process large volumes of data, enabling out-of-the-box upserts and deletes
for efficient record-level updates and deletions.
+One of Hudi's key innovations is the ability for users to explicitly define a
Record Key, similar to a unique key in traditional databases, along with a
Partition Key that aligns with the data lake paradigm. These two keys make the
[HoodieKey](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieKey.java)
which is similar to the primary key which uniquely defines [...]
+In this blog, we will explore the concept of Key Generators in Apache Hudi,
how they enhance data management, and their role in enabling efficient data
operations in modern data lakes.
+
+## Challenge
+The biggest challenge in defining the record key and partition key on a table
is that the columns in the input data often do not naturally lend themselves to
being used directly as a primary key or partition key. In the realm of
databases, we often encounter the following cases -
+- Multiple fields together serve as the primary key, commonly known as a
composite key.
+- The data must be preprocessed to derive a specific field that can
serve as a primary key before loading it into the database.
+- Sometimes unique ids have to be generated. A common use case is a surrogate
key.
+
+Similarly, for partition columns in data lakes, most of the time the raw
field can’t be used as a partition key.
+- Partition columns often have a coarser time grain, such as month or year
level, but the input data mostly contains timestamps and dates.
+- Nested primary keys are very common, and necessitate multiple partition
columns.
+
+## Approaches to Handling this in Data Pipelines
+Data Lake and Lakehouse technologies typically address such scenarios by
preprocessing the data. For example, if date-based partitioning is required and
a timestamp column is available, the data must be processed using Spark SQL
date functions to extract relevant components (e.g., year, month, day). These
derived columns are then used for partitioning. However, this process can
become cumbersome at scale, especially when multiple data streams are writing
to the same Hudi table. The same [...]
+Hudi addresses these challenges with a built-in solution: key generators.
These can be configured at the table level, eliminating the need to repeatedly
apply the same logic. With key generators, Hudi automatically handles the
conversion process every time, ensuring consistency and reducing the risk of
errors.
+
+## What are Key Generators in Apache Hudi
+[Key generators](https://hudi.apache.org/docs/key_generation) in Apache Hudi
are essential components responsible for creating record keys and partition
keys for records within a dataset. Hudi uses key generators to extract the
HoodieKey, which is a combination of the record key and the partition key,
from the incoming record fields. This process allows Hudi to efficiently
prepare the HoodieKey on which updates can occur. During upserts, Hudi
identifies the file group that contain [...]
+Hudi offers several built-in key generator implementations that cover common
use cases, such as generating record keys based on fields from the input data.
However, to provide flexibility and support for more complex use cases, Hudi
also offers a pluggable interface. This allows users to implement custom key
generators tailored to their specific requirements.
+To create a custom key generator, you can extend the
[BaseKeyGenerator](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/keygen/BaseKeyGenerator.java)
class, which itself extends the
[KeyGenerator](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/keygen/KeyGenerator.java)
class, and implement methods such as getRecordKey and getPartitionPath. This
enables you to define the specific logic required for calculating record [...]
+The key generator is configured at the table level and stored in the
hoodie.properties file, which resides within the .hoodie directory. This file
contains all the table-level configurations, including the key generation
settings. The key generator is set using the configuration
hoodie.datasource.write.keygenerator.class. Once a table is created with a
particular key generator, it can’t be changed.
+
+## Out of the Box Key Generators
+
+### SimpleKeyGenerator
+The SimpleKeyGenerator is a basic key generator used in Apache Hudi when
direct fields from the input dataset can serve as both the record key and
partition key. It maps a specific column in the DataFrame to the record key and
another column to the partition path. This widely-used generator interprets
values as-is from the DataFrame and converts them to strings, making it ideal
for straightforward data structures.
+Please note that this is the default key generator for the partitioned
datasets.
+```shell
+{
+ "hoodie.datasource.write.recordkey.field": "id",
+ "hoodie.datasource.write.partitionpath.field": "date",
+ "hoodie.datasource.write.keygenerator.class":
"org.apache.hudi.keygen.SimpleKeyGenerator"
+}
+```
+
+### NonpartitionedKeyGenerator
+The NonpartitionedKeyGenerator is a key generator in Apache Hudi designed
specifically for non-partitioned datasets. Unlike the SimpleKeyGenerator, which
uses a field to determine the partition path for the data, the
NonpartitionedKeyGenerator does not assign a partition key to the records.
Instead, it returns an empty string as the partition key for all records. This
is because the dataset is non-partitioned, meaning all records are stored in a
single partition.
+```shell
+{
+ "hoodie.datasource.write.recordkey.field": "id",
+ "hoodie.datasource.write.keygenerator.class":
"org.apache.hudi.keygen.NonpartitionedKeyGenerator"
+}
+```
+
+### ComplexKeyGenerator
+This key generator is used when multiple fields are used to create the record
key or partition key. The columns are provided as a comma-separated list.
In the output, the hoodie record key is generated using the format
key1:value1,key2:value2. If either the partition key or the record key contains
multiple fields, then we have to use ComplexKeyGenerator.
+```shell
+{
+ "hoodie.datasource.write.keygenerator.class":
"org.apache.hudi.keygen.ComplexKeyGenerator",
+ "hoodie.datasource.write.recordkey.field": "key1,key2",
+ "hoodie.datasource.write.partitionpath.field": "country,state,city"
+}
+```
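
The key format described above can be illustrated with a small Python sketch. This is not Hudi's implementation: the function names `complex_record_key` and `complex_partition_path` are hypothetical, joining partition values with `/` is an assumption about the resulting path layout, and handling of null or empty values is omitted.

```python
def complex_record_key(record: dict, fields: list) -> str:
    """Build a composite key in the form field1:value1,field2:value2."""
    return ",".join(f"{name}:{record[name]}" for name in fields)


def complex_partition_path(record: dict, fields: list) -> str:
    """Join partition values into a nested path (assumed '/' separator)."""
    return "/".join(str(record[name]) for name in fields)


record = {"key1": 101, "key2": "abc", "country": "US", "state": "CA", "city": "SF"}
record_key = complex_record_key(record, ["key1", "key2"])
partition_path = complex_partition_path(record, ["country", "state", "city"])
print(record_key)
print(partition_path)
```

Keeping the field name inside the record key makes the composite key self-describing, at the cost of a longer key string.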
+
+### TimestampBasedKeyGenerator
+The TimestampBasedKeyGenerator allows you to generate partition keys based on
timestamp fields in your data. This is especially useful when you want to
partition your data by date, month, or year, depending on your use case. The
key generator can transform timestamps into different formats, enabling you to
create partitions that suit your analytical needs.
+
+Relevant Configurations
+* __hoodie.datasource.write.keygenerator.class__
+To use this key generator, the key generator class should be
`org.apache.hudi.keygen.TimestampBasedKeyGenerator`
+
+* __hoodie.deltastreamer.keygen.timebased.timestamp.type__
+This config determines the type of the input value. The possible values
are -
+
+ - DATE_STRING: Use this when the input value is in string format.
+
+ - MIXED: This option allows for a combination of formats.
+
+ - UNIX_TIMESTAMP: Select this when the input value is in epoch timestamp
format (long type) measured in seconds.
+
+ - EPOCHMILLISECONDS: Use this when the input value is in epoch timestamp
format (long type) measured in milliseconds.
+
+ - SCALAR: This option is for epoch timestamp values (long type) where you
can specify any time unit.
+
+
+* __hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit__
+When using the SCALAR timestamp type, you can define the unit of the epoch
time. Valid options include NANOSECONDS, MICROSECONDS, MILLISECONDS, SECONDS,
MINUTES, HOURS, and DAYS.
+
+* __hoodie.keygen.timebased.input.dateformat__
+When the timestamp type is DATE_STRING or MIXED, this config specifies the
date format of the incoming field.
+
+* __hoodie.keygen.timebased.output.dateformat__
+When the timestamp type is set to DATE_STRING or MIXED, this configuration
defines the desired date format for the output field. It allows you to specify
how the date should be formatted when it is generated or output.
+
+* __hoodie.deltastreamer.keygen.timebased.input.timezone__
+This setting specifies the timezone for the input date field derived from the
raw data. The default value is UTC.
+
+* __hoodie.deltastreamer.keygen.timebased.output.timezone__
+This setting defines the timezone for the output date field that will be used
to populate the partition column. The default value is UTC.
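
How the epoch-based timestamp types differ can be sketched in plain Python. This is an illustration of the unit handling only, not Hudi code: `epoch_to_partition` and `MILLIS_PER_UNIT` are hypothetical names, only a subset of the time units is covered, the output timezone is fixed to UTC, and the format string uses Python's `strftime` syntax rather than the Java-style patterns Hudi expects.

```python
from datetime import datetime, timezone

# Milliseconds per unit for a subset of the SCALAR time units listed above.
MILLIS_PER_UNIT = {
    "MILLISECONDS": 1,
    "SECONDS": 1_000,
    "MINUTES": 60_000,
    "HOURS": 3_600_000,
    "DAYS": 86_400_000,
}


def epoch_to_partition(value, timestamp_type, unit="SECONDS", output_format="%Y-%m-%d"):
    """Normalize an epoch value to milliseconds, then format it in UTC."""
    if timestamp_type == "UNIX_TIMESTAMP":
        millis = value * 1_000  # input is in seconds
    elif timestamp_type == "EPOCHMILLISECONDS":
        millis = value  # input is already in milliseconds
    elif timestamp_type == "SCALAR":
        millis = value * MILLIS_PER_UNIT[unit]  # unit chosen by the caller
    else:
        raise ValueError(f"not an epoch-based timestamp type: {timestamp_type}")
    return datetime.fromtimestamp(millis / 1_000, tz=timezone.utc).strftime(output_format)


print(epoch_to_partition(1738022400, "UNIX_TIMESTAMP"))
print(epoch_to_partition(1738022400000, "EPOCHMILLISECONDS"))
print(epoch_to_partition(20116, "SCALAR", unit="DAYS"))
```

All three calls describe the same instant, so they yield the same partition value; only the unit of the raw input differs.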
+
+#### Common Use Cases
+- Data Contains Timestamp Field and We Want Date Level Partitions
+In this scenario, you have a dataset with a timestamp field, and you want to
partition the data by the date (i.e., year-month-day).
+```shell
+{
+ "hoodie.datasource.write.keygenerator.class":
"org.apache.hudi.keygen.TimestampBasedKeyGenerator",
+ "hoodie.deltastreamer.keygen.timebased.timestamp.type": "DATE_STRING",
+ "hoodie.keygen.timebased.input.dateformat":"yyyy-MM-dd'T'HH:mm:ss.SSSSSSZ",
+ "hoodie.keygen.timebased.output.dateformat":"yyyy-MM-dd",
+ "hoodie.datasource.write.partitionpath.field": "event_time"
+}
+```
+
+- Data Contains Date Field but We Want to Have Month or Year Level Partitions
+Here, you have a dataset with a date field, but you want to create partitions
at a higher granularity, such as by month or year.
+```shell
+{
+ "hoodie.datasource.write.keygenerator.class":
"org.apache.hudi.keygen.TimestampBasedKeyGenerator",
+ "hoodie.deltastreamer.keygen.timebased.timestamp.type": "DATE_STRING",
+ "hoodie.keygen.timebased.input.dateformat":"yyyy-MM-dd",
+ "hoodie.keygen.timebased.output.dateformat":"yyyyMM",
+ "hoodie.datasource.write.partitionpath.field": "event_date"
+}
+```
+
+In the example above, if we have an input with a date column named event_date
in the format 'yyyy-MM-dd', the configurations will convert this format to a
monthly level in the format 'yyyyMM' and use it as the partition column.
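
The conversion itself is a parse followed by a re-format. A minimal Python sketch of that step, assuming the Java patterns 'yyyy-MM-dd' and 'yyyyMM' correspond to the `strftime` patterns `%Y-%m-%d` and `%Y%m` (the helper name `to_month_partition` is hypothetical):

```python
from datetime import datetime


def to_month_partition(value: str,
                       input_format: str = "%Y-%m-%d",
                       output_format: str = "%Y%m") -> str:
    """Parse a date string and re-format it at month granularity."""
    return datetime.strptime(value, input_format).strftime(output_format)


partition = to_month_partition("2025-01-28")
print(partition)  # 202501
```

Every date within the same month maps to the same value, so all of those records land in a single partition.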
+
+Refer to
[TimestampBasedKeyGenerator](https://hudi.apache.org/docs/0.10.0/key_generation/#timestampbasedkeygenerator)
for more examples.
+
+### CustomKeyGenerator
+In typical use cases, using the same key generator for both the record key and
the partition key often does not meet the requirements. For such scenarios, a
Custom Key Generator is particularly useful, as it allows for the use of
different key generators for different fields.
+A common use case arises when the partition key consists of multiple fields,
and you also need to extract date or month-level partitions from a timestamp
field. In these situations, it is essential to utilize both the
TimestampBasedKeyGenerator and the ComplexKeyGenerator. However, since you
cannot specify two different key generator classes simultaneously, the
CustomKeyGenerator serves as an effective solution. We can configure it as a
list of comma-separated fields with the key generator [...]
+When we pass the partition column, we can also specify which key generator to
use. The configurations below enable you to use SimpleKeyGenerator to extract
the country field and TimestampBasedKeyGenerator to transform the event_date
field to use only month level partitions.
+```shell
+{
+ "hoodie.datasource.write.keygenerator.class":
"org.apache.hudi.keygen.CustomKeyGenerator",
+ "hoodie.deltastreamer.keygen.timebased.timestamp.type": "DATE_STRING",
+ "hoodie.keygen.timebased.input.dateformat":"yyyy-MM-dd",
+ "hoodie.keygen.timebased.output.dateformat":"yyyyMM",
+ "hoodie.datasource.write.partitionpath.field":
"country:SIMPLE,event_date:TIMESTAMP"
+}
+```
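
The per-field dispatch implied by a value such as `country:SIMPLE,event_date:TIMESTAMP` can be sketched as follows. This is a simplified illustration, not Hudi's parser: `custom_partition_path` is a hypothetical helper, the `/` separator between partition levels is an assumption, and the date formats use Python's `strftime` syntax in place of the Java patterns from the configuration.

```python
from datetime import datetime


def custom_partition_path(record: dict, spec: str,
                          input_format: str = "%Y-%m-%d",
                          output_format: str = "%Y%m") -> str:
    """Apply a per-field strategy (SIMPLE or TIMESTAMP) to each partition field."""
    parts = []
    for entry in spec.split(","):
        field, kind = entry.strip().split(":")
        value = record[field]
        if kind == "SIMPLE":
            # Use the field value as-is.
            parts.append(str(value))
        elif kind == "TIMESTAMP":
            # Re-format the date string at the configured granularity.
            parts.append(datetime.strptime(value, input_format).strftime(output_format))
        else:
            raise ValueError(f"unsupported key type: {kind}")
    return "/".join(parts)


record = {"country": "US", "event_date": "2025-01-28"}
path = custom_partition_path(record, "country:SIMPLE,event_date:TIMESTAMP")
print(path)
```

Tagging each field with its own strategy is what lets one configuration mix pass-through columns with time-derived ones.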
+
+## Conclusion
+Key generators in Hudi are vital components that enable efficient record
identification, partitioning, and data operations in large datasets. Whether
you're performing upserts, deletes, or managing time-series data, choosing the
right key generator ensures that Hudi can handle the data efficiently, while
aligning with your business logic. By addressing challenges like composite
keys, timestamp-based partitioning, and complex use cases, Apache Hudi
revolutionizes how data lakes handle evo [...]