This is an automated email from the ASF dual-hosted git repository.
bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 3e85f4d80e [DOCS] Add image assets and fix blog post styles (#5613)
3e85f4d80e is described below
commit 3e85f4d80ee8fb5a364e67de56edc066ca2329e3
Author: Bhavani Sudha Saktheeswaran <[email protected]>
AuthorDate: Tue May 17 15:37:25 2022 -0700
[DOCS] Add image assets and fix blog post styles (#5613)
Co-authored-by: Bhavani Sudha Saktheeswaran <[email protected]>
---
...-efficient-migration-of-large-parquet-tables.md | 22 ++++++++---------
...2020-08-21-async-compaction-deployment-model.md | 14 +++++------
...gh-perf-data-lake-with-hudi-and-alluxio-t3go.md | 26 ++++++++++-----------
website/blog/2021-01-27-hudi-clustering-intro.md | 20 ++++++++--------
website/blog/2021-03-01-hudi-file-sizing.md | 6 ++---
website/blog/2021-08-18-virtual-keys.md | 18 +++++++-------
...se-concurrency-control-are-we-too-optimistic.md | 9 ++++---
...hudi-zorder-and-hilbert-space-filling-curves.md | 10 ++++----
...atures-from-Apache-Hudi-0.9.0-on-Amazon-EMR.mdx | 1 +
...efficiency-at-scale-in-big-data-file-format.png | Bin 0 -> 38755 bytes
...2022-02-02-onehouse-commitment-to-openness.jpeg | Bin 0 -> 386047 bytes
.../images/blog/2022-02-03-onehouse_billboard.png | Bin 0 -> 554823 bytes
.../2022-02-17-fresher-data-lake-on-aws-s3.png | Bin 0 -> 96170 bytes
...1-low-latency-pipeline-using-msk-flink-hudi.png | Bin 0 -> 40488 bytes
...3-09-serverless-pipeline-using-glue-hudi-s3.png | Bin 0 -> 142433 bytes
.../2022-04-04-halodoc-lakehouse-architecture.png | Bin 0 -> 251301 bytes
.../images/blog/2022-05-17-multimodal-index.gif | Bin 0 -> 607295 bytes
17 files changed, 63 insertions(+), 63 deletions(-)
diff --git
a/website/blog/2020-08-20-efficient-migration-of-large-parquet-tables.md
b/website/blog/2020-08-20-efficient-migration-of-large-parquet-tables.md
index cd959ce28f..75144dc907 100644
--- a/website/blog/2020-08-20-efficient-migration-of-large-parquet-tables.md
+++ b/website/blog/2020-08-20-efficient-migration-of-large-parquet-tables.md
@@ -8,14 +8,14 @@ category: blog
We will look at how to migrate a large parquet table to Hudi without having to
rewrite the entire dataset.
<!--truncate-->
-# Motivation:
+## Motivation:
Apache Hudi maintains per record metadata to perform core operations such as
upserts and incremental pull. To take advantage of Hudi’s upsert and
incremental processing support, users would need to rewrite their whole dataset
to make it an Apache Hudi table. Hudi 0.6.0 comes with an ***experimental
feature*** to support efficient migration of large Parquet tables to Hudi
without the need to rewrite the entire dataset.
-# High Level Idea:
+## High Level Idea:
-## Per Record Metadata:
+### Per Record Metadata:
Apache Hudi maintains record-level metadata to perform efficient upserts and
incremental pull.
@@ -31,11 +31,11 @@ The parts (1) and (3) constitute what we term as “Hudi
skeleton”. Hudi skel

-# Design Deep Dive:
+## Design Deep Dive:
For a deep dive on the internals, please take a look at the [RFC
document](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi)
-# Migration:
+## Migration:
Hudi supports 2 modes when migrating parquet tables. We will use the term
bootstrap and migration interchangeably in this document.
@@ -45,10 +45,10 @@ Hudi supports 2 modes when migrating parquet tables. We
will use the term boots
You can pick and choose these modes at the partition level. A common
strategy is to use FULL_RECORD mode for a small set of "hot" partitions
which are accessed more frequently, and METADATA_ONLY for a larger set of "warm"
partitions.
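The per-partition selection that the regex configs below describe can be sketched in a few lines. This is an illustrative sketch only (the class and method names are hypothetical, not Hudi's actual `BootstrapRegexModeSelector` implementation): partitions matching `hoodie.bootstrap.mode.selector.regex` receive the mode from `hoodie.bootstrap.mode.selector.regex.mode`, and every other partition falls back to the other mode.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class BootstrapModeSelector {

    // Partitions matching the regex get regexMode; all others get defaultMode.
    // Mirrors the behavior described by hoodie.bootstrap.mode.selector.regex
    // and hoodie.bootstrap.mode.selector.regex.mode (illustrative only).
    public static Map<String, String> selectModes(List<String> partitions,
                                                  String regex,
                                                  String regexMode,
                                                  String defaultMode) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, String> modes = new LinkedHashMap<>();
        for (String partition : partitions) {
            modes.put(partition,
                pattern.matcher(partition).matches() ? regexMode : defaultMode);
        }
        return modes;
    }

    public static void main(String[] args) {
        // "Hot" 2020/08 partitions get FULL_RECORD; "warm" ones stay METADATA_ONLY.
        System.out.println(selectModes(
            Arrays.asList("2020/08/20", "2020/08/21", "2019/01/01"),
            "2020/08/.*", "FULL_RECORD", "METADATA_ONLY"));
    }
}
```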
-## Query Engine Support:
+### Query Engine Support:
For a METADATA_ONLY bootstrapped table, the Spark data source, Spark-Hive, and
native Hive query engines are supported. Presto support is in the works.
-## Ways To Migrate :
+### Ways To Migrate :
There are 2 ways to migrate a large parquet table to Hudi.
@@ -57,7 +57,7 @@ There are 2 ways to migrate a large parquet table to Hudi.
We will look at how to migrate using both these approaches.
-## Configurations:
+### Configurations:
These are bootstrap-specific configurations that need to be set in addition
to regular Hudi write configurations.
@@ -73,7 +73,7 @@ These are bootstrap specific configurations that needs to be
set in addition to
| hoodie.bootstrap.mode.selector.regex.mode |METADATA_ONLY |No |Bootstrap Mode
used when the partition matches the regex pattern in
hoodie.bootstrap.mode.selector.regex . Used only when
hoodie.bootstrap.mode.selector set to BootstrapRegexModeSelector. |
| hoodie.bootstrap.mode.selector.regex |\.\* |No |Partition Regex used when
hoodie.bootstrap.mode.selector set to BootstrapRegexModeSelector. |
-## Spark Data Source:
+### Spark Data Source:
Here, we use a Spark Datasource write to perform the bootstrap.
Below is an example code snippet to perform a METADATA_ONLY bootstrap.
@@ -127,7 +127,7 @@ bootstrapDF.write
.save(basePath)
```
-## Hoodie DeltaStreamer:
+### Hoodie DeltaStreamer:
Hoodie DeltaStreamer allows bootstrap to be performed using the --run-bootstrap
command line option.
@@ -170,6 +170,6 @@ spark-submit --package
org.apache.hudi:hudi-spark-bundle_2.11:0.6.0
--hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=METADATA_ONLY
```
-## Known Caveats
+### Known Caveats
1. Need proper defaults for the bootstrap config:
hoodie.bootstrap.full.input.provider. Here is the
[ticket](https://issues.apache.org/jira/browse/HUDI-1213)
1. DeltaStreamer manages checkpoints inside hoodie commit files and expects
checkpoints in previously committed metadata. Users are expected to pass a
checkpoint or an initial checkpoint provider when performing bootstrap through
DeltaStreamer. Such support is not present when doing bootstrap using the Spark
Datasource. Here is the
[ticket](https://issues.apache.org/jira/browse/HUDI-1214).
diff --git a/website/blog/2020-08-21-async-compaction-deployment-model.md
b/website/blog/2020-08-21-async-compaction-deployment-model.md
index 3ffa1b4508..5e6eec2657 100644
--- a/website/blog/2020-08-21-async-compaction-deployment-model.md
+++ b/website/blog/2020-08-21-async-compaction-deployment-model.md
@@ -7,7 +7,7 @@ category: blog
We will look at different deployment models for executing compactions
asynchronously.
<!--truncate-->
-# Compaction
+## Compaction
For Merge-On-Read tables, data is stored using a combination of columnar (e.g.
parquet) and row-based (e.g. avro) file formats.
Updates are logged to delta files & later compacted to produce new versions of
columnar files synchronously or
@@ -15,7 +15,7 @@ asynchronously. One of the main motivations behind
Merge-On-Read is to reduce dat
Hence, it makes sense to run compaction asynchronously without blocking
ingestion.
-# Async Compaction
+## Async Compaction
Async Compaction is performed in 2 steps:
@@ -24,11 +24,11 @@ slices** to be compacted. A compaction plan is finally
written to Hudi timeline.
1. ***Compaction Execution***: A separate process reads the compaction plan
and performs compaction of file slices.
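The two-step split above can be sketched with a toy model (hypothetical names, not Hudi's APIs): scheduling produces a plan from file slices, and a separate execution step consumes that plan.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AsyncCompactionSketch {

    // Step 1: Compaction Scheduling - select file slices whose pending delta
    // (log) file count crosses a threshold, yielding a "plan". (Hudi persists
    // the plan to the timeline; here we simply return it.)
    public static List<String> schedulePlan(Map<String, Integer> deltaFilesPerSlice,
                                            int minDeltaFiles) {
        List<String> plan = new ArrayList<>();
        for (Map.Entry<String, Integer> e : deltaFilesPerSlice.entrySet()) {
            if (e.getValue() >= minDeltaFiles) {
                plan.add(e.getKey());
            }
        }
        return plan;
    }

    // Step 2: Compaction Execution - a separate process reads the plan and
    // rewrites each planned slice into a new columnar base file (simulated).
    public static List<String> executePlan(List<String> plan) {
        List<String> newBaseFiles = new ArrayList<>();
        for (String fileSlice : plan) {
            newBaseFiles.add(fileSlice + "_compacted.parquet");
        }
        return newBaseFiles;
    }

    public static void main(String[] args) {
        Map<String, Integer> slices = new LinkedHashMap<>();
        slices.put("slice-1", 1);
        slices.put("slice-2", 4);
        // Only slice-2 crosses the threshold and gets compacted.
        System.out.println(executePlan(schedulePlan(slices, 3)));
    }
}
```

Because the plan is a standalone artifact on the timeline, the executor can run in any of the deployment models discussed next.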
-# Deployment Models
+## Deployment Models
There are a few ways by which we can execute compactions asynchronously.
-## Spark Structured Streaming
+### Spark Structured Streaming
With 0.6.0, we now have support for running async compactions in Spark
Structured Streaming jobs. Compactions are scheduled and executed
asynchronously inside the
@@ -60,7 +60,7 @@ import org.apache.spark.sql.streaming.ProcessingTime;
writer.trigger(new ProcessingTime(30000)).start(tablePath);
```
-## DeltaStreamer Continuous Mode
+### DeltaStreamer Continuous Mode
Hudi DeltaStreamer provides a continuous ingestion mode where a single long-running
Spark application
ingests data to a Hudi table continuously from upstream sources. In this mode,
Hudi supports managing asynchronous
compactions. Here is an example snippet for running in continuous mode with
async compactions:
@@ -78,7 +78,7 @@ spark-submit --packages
org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
--continuous
```
-## Hudi CLI
+### Hudi CLI
Hudi CLI is yet another way to execute specific compactions asynchronously.
Here is an example
```properties
@@ -86,7 +86,7 @@ hudi:trips->compaction run --tableName <table_name>
--parallelism <parallelism>
...
```
-## Hudi Compactor Script
+### Hudi Compactor Script
Hudi provides a standalone tool to also execute specific compactions
asynchronously. Here is an example
```properties
diff --git
a/website/blog/2020-12-01-high-perf-data-lake-with-hudi-and-alluxio-t3go.md
b/website/blog/2020-12-01-high-perf-data-lake-with-hudi-and-alluxio-t3go.md
index a58b2e585e..75b87a3fb8 100644
--- a/website/blog/2020-12-01-high-perf-data-lake-with-hudi-and-alluxio-t3go.md
+++ b/website/blog/2020-12-01-high-perf-data-lake-with-hudi-and-alluxio-t3go.md
@@ -5,18 +5,18 @@ author: t3go
category: blog
---
-# Building High-Performance Data Lake Using Apache Hudi and Alluxio at T3Go
+## Building High-Performance Data Lake Using Apache Hudi and Alluxio at T3Go
[T3Go](https://www.t3go.cn/) is China’s first platform for smart travel based
on the Internet of Vehicles. In this article, Trevor Zhang and Vino Yang from
T3Go describe the evolution of their data lake architecture, built on
cloud-native or open-source technologies including Alibaba OSS, Apache Hudi,
and Alluxio. Today, their data lake stores petabytes of data, supporting
hundreds of pipelines and tens of thousands of tasks daily. It is essential for
business units at T3Go including Da [...]
In this blog, you will see how we slashed data ingestion time by half using
Hudi and Alluxio. Furthermore, data analysts using Presto, Hudi, and Alluxio
saw the queries speed up by 10 times. We built our data lake based on data
orchestration for multiple stages of our data pipeline, including ingestion and
analytics.
<!--truncate-->
-# I. T3Go data lake Overview
+## I. T3Go data lake Overview
Prior to the data lake, different business units within T3Go managed their own
data processing solutions, utilizing different storage systems, ETL tools, and
data processing frameworks. Data for each became siloed from every other unit,
significantly increasing cost and complexity. Due to the rapid business
expansion of T3Go, this inefficiency became our engineering bottleneck.
We moved to a unified data lake solution based on Alibaba OSS, an object store
similar to AWS S3, to provide a centralized location to store structured and
unstructured data, following the design principles of _Multi-cluster
Shared-data Architecture_; all the applications access OSS storage as the
source of truth, as opposed to different data silos. This architecture allows
us to store the data as-is, without having to first structure the data, and run
different types of analytics to gu [...]
-# II. Efficient Near Real-time Analytics Using Hudi
+## II. Efficient Near Real-time Analytics Using Hudi
Our business in smart travel drives the need to process and analyze data in a
near real-time manner. With a traditional data warehouse, we faced the
following challenges:
@@ -31,21 +31,21 @@ As a result, we adopted Apache Hudi on top of OSS to
address these issues. The f

-## Enable Near real time data ingestion and analysis
+### Enable Near real time data ingestion and analysis
With Hudi, our data lake supports multiple data sources including Kafka, MySQL
binlog, GIS, and other business logs in near real time. As a result, more than
60% of the company’s data is stored in the data lake and this proportion
continues to increase.
We are also able to speed up the data ingestion time down to a few minutes by
introducing Apache Hudi into the data pipeline. Combined with big data
interactive query and analysis framework such as Presto and SparkSQL, real-time
data analysis and insights are achieved.
-## Enable Incremental processing pipeline
+### Enable Incremental processing pipeline
With the help of Hudi, it is possible to provide incremental changes to the
downstream derived table when the upstream table updates frequently. Even with
a large number of interdependent tables, we can quickly run partial data
updates. This also effectively avoids updating the full partitions of cold
tables in the traditional Hive data warehouse.
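The core idea behind incremental consumption can be sketched as a filter on commit times, which is conceptually what Hudi's incremental queries do with the `_hoodie_commit_time` meta field (the helper below is hypothetical, not a Hudi API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IncrementalPullSketch {

    // Keep only records whose commit time is later than the consumer's last
    // checkpoint, so a downstream derived table processes just the changes.
    public static List<String> pullSince(Map<String, String> commitTimeByRecord,
                                         String lastCheckpoint) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, String> e : commitTimeByRecord.entrySet()) {
            // Hudi commit times sort lexicographically (yyyyMMddHHmmss), so a
            // plain string comparison suffices for this illustration.
            if (e.getValue().compareTo(lastCheckpoint) > 0) {
                changed.add(e.getKey());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<String, String> commits = new LinkedHashMap<>();
        commits.put("order-1", "20201201093000");
        commits.put("order-2", "20201201103000");
        // Only order-2 changed after the checkpoint.
        System.out.println(pullSince(commits, "20201201100000"));
    }
}
```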
-## Accessing Data using Hudi as a unified format
+### Accessing Data using Hudi as a unified format
Traditional data warehouses often deploy Hadoop to store data and provide
batch analysis. Kafka is used separately to distribute Hadoop data to other
data processing frameworks, resulting in duplicated data. Hudi helps
effectively solve this problem; we always use Spark pipelines to insert new
updates into the Hudi tables, then incrementally read the update of Hudi
tables. In other words, Hudi tables are used as the unified storage format to
access data.
-# III. Efficient Data Caching Using Alluxio
+## III. Efficient Data Caching Using Alluxio
In the early version of our data lake without Alluxio, data received from
Kafka in real time was processed by Spark and then written to the OSS data lake
using Hudi DeltaStreamer tasks. With this architecture, Spark often suffered
high network latency when writing to OSS directly. Since all data is in OSS
storage, OLAP queries on Hudi data could also be slow due to lack of data
locality.
@@ -57,21 +57,21 @@ Data in formats such as Hudi, Parquet, ORC, and JSON are
stored mostly on OSS, c
Specifically, here are a few applications leveraging Alluxio in the T3Go data
lake.
-## Data lake ingestion
+### Data lake ingestion
We mount the corresponding OSS path to the Alluxio file system and set Hudi’s
`target-base-path` parameter value to use the alluxio:// scheme in place
of the oss:// scheme. Spark pipelines with Hudi continuously ingest data to
Alluxio. After data is written to Alluxio, it is asynchronously persisted from
the Alluxio cache to the remote OSS every minute. These modifications allow
Spark to write to a local Alluxio node instead of writing to remote OSS,
significantly reducing the time f [...]
-## Data analysis on the lake
+### Data analysis on the lake
We use Presto as an ad-hoc query engine to analyze the Hudi tables in the
lake, co-locating Alluxio workers on each Presto worker node. When Presto and
Alluxio services are co-located and running, Alluxio caches the input data
locally in the Presto worker which greatly benefits Presto for subsequent
retrievals. On a cache hit, Presto can read from the local Alluxio worker
storage at memory speed without any additional data transfer over the network.
-## Concurrent accesses across multiple storage systems
+### Concurrent accesses across multiple storage systems
In order to ensure the accuracy of training samples, our machine learning team
often synchronizes desensitized data in production to an offline machine
learning environment. During synchronization, the data flows across multiple
file systems, from production OSS to an offline HDFS followed by another
offline Machine Learning HDFS.
This data migration process is not only inefficient but also error-prone for
modelers, because multiple different storage systems with varying configurations are
involved.
storage systems under the same filesystem to be accessed by their corresponding
logical paths in Alluxio namespace. By decoupling the physical storage, this
allows applications with different APIs to access and transfer data seamlessly.
This data access layout [...]
-## Microbenchmark
+### Microbenchmark
Overall, we observed the following improvements with Alluxio:
@@ -89,12 +89,12 @@ In the stress test shown above, after the data volume is
greater than a certain
Based on our performance benchmarking, we found that the performance can be
improved by over 10 times with the help of Alluxio. Furthermore, the larger the
data scale, the more prominent the performance improvement.
-# IV. Next Step
+## IV. Next Step
As T3Go’s data lake ecosystem expands, we will continue facing the critical
scenario of compute and storage segregation. With T3Go’s growing data
processing needs, our team plans to deploy Alluxio on a larger scale to
accelerate our data lake storage.
In addition to the deployment of Alluxio on the data lake computing engine,
which currently is mainly SparkSQL, we plan to add a layer of Alluxio to the
OLAP cluster using Apache Kylin and an ad-hoc cluster using Presto. The goal is
to have Alluxio cover all computing scenarios, with Alluxio interconnected
across these scenarios to improve the read and write efficiency of the data lake
and the surrounding lake ecosystem.
-# V. Conclusion
+## V. Conclusion
As mentioned earlier, Hudi and Alluxio together cover all scenarios of Hudi’s near
real-time ingestion, near real-time analysis, incremental processing, and data
distribution on DFS, among many others, and play the role of a powerful
accelerator for data ingestion and data analysis on the lake. With Hudi and
Alluxio together, **our R&D engineers shortened the time for data ingestion
into the lake by up to a factor of 2. Data analysts using Presto, Hudi, and
Alluxio in conjunction to query data [...]
diff --git a/website/blog/2021-01-27-hudi-clustering-intro.md
b/website/blog/2021-01-27-hudi-clustering-intro.md
index 5f47ffe411..b55d41b33d 100644
--- a/website/blog/2021-01-27-hudi-clustering-intro.md
+++ b/website/blog/2021-01-27-hudi-clustering-intro.md
@@ -5,12 +5,12 @@ author: satish.kotha
category: blog
---
-# Background
+## Background
Apache Hudi brings stream processing to big data, providing fresh data while
being an order of magnitude more efficient than traditional batch processing. In a
data lake/warehouse, one of the key trade-offs is between ingestion speed and
query performance. Data ingestion typically prefers small files to improve
parallelism and make data available to queries as soon as possible. However,
query performance degrades significantly with a lot of small files. Also, during
ingestion, data is typically co-l [...]
<!--truncate-->
-# Clustering Architecture
+## Clustering Architecture
At a high level, Hudi provides different operations such as
insert/upsert/bulk_insert through its write client API to write
data to a Hudi table. To choose a trade-off between file size and
ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to
configure the smallest allowable file size. Users can set
the small file [soft
limit](https://hudi.apache.org/docs/configurations#compactionSmallFileSize) to
`0` to force new data [...]
@@ -22,13 +22,13 @@ Clustering table service can run asynchronously or
synchronously adding a new ac
-### Overall, there are 2 parts to clustering
+#### Overall, there are 2 parts to clustering
1. Scheduling clustering: Create a clustering plan using a pluggable
clustering strategy.
2. Execute clustering: Process the plan using an execution strategy to create
new files and replace old files.
-### Scheduling clustering
+#### Scheduling clustering
The following steps are used to schedule clustering.
@@ -37,7 +37,7 @@ Following steps are followed to schedule clustering.
3. Finally, the clustering plan is saved to the timeline in an avro [metadata
format](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieClusteringPlan.avsc).
-### Running clustering
+#### Running clustering
1. Read the clustering plan and get the ‘clusteringGroups’ that mark the file
groups that need to be clustered.
2. For each group, we instantiate appropriate strategy class with
strategyParams (example: sortColumns) and apply that strategy to rewrite the
data.
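Why rewriting data sorted by the strategy's `sortColumns` helps can be shown with a toy example (a hypothetical helper, not Hudi code): count how many fixed-size files contain a given session id before and after sorting by session.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ClusteringLocalitySketch {

    // Counts how many fixed-size "files" contain a given session id, to show
    // why sorting (clustering) by session shrinks the set of files a query
    // with a session predicate has to scan.
    public static int filesContaining(List<Integer> sessionIds, int fileSize, int target) {
        int files = 0;
        for (int start = 0; start < sessionIds.size(); start += fileSize) {
            Set<Integer> inFile = new HashSet<>(sessionIds.subList(
                start, Math.min(start + fileSize, sessionIds.size())));
            if (inFile.contains(target)) {
                files++;
            }
        }
        return files;
    }

    public static void main(String[] args) {
        // Arrival order interleaves sessions, so session 1 lands in every file.
        List<Integer> arrivalOrder = Arrays.asList(1, 2, 3, 1, 2, 3, 1, 2, 3);
        System.out.println(filesContaining(arrivalOrder, 3, 1)); // 3

        // After sorting by session, session 1 is confined to a single file.
        List<Integer> clustered = new ArrayList<>(arrivalOrder);
        Collections.sort(clustered);
        System.out.println(filesContaining(clustered, 3, 1)); // 1
    }
}
```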
@@ -51,7 +51,7 @@ NOTE: Clustering can only be scheduled for tables /
partitions not receiving any

_Figure: Illustrating query performance improvements by clustering_
-### Setting up clustering
+#### Setting up clustering
Inline clustering can be set up easily using Spark dataframe options. See the
sample below.
```scala
@@ -83,7 +83,7 @@ df.write.format("org.apache.hudi").
For more advanced use cases, an async clustering pipeline can also be set up. See
an example
[here](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-SetupforAsyncclusteringJob).
-# Table Query Performance
+## Table Query Performance
We created a dataset from one partition of a known production style table with
~20M records and on-disk size of ~200GB. The dataset has rows for multiple
“sessions”. Users always query this data using a predicate on session. Data for
a single session is spread across multiple data files because ingestion groups
data based on arrival time. The below experiment shows that by clustering on
session, we are able to improve the data locality and reduce query execution
time by more than 50%.
@@ -92,14 +92,14 @@ Query:
spark.sql("select * from table where session_id=123")
```
-## Before Clustering
+### Before Clustering
The query took 2.2 minutes to complete. Note that the number of output rows in the
“scan parquet” part of the query plan includes all 20M rows in the table.

_Figure: Spark SQL query details before clustering_
-## After Clustering
+### After Clustering
The query plan is similar to the one above. But, because of improved data locality and
predicate push down, Spark is able to prune a lot of rows. After clustering,
the same query only outputs 110K rows (out of 20M rows) while scanning parquet
files. This cuts query time to less than a minute, from 2.2 minutes.
@@ -118,7 +118,7 @@ Query runtime is reduced by 60% after clustering. Similar
results were observed
We expect dramatic speedup for large tables, where the query runtime is almost
entirely dominated by actual I/O and not query planning, unlike the example
above.
-# Summary
+## Summary
Using clustering, we can improve query performance by
1. Leveraging concepts such as [space filling
curves](https://en.wikipedia.org/wiki/Z-order_curve) to adapt data lake layout
and reduce the amount of data read during queries.
diff --git a/website/blog/2021-03-01-hudi-file-sizing.md
b/website/blog/2021-03-01-hudi-file-sizing.md
index 3d3049601d..463ee0259b 100644
--- a/website/blog/2021-03-01-hudi-file-sizing.md
+++ b/website/blog/2021-03-01-hudi-file-sizing.md
@@ -11,7 +11,7 @@ manual table maintenance. Having a lot of small files will
make it harder to ach
having to open/read/close files way too many times, to plan and execute
queries. But for streaming data lake use-cases,
ingests inherently end up having a smaller volume of writes, which
might result in a lot of small files if no special handling is done.
<!--truncate-->
-# During Write vs After Write
+## During Write vs After Write
Common approaches of writing very small files and then later stitching them
together solve the system scalability issues posed
by small files, but might violate query SLAs by exposing small files to queries. In fact, you can easily do so on a Hudi table,
@@ -25,7 +25,7 @@ Hudi has the ability to maintain a configured target file
size, when performing
(Note: bulk_insert operation does not provide this functionality and is
designed as a simpler replacement for
normal `spark.write.parquet`).
-## Configs
+### Configs
For illustration purposes, we are going to consider only the COPY_ON_WRITE table type.
@@ -41,7 +41,7 @@ would be considered a small file.
If you wish to turn off this feature, set the config value for soft file limit
to 0.
-## Example
+### Example
Let’s say this is the layout of data files for a given partition.
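The sizing decision itself can be sketched as follows. This is a simplified, hypothetical helper: files below the soft limit are candidates to receive new inserts until they reach the max file size (`hoodie.parquet.max.file.size` is assumed here as the companion max-size config; the real write-path routing is more involved).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FileSizingSketch {

    // Returns, for each "small" file (below the soft limit), how many more
    // bytes of new inserts it can absorb before reaching the max file size.
    // Mirrors the idea behind hoodie.parquet.small.file.limit (simplified).
    public static Map<String, Long> spareCapacity(Map<String, Long> fileSizes,
                                                  long smallFileLimit,
                                                  long maxFileSize) {
        Map<String, Long> capacity = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : fileSizes.entrySet()) {
            if (e.getValue() < smallFileLimit) {
                // A "small" file is bin-packed with inserts up to the max size.
                capacity.put(e.getKey(), maxFileSize - e.getValue());
            }
        }
        return capacity;
    }

    public static void main(String[] args) {
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("f1.parquet", 40L);   // MB, below a 100 MB soft limit
        files.put("f2.parquet", 110L);  // already above the limit
        System.out.println(spareCapacity(files, 100L, 120L)); // {f1.parquet=80}
    }
}
```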
diff --git a/website/blog/2021-08-18-virtual-keys.md
b/website/blog/2021-08-18-virtual-keys.md
index c1ce8b5b09..57e44da270 100644
--- a/website/blog/2021-08-18-virtual-keys.md
+++ b/website/blog/2021-08-18-virtual-keys.md
@@ -13,13 +13,13 @@ In addition, it ensures data quality by ensuring unique key
constraints are enfo
But one of the repeated asks from the community is to leverage existing fields
and not to add additional meta fields, for simple use-cases where such benefits
are not desired or key changes are very rare.
<!--truncate-->
-# Virtual Key support
+## Virtual Key support
Hudi now supports virtual keys, where Hudi meta fields can be computed on
demand from the data fields. Currently, the meta fields are
computed once and stored as per record metadata and re-used across various
operations. If one does not need incremental query support,
they can start leveraging Hudi's Virtual key support and still go about using
Hudi to build and manage their data lake to reduce the storage
overhead due to per record metadata.
-## Configurations
+### Configurations
Virtual keys can be enabled for a given table using the below config. When
`hoodie.populate.meta.fields=false` is set,
Hudi will use virtual keys for the corresponding table. The default value for this
config is `true`, which means all meta fields are added by default.
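A minimal sketch of what "computed on demand from the data fields" means for the record key (a hypothetical helper, mirroring what a SimpleKeyGenerator-style generator does rather than Hudi's actual class):

```java
import java.util.Map;

public class VirtualKeySketch {

    // With hoodie.populate.meta.fields=false, the record key is not stored in
    // a _hoodie_record_key meta column; it is derived on demand from a data
    // field, the way a simple key generator would. Illustrative only.
    public static String recordKey(Map<String, String> record, String keyField) {
        String value = record.get(keyField);
        if (value == null) {
            throw new IllegalArgumentException("Key field missing: " + keyField);
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> row = Map.of("uuid", "rider-42", "fare", "12.5");
        System.out.println(recordKey(row, "uuid")); // rider-42
    }
}
```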
@@ -36,24 +36,24 @@ would entail reading all fields out of base and delta logs,
sacrificing core col
for users. Thus, we support only simple key generators (the default key
generator, where both record key and partition path refer
to an existing field) for now.
-### Supported Key Generators with CopyOnWrite(COW) table:
+#### Supported Key Generators with CopyOnWrite(COW) table:
SimpleKeyGenerator, ComplexKeyGenerator, CustomKeyGenerator,
TimestampBasedKeyGenerator and NonPartitionedKeyGenerator.
-### Supported Key Generators with MergeOnRead(MOR) table:
+#### Supported Key Generators with MergeOnRead(MOR) table:
SimpleKeyGenerator
-### Supported Index types:
+#### Supported Index types:
Only "SIMPLE" and "GLOBAL_SIMPLE" index types are supported in the first cut.
We plan to add support for other index
types (BLOOM, etc.) in future releases.
-## Supported Operations
+### Supported Operations
All existing features are supported for a Hudi table with virtual keys, except
incremental
queries. This means cleaning, archiving, metadata table, clustering, etc. can
be enabled for a Hudi table with
virtual keys enabled. So, you can merely use Hudi as a transactional
table format with all the awesome
table service runtimes and platform services, if you wish to do so, without
incurring any overheads associated with
support for incremental data processing.
-## Sample Output
+### Sample Output
As called out earlier, one has to set `hoodie.populate.meta.fields=false` to
enable virtual keys. Let's see the
difference between records of a hudi table with and without virtual keys.
@@ -99,7 +99,7 @@ And here are some sample records for a hudi table with
virtual keys enabled.
As you can see, all meta fields are null in storage, but all user fields
remain intact, similar to a regular table.
:::
-## Incremental Queries
+### Incremental Queries
Since Hudi does not maintain any metadata (like commit time at a record level)
for a table with virtual keys enabled,
incremental queries are not supported. An exception will be thrown, as below,
when an incremental query is triggered for such
a table.
@@ -121,7 +121,7 @@ org.apache.hudi.exception.HoodieException: Incremental
queries are not supported
... 61 elided
```
-## Conclusion
+### Conclusion
Hope this blog was useful for you to learn yet another feature in Apache Hudi.
If you are interested in
Hudi and looking to contribute, do check out
[here](https://hudi.apache.org/contribute/get-involved).
diff --git
a/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
b/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
index f45ace997a..1072ace5f7 100644
---
a/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
+++
b/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
@@ -14,18 +14,17 @@ Having had the good fortune of working on diverse database
projects - an RDBMS (
First, let's set the record straight. RDBMS databases offer the richest set of
transactional capabilities and the widest array of concurrency control
[mechanisms](https://dev.mysql.com/doc/refman/5.7/en/innodb-locking-transaction-model.html).
Different isolation levels, fine grained locking, deadlock
detection/avoidance, and more are possible because they have to support
row-level mutations and reads across many tables while enforcing [key
constraints](https://dev.mysql.com/doc/refman/8. [...]
-# Pitfalls in Lake Concurrency Control
+### Pitfalls in Lake Concurrency Control
Historically, data lakes have been viewed as batch jobs reading/writing files
on cloud storage, and it's interesting to see how most new work extends this
view and implements glorified file version control using some form of
"[**Optimistic concurrency
control**](https://en.wikipedia.org/wiki/Optimistic_concurrency_control)"
(OCC). With OCC, jobs take a table-level lock to check if they have impacted
overlapping files and, if a conflict exists, they abort their operations
completely. Without [...]
Imagine a real-life scenario of two writer processes: an ingest writer job
producing new data every 30 minutes and a deletion writer job that is enforcing
GDPR, taking 2 hours to issue deletes. It's very likely for these to overlap
files with random deletes, and the deletion job is almost guaranteed to starve
and fail to commit each time. In database speak, mixing long-running
transactions with optimism leads to disappointment, since the longer the
transactions the higher the probabilit [...]

-
static/assets/images/blog/concurrency/ConcurrencyControlConflicts.png
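The table-level OCC check that starves the deletion job above can be reduced to a set intersection (an illustrative sketch, not Hudi's implementation):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class OccConflictSketch {

    // The commit-time check OCC performs, reduced to its core: if a concurrent
    // writer already committed changes to any file this writer also touched,
    // this writer must abort and redo all of its work.
    public static boolean canCommit(Set<String> committedFiles, Set<String> myFiles) {
        Set<String> overlap = new HashSet<>(myFiles);
        overlap.retainAll(committedFiles);
        return overlap.isEmpty();
    }

    public static void main(String[] args) {
        // The 30-minute ingest job commits first...
        Set<String> ingest = new HashSet<>(Arrays.asList("f1", "f2"));
        // ...so the 2-hour GDPR deletion, whose random deletes overlap f2,
        // fails its check and starves.
        Set<String> deletion = new HashSet<>(Arrays.asList("f2", "f9"));
        System.out.println(canCommit(ingest, deletion)); // false -> abort, retry
    }
}
```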
So, what's the alternative? Locking? Wikipedia also says - "_However,
locking-based ("pessimistic") methods also can deliver poor performance because
locking can drastically limit effective concurrency even when deadlocks are
avoided."._ Here is where Hudi takes a different approach, that we believe is
more apt for modern lake transactions which are typically long-running and even
continuous. Data lake workloads share more characteristics with high throughput
stream processing jobs, than [...]
-# Model 1 : Single Writer, Inline Table Services
+### Model 1 : Single Writer, Inline Table Services
The simplest form of concurrency control is just no concurrency at all. A data
lake table often has common services operating on it to ensure efficiency.
Reclaiming storage space from older versions and logs, coalescing files
(clustering in Hudi), merging deltas (compactions in Hudi), and more. Hudi can
simply eliminate the need for concurrency control and maximizes throughput by
supporting these table services out-of-box and running inline after every write
to the table.
@@ -33,13 +32,13 @@ Execution plans are idempotent, persisted to the timeline
and auto-recover from

-# Model 2 : Single Writer, Async Table Services
+### Model 2 : Single Writer, Async Table Services
Our delete/ingest example above isn't really that simple. While the ingest writer
may just be updating the last N partitions on the table, deletes may even span
the entire table. Mixing them in the same job could slow down ingest
latency by a lot. But Hudi provides the option of running the table services
in an async fashion, where most of the heavy lifting (e.g. actually rewriting
the columnar data by the compaction service) is done asynchronously, eliminating
any repeated wasteful retr [...]

-# Model 3 : Multiple Writers
+### Model 3 : Multiple Writers
But it's not always possible to serialize the deletes into the same write
stream, and sometimes SQL-based deletes are required. With multiple distributed processes,
some form of locking is inevitable, but like real databases, Hudi's concurrency
model is intelligent enough to differentiate actual writing to the table from
table services that manage or optimize the table. Hudi offers similar
optimistic concurrency control across multiple writers, but table services can
still execute completely lock-fr [...]
diff --git
a/website/blog/2021-12-29-hudi-zorder-and-hilbert-space-filling-curves.md
b/website/blog/2021-12-29-hudi-zorder-and-hilbert-space-filling-curves.md
index 2ebe72fea6..cda7b5c66e 100644
--- a/website/blog/2021-12-29-hudi-zorder-and-hilbert-space-filling-curves.md
+++ b/website/blog/2021-12-29-hudi-zorder-and-hilbert-space-filling-curves.md
@@ -10,7 +10,7 @@ As of Hudi v0.10.0, we are excited to introduce support for
an advanced Data Lay
<!--truncate-->
-## Background
+### Background
The Amazon EMR team recently published a [great
article](https://aws.amazon.com/blogs/big-data/new-features-from-apache-hudi-0-7-0-and-0-8-0-available-on-amazon-emr/)
showcasing how [clustering](https://hudi.apache.org/docs/clustering) your
data can improve your _query performance_.
@@ -71,7 +71,7 @@ In a similar fashion, Hilbert curves also allow you to map
points in a N-dimensi
Now, let's check it out in action!
-# Setup
+### Setup
We will use the [Amazon
Reviews](https://s3.amazonaws.com/amazon-reviews-pds/readme.html) dataset
again, but this time we will use Hudi to Z-order by the `product_id`, `customer_id`
column tuple instead of Clustering or _linear ordering_.
No special preparation is required for the dataset: you can simply download
it from [S3](https://s3.amazonaws.com/amazon-reviews-pds/readme.html) in
Parquet format and use it directly as input for Spark to ingest into a Hudi
table.
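
As a sketch of what the Z-ordering setup looks like, the write options below enable clustering with a space-filling-curve layout strategy. The key names follow the v0.10-era config surface and may have been renamed since, so take this as an illustration, not the exact snippet from the post:

```properties
# Illustrative: cluster the table and sort by a space-filling curve
hoodie.clustering.inline=true
hoodie.clustering.plan.strategy.sort.columns=product_id,customer_id
hoodie.layout.optimize.enable=true
hoodie.layout.optimize.strategy=z-order   # or "hilbert"
```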
@@ -150,7 +150,7 @@ df.write.format("hudi")
-# Testing
+### Testing
Please keep in mind that each individual test is run in a separate
spark-shell to avoid caching getting in the way of our measurements.
```scala
@@ -300,7 +300,7 @@ scala> runQuery3(dataSkippingSnapshotTableName)
+-----------+-----------+
```
-# Results
+### Results
We've summarized the measured performance metrics below:
| **Query** | **Baseline (B)** duration (files scanned / size) | **Linear
Sorting (S)** | **Z-order (Z)** duration (scanned) | **Hilbert (H)** duration
(scanned) |
@@ -315,6 +315,6 @@ Which is a very clear contrast with space-filling curves
(both Z-order and Hilbe
It's worth noting that the performance gains are heavily dependent on your
underlying data and queries. In benchmarks on our internal data we were able to
achieve query performance improvements of more than **11x**!
-# Epilogue
+### Epilogue
Apache Hudi v0.10 brings the new layout optimization capabilities, Z-order and
Hilbert curves, to open source. Using these industry-leading layout optimization
techniques can bring substantial performance improvements and cost savings to
your queries!
diff --git
a/website/blog/2022-04-04-New-features-from-Apache-Hudi-0.9.0-on-Amazon-EMR.mdx
b/website/blog/2022-04-04-New-features-from-Apache-Hudi-0.9.0-on-Amazon-EMR.mdx
index 9fa5b5abb9..538145b3a9 100644
---
a/website/blog/2022-04-04-New-features-from-Apache-Hudi-0.9.0-on-Amazon-EMR.mdx
+++
b/website/blog/2022-04-04-New-features-from-Apache-Hudi-0.9.0-on-Amazon-EMR.mdx
@@ -5,6 +5,7 @@ authors:
- name: Gabriele Cacciola
- name: Udit Mehrotra
category: blog
+image: /assets/images/powers/aws.jpg
---
import Redirect from '@site/src/components/Redirect';
diff --git
a/website/static/assets/images/blog/2022-01-25-cost-efficiency-at-scale-in-big-data-file-format.png
b/website/static/assets/images/blog/2022-01-25-cost-efficiency-at-scale-in-big-data-file-format.png
new file mode 100644
index 0000000000..34e818702f
Binary files /dev/null and
b/website/static/assets/images/blog/2022-01-25-cost-efficiency-at-scale-in-big-data-file-format.png
differ
diff --git
a/website/static/assets/images/blog/2022-02-02-onehouse-commitment-to-openness.jpeg
b/website/static/assets/images/blog/2022-02-02-onehouse-commitment-to-openness.jpeg
new file mode 100644
index 0000000000..a836cf7365
Binary files /dev/null and
b/website/static/assets/images/blog/2022-02-02-onehouse-commitment-to-openness.jpeg
differ
diff --git
a/website/static/assets/images/blog/2022-02-03-onehouse_billboard.png
b/website/static/assets/images/blog/2022-02-03-onehouse_billboard.png
new file mode 100644
index 0000000000..86f44ee020
Binary files /dev/null and
b/website/static/assets/images/blog/2022-02-03-onehouse_billboard.png differ
diff --git
a/website/static/assets/images/blog/2022-02-17-fresher-data-lake-on-aws-s3.png
b/website/static/assets/images/blog/2022-02-17-fresher-data-lake-on-aws-s3.png
new file mode 100644
index 0000000000..624264dacb
Binary files /dev/null and
b/website/static/assets/images/blog/2022-02-17-fresher-data-lake-on-aws-s3.png
differ
diff --git
a/website/static/assets/images/blog/2022-03-01-low-latency-pipeline-using-msk-flink-hudi.png
b/website/static/assets/images/blog/2022-03-01-low-latency-pipeline-using-msk-flink-hudi.png
new file mode 100644
index 0000000000..9e95594741
Binary files /dev/null and
b/website/static/assets/images/blog/2022-03-01-low-latency-pipeline-using-msk-flink-hudi.png
differ
diff --git
a/website/static/assets/images/blog/2022-03-09-serverless-pipeline-using-glue-hudi-s3.png
b/website/static/assets/images/blog/2022-03-09-serverless-pipeline-using-glue-hudi-s3.png
new file mode 100644
index 0000000000..118839e543
Binary files /dev/null and
b/website/static/assets/images/blog/2022-03-09-serverless-pipeline-using-glue-hudi-s3.png
differ
diff --git
a/website/static/assets/images/blog/2022-04-04-halodoc-lakehouse-architecture.png
b/website/static/assets/images/blog/2022-04-04-halodoc-lakehouse-architecture.png
new file mode 100644
index 0000000000..134fb77f64
Binary files /dev/null and
b/website/static/assets/images/blog/2022-04-04-halodoc-lakehouse-architecture.png
differ
diff --git a/website/static/assets/images/blog/2022-05-17-multimodal-index.gif
b/website/static/assets/images/blog/2022-05-17-multimodal-index.gif
new file mode 100644
index 0000000000..3e705205a6
Binary files /dev/null and
b/website/static/assets/images/blog/2022-05-17-multimodal-index.gif differ