This is an automated email from the ASF dual-hosted git repository.
bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 3e85f4d80e [DOCS] Add image assets and fix blog post styles (#5613)
3e85f4d80e is described below
commit 3e85f4d80ee8fb5a364e67de56edc066ca2329e3
Author: Bhavani Sudha Saktheeswaran <[email protected]>
AuthorDate: Tue May 17 15:37:25 2022 -0700
[DOCS] Add image assets and fix blog post styles (#5613)
Co-authored-by: Bhavani Sudha Saktheeswaran <[email protected]>
---
...-efficient-migration-of-large-parquet-tables.md | 22 ++++++++---------
...2020-08-21-async-compaction-deployment-model.md | 14 +++++------
...gh-perf-data-lake-with-hudi-and-alluxio-t3go.md | 26 ++++++++++-----------
website/blog/2021-01-27-hudi-clustering-intro.md | 20 ++++++++--------
website/blog/2021-03-01-hudi-file-sizing.md | 6 ++---
website/blog/2021-08-18-virtual-keys.md | 18 +++++++-------
...se-concurrency-control-are-we-too-optimistic.md | 9 ++++---
...hudi-zorder-and-hilbert-space-filling-curves.md | 10 ++++----
...atures-from-Apache-Hudi-0.9.0-on-Amazon-EMR.mdx | 1 +
...efficiency-at-scale-in-big-data-file-format.png | Bin 0 -> 38755 bytes
...2022-02-02-onehouse-commitment-to-openness.jpeg | Bin 0 -> 386047 bytes
.../images/blog/2022-02-03-onehouse_billboard.png | Bin 0 -> 554823 bytes
.../2022-02-17-fresher-data-lake-on-aws-s3.png | Bin 0 -> 96170 bytes
...1-low-latency-pipeline-using-msk-flink-hudi.png | Bin 0 -> 40488 bytes
...3-09-serverless-pipeline-using-glue-hudi-s3.png | Bin 0 -> 142433 bytes
.../2022-04-04-halodoc-lakehouse-architecture.png | Bin 0 -> 251301 bytes
.../images/blog/2022-05-17-multimodal-index.gif | Bin 0 -> 607295 bytes
17 files changed, 63 insertions(+), 63 deletions(-)
diff --git
a/website/blog/2020-08-20-efficient-migration-of-large-parquet-tables.md
b/website/blog/2020-08-20-efficient-migration-of-large-parquet-tables.md
index cd959ce28f..75144dc907 100644
--- a/website/blog/2020-08-20-efficient-migration-of-large-parquet-tables.md
+++ b/website/blog/2020-08-20-efficient-migration-of-large-parquet-tables.md
@@ -8,14 +8,14 @@ category: blog
We will look at how to migrate a large parquet table to Hudi without having to
rewrite the entire dataset.
<!--truncate-->
-# Motivation:
+## Motivation:
Apache Hudi maintains per record metadata to perform core operations such as
upserts and incremental pull. To take advantage of Hudi’s upsert and
incremental processing support, users would need to rewrite their whole dataset
to make it an Apache Hudi table. Hudi 0.6.0 comes with an ***experimental
feature*** to support efficient migration of large Parquet tables to Hudi
without the need to rewrite the entire dataset.
-# High Level Idea:
+## High Level Idea:
-## Per Record Metadata:
+### Per Record Metadata:
Apache Hudi maintains record-level metadata to perform efficient upserts and
incremental pull.
@@ -31,11 +31,11 @@ The parts (1) and (3) constitute what we term as “Hudi
skeleton”. Hudi skel

-# Design Deep Dive:
+## Design Deep Dive:
For a deep dive on the internals, please take a look at the [RFC
document](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi)
-# Migration:
+## Migration:
Hudi supports 2 modes when migrating parquet tables. We will use the term
bootstrap and migration interchangeably in this document.
@@ -45,10 +45,10 @@ Hudi supports 2 modes when migrating parquet tables. We
will use the term boots
You can pick and choose these modes at the partition level. A common
strategy is to use FULL_RECORD mode for a small set of "hot" partitions
which are accessed more frequently, and METADATA_ONLY for a larger set of "warm"
partitions.
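The per-partition selection that the regex configs below describe can be sketched in a few lines. This is an illustrative sketch only (the class and method names are hypothetical, not Hudi's actual `BootstrapRegexModeSelector` implementation): partitions matching `hoodie.bootstrap.mode.selector.regex` receive the mode from `hoodie.bootstrap.mode.selector.regex.mode`, and every other partition falls back to the other mode.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class BootstrapModeSelector {

    // Partitions matching the regex get regexMode; all others get defaultMode.
    // Mirrors the behavior described by hoodie.bootstrap.mode.selector.regex
    // and hoodie.bootstrap.mode.selector.regex.mode (illustrative only).
    public static Map<String, String> selectModes(List<String> partitions,
                                                  String regex,
                                                  String regexMode,
                                                  String defaultMode) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, String> modes = new LinkedHashMap<>();
        for (String partition : partitions) {
            modes.put(partition,
                pattern.matcher(partition).matches() ? regexMode : defaultMode);
        }
        return modes;
    }

    public static void main(String[] args) {
        // "Hot" 2020/08 partitions get FULL_RECORD; "warm" ones stay METADATA_ONLY.
        System.out.println(selectModes(
            Arrays.asList("2020/08/20", "2020/08/21", "2019/01/01"),
            "2020/08/.*", "FULL_RECORD", "METADATA_ONLY"));
    }
}
```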
-## Query Engine Support:
+### Query Engine Support:
For a METADATA_ONLY bootstrapped table, the Spark data source, Spark-Hive, and
native Hive query engines are supported. Presto support is in the works.
-## Ways To Migrate :
+### Ways To Migrate :
There are 2 ways to migrate a large parquet table to Hudi.
@@ -57,7 +57,7 @@ There are 2 ways to migrate a large parquet table to Hudi.
We will look at how to migrate using both these approaches.
-## Configurations:
+### Configurations:
These are bootstrap-specific configurations that need to be set in addition
to regular Hudi write configurations.
@@ -73,7 +73,7 @@ These are bootstrap specific configurations that needs to be
set in addition to
| hoodie.bootstrap.mode.selector.regex.mode |METADATA_ONLY |No |Bootstrap Mode
used when the partition matches the regex pattern in
hoodie.bootstrap.mode.selector.regex . Used only when
hoodie.bootstrap.mode.selector set to BootstrapRegexModeSelector. |
| hoodie.bootstrap.mode.selector.regex |\.\* |No |Partition Regex used when
hoodie.bootstrap.mode.selector set to BootstrapRegexModeSelector. |
-## Spark Data Source:
+### Spark Data Source:
Here, we use a Spark Datasource write to perform the bootstrap.
Below is an example code snippet to perform a METADATA_ONLY bootstrap.
@@ -127,7 +127,7 @@ bootstrapDF.write
.save(basePath)
```
-## Hoodie DeltaStreamer:
+### Hoodie DeltaStreamer:
Hoodie DeltaStreamer allows bootstrap to be performed using the --run-bootstrap
command line option.
@@ -170,6 +170,6 @@ spark-submit --package
org.apache.hudi:hudi-spark-bundle_2.11:0.6.0
--hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=METADATA_ONLY
```
-## Known Caveats
+### Known Caveats
1. Need proper defaults for the bootstrap config:
hoodie.bootstrap.full.input.provider. Here is the
[ticket](https://issues.apache.org/jira/browse/HUDI-1213)
1. DeltaStreamer manages checkpoints inside hoodie commit files and expects
checkpoints in previously committed metadata. Users are expected to pass a
checkpoint or an initial checkpoint provider when performing bootstrap through
DeltaStreamer. Such support is not present when doing bootstrap using the Spark
Datasource. Here is the
[ticket](https://issues.apache.org/jira/browse/HUDI-1214).
diff --git a/website/blog/2020-08-21-async-compaction-deployment-model.md
b/website/blog/2020-08-21-async-compaction-deployment-model.md
index 3ffa1b4508..5e6eec2657 100644
--- a/website/blog/2020-08-21-async-compaction-deployment-model.md
+++ b/website/blog/2020-08-21-async-compaction-deployment-model.md
@@ -7,7 +7,7 @@ category: blog
We will look at different deployment models for executing compactions
asynchronously.
<!--truncate-->
-# Compaction
+## Compaction
For Merge-On-Read tables, data is stored using a combination of columnar (e.g.
parquet) and row-based (e.g. avro) file formats.
Updates are logged to delta files & later compacted to produce new versions of
columnar files synchronously or
@@ -15,7 +15,7 @@ asynchronously. One of the main motivations behind
Merge-On-Read is to reduce dat
Hence, it makes sense to run compaction asynchronously without blocking
ingestion.
-# Async Compaction
+## Async Compaction
Async Compaction is performed in 2 steps:
@@ -24,11 +24,11 @@ slices** to be compacted. A compaction plan is finally
written to Hudi timeline.
1. ***Compaction Execution***: A separate process reads the compaction plan
and performs compaction of file slices.
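The two-step split above can be sketched with a toy model (hypothetical names, not Hudi's APIs): scheduling produces a plan from file slices, and a separate execution step consumes that plan.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AsyncCompactionSketch {

    // Step 1: Compaction Scheduling - select file slices whose pending delta
    // (log) file count crosses a threshold, yielding a "plan". (Hudi persists
    // the plan to the timeline; here we simply return it.)
    public static List<String> schedulePlan(Map<String, Integer> deltaFilesPerSlice,
                                            int minDeltaFiles) {
        List<String> plan = new ArrayList<>();
        for (Map.Entry<String, Integer> e : deltaFilesPerSlice.entrySet()) {
            if (e.getValue() >= minDeltaFiles) {
                plan.add(e.getKey());
            }
        }
        return plan;
    }

    // Step 2: Compaction Execution - a separate process reads the plan and
    // rewrites each planned slice into a new columnar base file (simulated).
    public static List<String> executePlan(List<String> plan) {
        List<String> newBaseFiles = new ArrayList<>();
        for (String fileSlice : plan) {
            newBaseFiles.add(fileSlice + "_compacted.parquet");
        }
        return newBaseFiles;
    }

    public static void main(String[] args) {
        Map<String, Integer> slices = new LinkedHashMap<>();
        slices.put("slice-1", 1);
        slices.put("slice-2", 4);
        // Only slice-2 crosses the threshold and gets compacted.
        System.out.println(executePlan(schedulePlan(slices, 3)));
    }
}
```

Because the plan is a standalone artifact on the timeline, the executor can run in any of the deployment models discussed next.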
-# Deployment Models
+## Deployment Models
There are a few ways by which we can execute compactions asynchronously.
-## Spark Structured Streaming
+### Spark Structured Streaming
With 0.6.0, we now have support for running async compactions in Spark
Structured Streaming jobs. Compactions are scheduled and executed
asynchronously inside the
@@ -60,7 +60,7 @@ import org.apache.spark.sql.streaming.ProcessingTime;
writer.trigger(new ProcessingTime(30000)).start(tablePath);
```
-## DeltaStreamer Continuous Mode
+### DeltaStreamer Continuous Mode
Hudi DeltaStreamer provides a continuous ingestion mode where a single long-running
Spark application
ingests data to a Hudi table continuously from upstream sources. In this mode,
Hudi supports managing asynchronous
compactions. Here is an example snippet for running in continuous mode with
async compactions:
@@ -78,7 +78,7 @@ spark-submit --packages
org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
--continuous
```
-## Hudi CLI
+### Hudi CLI
Hudi CLI is yet another way to execute specific compactions asynchronously.
Here is an example
```properties
@@ -86,7 +86,7 @@ hudi:trips->compaction run --tableName <table_name>
--parallelism <parallelism>
...
```
-## Hudi Compactor Script
+### Hudi Compactor Script
Hudi provides a standalone tool to also execute specific compactions
asynchronously. Here is an example
```properties
diff --git
a/website/blog/2020-12-01-high-perf-data-lake-with-hudi-and-alluxio-t3go.md
b/website/blog/2020-12-01-high-perf-data-lake-with-hudi-and-alluxio-t3go.md
index a58b2e585e..75b87a3fb8 100644
--- a/website/blog/2020-12-01-high-perf-data-lake-with-hudi-and-alluxio-t3go.md
+++ b/website/blog/2020-12-01-high-perf-data-lake-with-hudi-and-alluxio-t3go.md
@@ -5,18 +5,18 @@ author: t3go
category: blog
---
-# Building High-Performance Data Lake Using Apache Hudi and Alluxio at T3Go
+## Building High-Performance Data Lake Using Apache Hudi and Alluxio at T3Go
[T3Go](https://www.t3go.cn/) is China’s first platform for smart travel based
on the Internet of Vehicles. In this article, Trevor Zhang and Vino Yang from
T3Go describe the evolution of their data lake architecture, built on
cloud-native or open-source technologies including Alibaba OSS, Apache Hudi,
and Alluxio. Today, their data lake stores petabytes of data, supporting
hundreds of pipelines and tens of thousands of tasks daily. It is essential for
business units at T3Go including Da [...]
In this blog, you will see how we slashed data ingestion time by half using
Hudi and Alluxio. Furthermore, data analysts using Presto, Hudi, and Alluxio
saw the queries speed up by 10 times. We built our data lake based on data
orchestration for multiple stages of our data pipeline, including ingestion and
analytics.
<!--truncate-->
-# I. T3Go data lake Overview
+## I. T3Go data lake Overview
Prior to the data lake, different business units within T3Go managed their own
data processing solutions, utilizing different storage systems, ETL tools, and
data processing frameworks. Data for each became siloed from every other unit,
significantly increasing cost and complexity. Due to the rapid business
expansion of T3Go, this inefficiency became our engineering bottleneck.
We moved to a unified data lake solution based on Alibaba OSS, an object store
similar to AWS S3, to provide a centralized location to store structured and
unstructured data, following the design principles of _Multi-cluster
Shared-data Architecture_; all the applications access OSS storage as the
source of truth, as opposed to different data silos. This architecture allows
us to store the data as-is, without having to first structure the data, and run
different types of analytics to gu [...]
-# II. Efficient Near Real-time Analytics Using Hudi
+## II. Efficient Near Real-time Analytics Using Hudi
Our business in smart travel drives the need to process and analyze data in a
near real-time manner. With a traditional data warehouse, we faced the
following challenges:
@@ -31,21 +31,21 @@ As a result, we adopted Apache Hudi on top of OSS to
address these issues. The f

-## Enable Near real time data ingestion and analysis
+### Enable Near real time data ingestion and analysis
With Hudi, our data lake supports multiple data sources including Kafka, MySQL
binlog, GIS, and other business logs in near real time. As a result, more than
60% of the company’s data is stored in the data lake and this proportion
continues to increase.
We are also able to speed up the data ingestion time down to a few minutes by
introducing Apache Hudi into the data pipeline. Combined with big data
interactive query and analysis framework such as Presto and SparkSQL, real-time
data analysis and insights are achieved.
-## Enable Incremental processing pipeline
+### Enable Incremental processing pipeline
With the help of Hudi, it is possible to provide incremental changes to the
downstream derived table when the upstream table updates frequently. Even with
a large number of interdependent tables, we can quickly run partial data
updates. This also effectively avoids updating the full partitions of cold
tables in the traditional Hive data warehouse.
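The core idea behind incremental consumption can be sketched as a filter on commit times, which is conceptually what Hudi's incremental queries do with the `_hoodie_commit_time` meta field (the helper below is hypothetical, not a Hudi API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IncrementalPullSketch {

    // Keep only records whose commit time is later than the consumer's last
    // checkpoint, so a downstream derived table processes just the changes.
    public static List<String> pullSince(Map<String, String> commitTimeByRecord,
                                         String lastCheckpoint) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, String> e : commitTimeByRecord.entrySet()) {
            // Hudi commit times sort lexicographically (yyyyMMddHHmmss), so a
            // plain string comparison suffices for this illustration.
            if (e.getValue().compareTo(lastCheckpoint) > 0) {
                changed.add(e.getKey());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<String, String> commits = new LinkedHashMap<>();
        commits.put("order-1", "20201201093000");
        commits.put("order-2", "20201201103000");
        // Only order-2 changed after the checkpoint.
        System.out.println(pullSince(commits, "20201201100000"));
    }
}
```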
-## Accessing Data using Hudi as a unified format
+### Accessing Data using Hudi as a unified format
Traditional data warehouses often deploy Hadoop to store data and provide
batch analysis. Kafka is used separately to distribute Hadoop data to other
data processing frameworks, resulting in duplicated data. Hudi helps
effectively solve this problem; we always use Spark pipelines to insert new
updates into the Hudi tables, then incrementally read the update of Hudi
tables. In other words, Hudi tables are used as the unified storage format to
access data.
-# III. Efficient Data Caching Using Alluxio
+## III. Efficient Data Caching Using Alluxio
In the early version of our data lake without Alluxio, data received from
Kafka in real time was processed by Spark and then written to the OSS data lake
using Hudi DeltaStreamer tasks. With this architecture, Spark often suffered
high network latency when writing to OSS directly. Since all data is in OSS
storage, OLAP queries on Hudi data could also be slow due to lack of data
locality.
@@ -57,21 +57,21 @@ Data in formats such as Hudi, Parquet, ORC, and JSON are
stored mostly on OSS, c
Specifically, here are a few applications leveraging Alluxio in the T3Go data
lake.
-## Data lake ingestion
+### Data lake ingestion
We mount the corresponding OSS path to the Alluxio file system and set Hudi’s
`target-base-path` parameter value to use the alluxio:// scheme in place
of the oss:// scheme. Spark pipelines with Hudi continuously ingest data to
Alluxio. After data is written to Alluxio, it is asynchronously persisted from
the Alluxio cache to the remote OSS every minute. These modifications allow
Spark to write to a local Alluxio node instead of writing to remote OSS,
significantly reducing the time f [...]
-## Data analysis on the lake
+### Data analysis on the lake
We use Presto as an ad-hoc query engine to analyze the Hudi tables in the
lake, co-locating Alluxio workers on each Presto worker node. When Presto and
Alluxio services are co-located and running, Alluxio caches the input data
locally in the Presto worker which greatly benefits Presto for subsequent
retrievals. On a cache hit, Presto can read from the local Alluxio worker
storage at memory speed without any additional data transfer over the network.
-## Concurrent accesses across multiple storage systems
+### Concurrent accesses across multiple storage systems
In order to ensure the accuracy of training samples, our machine learning team
often synchronizes desensitized data in production to an offline machine
learning environment. During synchronization, the data flows across multiple
file systems, from production OSS to an offline HDFS followed by another
offline Machine Learning HDFS.
This data migration process is not only inefficient but also error-prone for
modelers, because multiple different storage systems with varying configurations are
involved.
storage systems under the same filesystem to be accessed by their corresponding
logical paths in Alluxio namespace. By decoupling the physical storage, this
allows applications with different APIs to access and transfer data seamlessly.
This data access layout [...]
-## Microbenchmark
+### Microbenchmark
Overall, we observed the following improvements with Alluxio:
@@ -89,12 +89,12 @@ In the stress test shown above, after the data volume is
greater than a certain
Based on our performance benchmarking, we found that the performance can be
improved by over 10 times with the help of Alluxio. Furthermore, the larger the
data scale, the more prominent the performance improvement.
-# IV. Next Step
+## IV. Next Step
As T3Go’s data lake ecosystem expands, we will continue facing the critical
scenario of compute and storage segregation. With T3Go’s growing data
processing needs, our team plans to deploy Alluxio on a larger scale to
accelerate our data lake storage.
In addition to the deployment of Alluxio on the data lake computing engine,
which currently is mainly SparkSQL, we plan to add a layer of Alluxio to the
OLAP cluster using Apache Kylin and an ad-hoc cluster using Presto. The goal is
to have Alluxio cover all computing scenarios, with Alluxio interconnected
across these scenarios to improve the read and write efficiency of the data lake
and the surrounding lake ecosystem.
-# V. Conclusion
+## V. Conclusion
As mentioned earlier, Hudi and Alluxio together cover all scenarios of Hudi’s near
real-time ingestion, near real-time analysis, incremental processing, and data
distribution on DFS, among many others, and play the role of a powerful
accelerator for data ingestion and data analysis on the lake. With Hudi and
Alluxio together, **our R&D engineers shortened the time for data ingestion
into the lake by up to a factor of 2. Data analysts using Presto, Hudi, and
Alluxio in conjunction to query data [...]
diff --git a/website/blog/2021-01-27-hudi-clustering-intro.md
b/website/blog/2021-01-27-hudi-clustering-intro.md
index 5f47ffe411..b55d41b33d 100644
--- a/website/blog/2021-01-27-hudi-clustering-intro.md
+++ b/website/blog/2021-01-27-hudi-clustering-intro.md
@@ -5,12 +5,12 @@ author: satish.kotha
category: blog
---
-# Background
+## Background
Apache Hudi brings stream processing to big data, providing fresh data while
being an order of magnitude more efficient than traditional batch processing. In a
data lake/warehouse, one of the key trade-offs is between ingestion speed and
query performance. Data ingestion typically prefers small files to improve
parallelism and make data available to queries as soon as possible. However,
query performance degrades significantly with a lot of small files. Also, during
ingestion, data is typically co-l [...]
<!--truncate-->
-# Clustering Architecture
+## Clustering Architecture
At a high level, Hudi provides different operations such as
insert/upsert/bulk_insert through its write client API to write
data to a Hudi table. To choose a trade-off between file size and
ingestion speed, Hudi provides a knob `hoodie.parquet.small.file.limit` to
configure the smallest allowable file size. Users can set
the small file [soft
limit](https://hudi.apache.org/docs/configurations#compactionSmallFileSize) to
`0` to force new data [...]
@@ -22,13 +22,13 @@ Clustering table service can run asynchronously or
synchronously adding a new ac
-### Overall, there are 2 parts to clustering
+#### Overall, there are 2 parts to clustering
1. Scheduling clustering: Create a clustering plan using a pluggable
clustering strategy.
2. Execute clustering: Process the plan using an execution strategy to create
new files and replace old files.
-### Scheduling clustering
+#### Scheduling clustering
The following steps are used to schedule clustering.
@@ -37,7 +37,7 @@ Following steps are followed to schedule clustering.
3. Finally, the clustering plan is saved to the timeline in an avro [metadata
format](https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieClusteringPlan.avsc).
-### Running clustering
+#### Running clustering
1. Read the clustering plan and get the ‘clusteringGroups’ that mark the file
groups that need to be clustered.
2. For each group, we instantiate appropriate strategy class with
strategyParams (example: sortColumns) and apply that strategy to rewrite the
data.
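Why rewriting data sorted by the strategy's `sortColumns` helps can be shown with a toy example (a hypothetical helper, not Hudi code): count how many fixed-size files contain a given session id before and after sorting by session.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ClusteringLocalitySketch {

    // Counts how many fixed-size "files" contain a given session id, to show
    // why sorting (clustering) by session shrinks the set of files a query
    // with a session predicate has to scan.
    public static int filesContaining(List<Integer> sessionIds, int fileSize, int target) {
        int files = 0;
        for (int start = 0; start < sessionIds.size(); start += fileSize) {
            Set<Integer> inFile = new HashSet<>(sessionIds.subList(
                start, Math.min(start + fileSize, sessionIds.size())));
            if (inFile.contains(target)) {
                files++;
            }
        }
        return files;
    }

    public static void main(String[] args) {
        // Arrival order interleaves sessions, so session 1 lands in every file.
        List<Integer> arrivalOrder = Arrays.asList(1, 2, 3, 1, 2, 3, 1, 2, 3);
        System.out.println(filesContaining(arrivalOrder, 3, 1)); // 3

        // After sorting by session, session 1 is confined to a single file.
        List<Integer> clustered = new ArrayList<>(arrivalOrder);
        Collections.sort(clustered);
        System.out.println(filesContaining(clustered, 3, 1)); // 1
    }
}
```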
@@ -51,7 +51,7 @@ NOTE: Clustering can only be scheduled for tables /
partitions not receiving any

_Figure: Illustrating query performance improvements by clustering_
-### Setting up clustering
+#### Setting up clustering
Inline clustering can be set up easily using Spark dataframe options. See the
sample below.
```scala
@@ -83,7 +83,7 @@ df.write.format("org.apache.hudi").
For more advanced use cases, an async clustering pipeline can also be set up. See
an example
[here](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-SetupforAsyncclusteringJob).
-# Table Query Performance
+## Table Query Performance
We created a dataset from one partition of a known production style table with
~20M records and on-disk size of ~200GB. The dataset has rows for multiple
“sessions”. Users always query this data using a predicate on session. Data for
a single session is spread across multiple data files because ingestion groups
data based on arrival time. The below experiment shows that by clustering on
session, we are able to improve the data locality and reduce query execution
time by more than 50%.
@@ -92,14 +92,14 @@ Query:
spark.sql("select * from table where session_id=123")
```
-## Before Clustering
+### Before Clustering
The query took 2.2 minutes to complete. Note that the number of output rows in the
“scan parquet” part of the query plan includes all 20M rows in the table.

_Figure: Spark SQL query details before clustering_
-## After Clustering
+### After Clustering
The query plan is similar to the one above. But, because of improved data locality and
predicate push down, Spark is able to prune a lot of rows. After clustering,
the same query only outputs 110K rows (out of 20M rows) while scanning parquet
files. This cuts query time to less than a minute, from 2.2 minutes.
@@ -118,7 +118,7 @@ Query runtime is reduced by 60% after clustering. Similar
results were observed
We expect dramatic speedup for large tables, where the query runtime is almost
entirely dominated by actual I/O and not query planning, unlike the example
above.
-# Summary
+## Summary
Using clustering, we can improve query performance by
1. Leveraging concepts such as [space filling
curves](https://en.wikipedia.org/wiki/Z-order_curve) to adapt data lake layout
and reduce the amount of data read during queries.
diff --git a/website/blog/2021-03-01-hudi-file-sizing.md
b/website/blog/2021-03-01-hudi-file-sizing.md
index 3d3049601d..463ee0259b 100644
--- a/website/blog/2021-03-01-hudi-file-sizing.md
+++ b/website/blog/2021-03-01-hudi-file-sizing.md
@@ -11,7 +11,7 @@ manual table maintenance. Having a lot of small files will
make it harder to ach
having to open/read/close files way too many times, to plan and execute
queries. But for streaming data lake use-cases,
ingests inherently end up having a smaller volume of writes, which
might result in a lot of small files if no special handling is done.
<!--truncate-->
-# During Write vs After Write
+## During Write vs After Write
Common approaches of writing very small files and then later stitching them
together solve the system scalability issues posed
by small files, but might violate query SLAs by exposing small files to queries. In fact, you can easily do so on a Hudi table,
@@ -25,7 +25,7 @@ Hudi has the ability to maintain a configured target file
size, when performing
(Note: bulk_insert operation does not provide this functionality and is
designed as a simpler replacement for
normal `spark.write.parquet`).
-## Configs
+### Configs
For illustration purposes, we are going to consider only the COPY_ON_WRITE table type.
@@ -41,7 +41,7 @@ would be considered a small file.
If you wish to turn off this feature, set the config value for soft file limit
to 0.
-## Example
+### Example
Let’s say this is the layout of data files for a given partition.
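The sizing decision itself can be sketched as follows. This is a simplified, hypothetical helper: files below the soft limit are candidates to receive new inserts until they reach the max file size (`hoodie.parquet.max.file.size` is assumed here as the companion max-size config; the real write-path routing is more involved).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FileSizingSketch {

    // Returns, for each "small" file (below the soft limit), how many more
    // bytes of new inserts it can absorb before reaching the max file size.
    // Mirrors the idea behind hoodie.parquet.small.file.limit (simplified).
    public static Map<String, Long> spareCapacity(Map<String, Long> fileSizes,
                                                  long smallFileLimit,
                                                  long maxFileSize) {
        Map<String, Long> capacity = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : fileSizes.entrySet()) {
            if (e.getValue() < smallFileLimit) {
                // A "small" file is bin-packed with inserts up to the max size.
                capacity.put(e.getKey(), maxFileSize - e.getValue());
            }
        }
        return capacity;
    }

    public static void main(String[] args) {
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("f1.parquet", 40L);   // MB, below a 100 MB soft limit
        files.put("f2.parquet", 110L);  // already above the limit
        System.out.println(spareCapacity(files, 100L, 120L)); // {f1.parquet=80}
    }
}
```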
diff --git a/website/blog/2021-08-18-virtual-keys.md
b/website/blog/2021-08-18-virtual-keys.md
index c1ce8b5b09..57e44da270 100644
--- a/website/blog/2021-08-18-virtual-keys.md
+++ b/website/blog/2021-08-18-virtual-keys.md
@@ -13,13 +13,13 @@ In addition, it ensures data quality by ensuring unique key
constraints are enfo
But one of the repeated asks from the community is to leverage existing fields
and not to add additional meta fields, for simple use-cases where such benefits
are not desired or key changes are very rare.
<!--truncate-->
-# Virtual Key support
+## Virtual Key support
Hudi now supports virtual keys, where Hudi meta fields can be computed on
demand from the data fields. Currently, the meta fields are
computed once and stored as per record metadata and re-used across various
operations. If one does not need incremental query support,
they can start leveraging Hudi's Virtual key support and still go about using
Hudi to build and manage their data lake to reduce the storage
overhead due to per record metadata.
-## Configurations
+### Configurations
Virtual keys can be enabled for a given table using the below config. When
`hoodie.populate.meta.fields=false` is set,
Hudi will use virtual keys for the corresponding table. The default value for this
config is `true`, which means all meta fields are added by default.
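A minimal sketch of what "computed on demand from the data fields" means for the record key (a hypothetical helper, mirroring what a SimpleKeyGenerator-style generator does rather than Hudi's actual class):

```java
import java.util.Map;

public class VirtualKeySketch {

    // With hoodie.populate.meta.fields=false, the record key is not stored in
    // a _hoodie_record_key meta column; it is derived on demand from a data
    // field, the way a simple key generator would. Illustrative only.
    public static String recordKey(Map<String, String> record, String keyField) {
        String value = record.get(keyField);
        if (value == null) {
            throw new IllegalArgumentException("Key field missing: " + keyField);
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> row = Map.of("uuid", "rider-42", "fare", "12.5");
        System.out.println(recordKey(row, "uuid")); // rider-42
    }
}
```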
@@ -36,24 +36,24 @@ would entail reading all fields out of base and delta logs,
sacrificing core col
for users. Thus, we support only simple key generators (the default key
generator, where both record key and partition path refer
to an existing field) for now.
-### Supported Key Generators with CopyOnWrite(COW) table:
+#### Supported Key Generators with CopyOnWrite(COW) table:
SimpleKeyGenerator, ComplexKeyGenerator, CustomKeyGenerator,
TimestampBasedKeyGenerator and NonPartitionedKeyGenerator.
-### Supported Key Generators with MergeOnRead(MOR) table:
+#### Supported Key Generators with MergeOnRead(MOR) table:
SimpleKeyGenerator
-### Supported Index types:
+#### Supported Index types:
Only "SIMPLE" and "GLOBAL_SIMPLE" index types are supported in the first cut.
We plan to add support for other index
types (BLOOM, etc.) in future releases.
-## Supported Operations
+### Supported Operations
All existing features are supported for a Hudi table with virtual keys, except
incremental
queries. This means cleaning, archiving, metadata table, clustering, etc. can
be enabled for a Hudi table with
virtual keys enabled. So, you can merely use Hudi as a transactional
table format with all the awesome
table service runtimes and platform services, if you wish to do so, without
incurring any overheads associated with
support for incremental data processing.
-## Sample Output
+### Sample Output
As called out earlier, one has to set `hoodie.populate.meta.fields=false` to
enable virtual keys. Let's see the
difference between records of a hudi table with and without virtual keys.
@@ -99,7 +99,7 @@ And here are some sample records for a hudi table with
virtual keys enabled.
As you can see, all meta fields are null in storage, but all user fields
remain intact, similar to a regular table.
:::
-## Incremental Queries
+### Incremental Queries
Since Hudi does not maintain any metadata (like commit time at a record level)
for a table with virtual keys enabled,
incremental queries are not supported. An exception will be thrown, as below,
when an incremental query is triggered for such
a table.
@@ -121,7 +121,7 @@ org.apache.hudi.exception.HoodieException: Incremental
queries are not supported
... 61 elided
```
-## Conclusion
+### Conclusion
Hope this blog was useful for you to learn yet another feature in Apache Hudi.
If you are interested in
Hudi and looking to contribute, do check out
[here](https://hudi.apache.org/contribute/get-involved).
diff --git
a/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
b/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
index f45ace997a..1072ace5f7 100644
---
a/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
+++
b/website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
@@ -14,18 +14,17 @@ Having had the good fortune of working on diverse database
projects - an RDBMS (
First, let's set the record straight. RDBMS databases offer the richest set of
transactional capabilities and the widest array of concurrency control
[mechanisms](https://dev.mysql.com/doc/refman/5.7/en/innodb-locking-transaction-model.html).
Different isolation levels, fine grained locking, deadlock
detection/avoidance, and more are possible because they have to support
row-level mutations and reads across many tables while enforcing [key
constraints](https://dev.mysql.com/doc/refman/8. [...]
-# Pitfalls in Lake Concurrency Control
+### Pitfalls in Lake Concurrency Control
Historically, data lakes have been viewed as batch jobs reading/writing files
on cloud storage, and it's interesting to see how most new work extends this
view and implements glorified file version control using some form of
"[**Optimistic concurrency
control**](https://en.wikipedia.org/wiki/Optimistic_concurrency_control)"
(OCC). With OCC, jobs take a table-level lock to check if they have impacted
overlapping files and, if a conflict exists, they abort their operations
completely. Without [...]
Imagine a real-life scenario of two writer processes: an ingest writer job
producing new data every 30 minutes and a deletion writer job that is enforcing
GDPR, taking 2 hours to issue deletes. It's very likely for these to overlap
files with random deletes, and the deletion job is almost guaranteed to starve
and fail to commit each time. In database speak, mixing long-running
transactions with optimism leads to disappointment, since the longer the
transactions the higher the probabilit [...]

-
static/assets/images/blog/concurrency/ConcurrencyControlConflicts.png
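The table-level OCC check that starves the deletion job above can be reduced to a set intersection (an illustrative sketch, not Hudi's implementation):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class OccConflictSketch {

    // The commit-time check OCC performs, reduced to its core: if a concurrent
    // writer already committed changes to any file this writer also touched,
    // this writer must abort and redo all of its work.
    public static boolean canCommit(Set<String> committedFiles, Set<String> myFiles) {
        Set<String> overlap = new HashSet<>(myFiles);
        overlap.retainAll(committedFiles);
        return overlap.isEmpty();
    }

    public static void main(String[] args) {
        // The 30-minute ingest job commits first...
        Set<String> ingest = new HashSet<>(Arrays.asList("f1", "f2"));
        // ...so the 2-hour GDPR deletion, whose random deletes overlap f2,
        // fails its check and starves.
        Set<String> deletion = new HashSet<>(Arrays.asList("f2", "f9"));
        System.out.println(canCommit(ingest, deletion)); // false -> abort, retry
    }
}
```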
So, what's the alternative? Locking? Wikipedia also says - "_However,
locking-based ("pessimistic") methods also can deliver poor performance because
locking can drastically limit effective concurrency even when deadlocks are
avoided."._ Here is where Hudi takes a different approach, that we believe is
more apt for modern lake transactions which are typically long-running and even
continuous. Data lake workloads share more characteristics with high throughput
stream processing jobs, than [...]
-# Model 1 : Single Writer, Inline Table Services
+### Model 1 : Single Writer, Inline Table Services
The simplest form of concurrency control is just no concurrency at all. A data
lake table often has common services operating on it to ensure efficiency.
Reclaiming storage space from older versions and logs, coalescing files
(clustering in Hudi), merging deltas (compactions in Hudi), and more. Hudi can
simply eliminate the need for concurrency control and maximizes throughput by
supporting these table services out-of-box and running inline after every write
to the table.
@@ -33,13 +32,13 @@ Execution plans are idempotent, persisted to the timeline
and auto-recover from

-# Model 2 : Single Writer, Async Table Services
+### Model 2 : Single Writer, Async Table Services
Our delete/ingest example above isn't really that simple. While the ingest writer
may just be updating the last N partitions on the table, deletes may even span
the entire table. Mixing them in the same job could slow down ingest
latency by a lot. But Hudi provides the option of running the table services
in an async fashion, where most of the heavy lifting (e.g. actually rewriting
the columnar data by the compaction service) is done asynchronously, eliminating
any repeated wasteful retr [...]

-# Model 3 : Multiple Writers
+### Model 3 : Multiple Writers
But it's not always possible to serialize the deletes into the same write
stream, and sometimes SQL-based deletes are required. With multiple distributed processes,
some form of locking is inevitable, but like real databases, Hudi's concurrency
model is intelligent enough to differentiate actual writing to the table from
table services that manage or optimize the table. Hudi offers similar
optimistic concurrency control across multiple writers, but table services can
still execute completely lock-fr [...]
diff --git
a/website/blog/2021-12-29-hudi-zorder-and-hilbert-space-filling-curves.md
b/website/blog/2021-12-29-hudi-zorder-and-hilbert-space-filling-curves.md
index 2ebe72fea6..cda7b5c66e 100644
--- a/website/blog/2021-12-29-hudi-zorder-and-hilbert-space-filling-curves.md
+++ b/website/blog/2021-12-29-hudi-zorder-and-hilbert-space-filling-curves.md
@@ -10,7 +10,7 @@ As of Hudi v0.10.0, we are excited to introduce support for
an advanced Data Lay
<!--truncate-->
-## Background
+### Background
The Amazon EMR team recently published a [great
article](https://aws.amazon.com/blogs/big-data/new-features-from-apache-hudi-0-7-0-and-0-8-0-available-on-amazon-emr/)
showcasing how [clustering](https://hudi.apache.org/docs/clustering) your
data can improve your _query performance_.
@@ -71,7 +71,7 @@ In a similar fashion, Hilbert curves also allow you to map
points in a N-dimensi
Now, let's check it out in action!
-# Setup
+### Setup
We will use the [Amazon
Reviews](https://s3.amazonaws.com/amazon-reviews-pds/readme.html) dataset
again, but this time we will use Hudi to Z-order by the `product_id`, `customer_id`
column tuple instead of Clustering or _linear ordering_.
No special preparation is required for the dataset: you can simply download
it from [S3](https://s3.amazonaws.com/amazon-reviews-pds/readme.html) in
Parquet format and use it directly as input for Spark to ingest into a Hudi
table.
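
As a sketch of what the Z-ordering setup looks like, the write options below enable clustering with a space-filling-curve layout strategy. The key names follow the v0.10-era config surface and may have been renamed since, so take this as an illustration, not the exact snippet from the post:

```properties
# Illustrative: cluster the table and sort by a space-filling curve
hoodie.clustering.inline=true
hoodie.clustering.plan.strategy.sort.columns=product_id,customer_id
hoodie.layout.optimize.enable=true
hoodie.layout.optimize.strategy=z-order   # or "hilbert"
```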
@@ -150,7 +150,7 @@ df.write.format("hudi")
-# Testing
+### Testing
Please keep in mind that each individual test is run in a separate
spark-shell to avoid caching getting in the way of our measurements.
```scala
@@ -300,7 +300,7 @@ scala> runQuery3(dataSkippingSnapshotTableName)
+-----------+-----------+
```
-# Results
+### Results
We've summarized the measured performance metrics below:
| **Query** | **Baseline (B)** duration (files scanned / size) | **Linear
Sorting (S)** | **Z-order (Z)** duration (scanned) | **Hilbert (H)** duration
(scanned) |
@@ -315,6 +315,6 @@ Which is a very clear contrast with space-filling curves
(both Z-order and Hilbe
It's worth noting that the performance gains are heavily dependent on your
underlying data and queries. In benchmarks on our internal data we were able to
achieve query performance improvements of more than **11x**!
-# Epilogue
+### Epilogue
Apache Hudi v0.10 brings the new layout optimization capabilities, Z-order and
Hilbert curves, to open source. Using these industry-leading layout optimization
techniques can bring substantial performance improvements and cost savings to
your queries!
diff --git
a/website/blog/2022-04-04-New-features-from-Apache-Hudi-0.9.0-on-Amazon-EMR.mdx
b/website/blog/2022-04-04-New-features-from-Apache-Hudi-0.9.0-on-Amazon-EMR.mdx
index 9fa5b5abb9..538145b3a9 100644
---
a/website/blog/2022-04-04-New-features-from-Apache-Hudi-0.9.0-on-Amazon-EMR.mdx
+++
b/website/blog/2022-04-04-New-features-from-Apache-Hudi-0.9.0-on-Amazon-EMR.mdx
@@ -5,6 +5,7 @@ authors:
- name: Gabriele Cacciola
- name: Udit Mehrotra
category: blog
+image: /assets/images/powers/aws.jpg
---
import Redirect from '@site/src/components/Redirect';
diff --git
a/website/static/assets/images/blog/2022-01-25-cost-efficiency-at-scale-in-big-data-file-format.png
b/website/static/assets/images/blog/2022-01-25-cost-efficiency-at-scale-in-big-data-file-format.png
new file mode 100644
index 0000000000..34e818702f
Binary files /dev/null and
b/website/static/assets/images/blog/2022-01-25-cost-efficiency-at-scale-in-big-data-file-format.png
differ
diff --git
a/website/static/assets/images/blog/2022-02-02-onehouse-commitment-to-openness.jpeg
b/website/static/assets/images/blog/2022-02-02-onehouse-commitment-to-openness.jpeg
new file mode 100644
index 0000000000..a836cf7365
Binary files /dev/null and
b/website/static/assets/images/blog/2022-02-02-onehouse-commitment-to-openness.jpeg
differ
diff --git
a/website/static/assets/images/blog/2022-02-03-onehouse_billboard.png
b/website/static/assets/images/blog/2022-02-03-onehouse_billboard.png
new file mode 100644
index 0000000000..86f44ee020
Binary files /dev/null and
b/website/static/assets/images/blog/2022-02-03-onehouse_billboard.png differ
diff --git
a/website/static/assets/images/blog/2022-02-17-fresher-data-lake-on-aws-s3.png
b/website/static/assets/images/blog/2022-02-17-fresher-data-lake-on-aws-s3.png
new file mode 100644
index 0000000000..624264dacb
Binary files /dev/null and
b/website/static/assets/images/blog/2022-02-17-fresher-data-lake-on-aws-s3.png
differ
diff --git
a/website/static/assets/images/blog/2022-03-01-low-latency-pipeline-using-msk-flink-hudi.png
b/website/static/assets/images/blog/2022-03-01-low-latency-pipeline-using-msk-flink-hudi.png
new file mode 100644
index 0000000000..9e95594741
Binary files /dev/null and
b/website/static/assets/images/blog/2022-03-01-low-latency-pipeline-using-msk-flink-hudi.png
differ
diff --git
a/website/static/assets/images/blog/2022-03-09-serverless-pipeline-using-glue-hudi-s3.png
b/website/static/assets/images/blog/2022-03-09-serverless-pipeline-using-glue-hudi-s3.png
new file mode 100644
index 0000000000..118839e543
Binary files /dev/null and
b/website/static/assets/images/blog/2022-03-09-serverless-pipeline-using-glue-hudi-s3.png
differ
diff --git
a/website/static/assets/images/blog/2022-04-04-halodoc-lakehouse-architecture.png
b/website/static/assets/images/blog/2022-04-04-halodoc-lakehouse-architecture.png
new file mode 100644
index 0000000000..134fb77f64
Binary files /dev/null and
b/website/static/assets/images/blog/2022-04-04-halodoc-lakehouse-architecture.png
differ
diff --git a/website/static/assets/images/blog/2022-05-17-multimodal-index.gif
b/website/static/assets/images/blog/2022-05-17-multimodal-index.gif
new file mode 100644
index 0000000000..3e705205a6
Binary files /dev/null and
b/website/static/assets/images/blog/2022-05-17-multimodal-index.gif differ