This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 2c71f98  [BLOG] Apache Hudi is the Streaming Data Lake Platform (#3322)
2c71f98 is described below

commit 2c71f982cd4ef6f72a41db74bd98c1592e1c8c1c
Author: vinoth chandar <[email protected]>
AuthorDate: Mon Jul 26 23:00:22 2021 -0700

    [BLOG] Apache Hudi is the Streaming Data Lake Platform (#3322)
---
 docs/_config.yml                                   |   6 +-
 docs/_data/authors.yml                             |   2 +-
 docs/_pages/index.md                               |   2 +-
 .../2021-07-21-streaming-data-lake-platform.md     | 150 +++++++++++++++++++++
 docs/_sass/hudi_style/_variables.scss              |   5 +-
 .../Hudi_design_diagram_-_Page_2_1.png             | Bin 0 -> 52035 bytes
 .../Screen_Shot_2021-07-20_at_5.35.47_PM.png       | Bin 0 -> 163959 bytes
 .../images/blog/datalake-platform/hudi-comic.png   | Bin 0 -> 93630 bytes
 .../datalake-platform/hudi-data-lake-platform.png  | Bin 0 -> 128340 bytes
 .../hudi-data-lake-platform_-_Copy_of_Page_1_3.png | Bin 0 -> 130359 bytes
 .../hudi-data-lake-platform_-_Page_2_4.png         | Bin 0 -> 282177 bytes
 .../hudi-design-diagram_-incr-read.png             | Bin 0 -> 57567 bytes
 .../hudi-design-diagrams-table-format.png          | Bin 0 -> 42148 bytes
 .../hudi-design-diagrams_-_Page_2_1.png            | Bin 0 -> 70552 bytes
 .../hudi-design-diagrams_-_Page_4.png              | Bin 0 -> 81348 bytes
 .../hudi-design-diagrams_-_Page_5.png              | Bin 0 -> 123834 bytes
 .../hudi-design-diagrams_-_Page_6.png              | Bin 0 -> 74018 bytes
 .../hudi-design-diagrams_-_Page_7.png              | Bin 0 -> 95019 bytes
 .../hudi-design-diagrams_-_Page_8.png              | Bin 0 -> 39783 bytes
 19 files changed, 157 insertions(+), 8 deletions(-)

diff --git a/docs/_config.yml b/docs/_config.yml
index cc0e6a3..2db32ac 100644
--- a/docs/_config.yml
+++ b/docs/_config.yml
@@ -44,7 +44,7 @@ locale                   : "en-US"
 title                    : # "Apache Hudi"
 title_separator          : "-"
 subtitle                 : "" # *version
-description              : "Apache Hudi brings upserts, deletes and stream 
processing to data lakes built on HDFS or cloud storage."
+description              : "Apache Hudi is the Streaming Data Lake Platform."
 url                      : https://hudi.apache.org # the base hostname & 
protocol for your site e.g. "https://github.com/apache/hudi.git"
 repository               : "apache/incubator-hudi"
 teaser                   : "/assets/images/500x300.png" # path of fallback 
teaser image, e.g. "/assets/images/500x300.png"
@@ -56,7 +56,7 @@ site_url                 : https://hudi.apache.org
 # Site QuickLinks
 author:
   name             : "Quick Links"
-  bio              : "Hudi *ingests* & *manages* storage of large analytical 
datasets over DFS."
+  bio              : "Hudi is the Streaming Data Lake Platform."
   links:
     - label: "Documentation"
       icon: "fa fa-book"
@@ -82,7 +82,7 @@ author:
 
 cn_author:
   name             : "Quick Links"
-  bio              : "Hudi *ingests* & *manages* storage of large analytical 
datasets over DFS."
+  bio              : "Hudi is the Streaming Data Lake Platform."
   links:
     - label: "Documentation"
       icon: "fa fa-book"
diff --git a/docs/_data/authors.yml b/docs/_data/authors.yml
index 539bdd1..03bb569 100644
--- a/docs/_data/authors.yml
+++ b/docs/_data/authors.yml
@@ -6,7 +6,7 @@ admin:
 
 vinoth:
     name: Vinoth Chandar
-    web: https://cwiki.apache.org/confluence/display/~vinoth
+    web: https://twitter.com/byte_array
 
 rxu:
     name: Raymond Xu
diff --git a/docs/_pages/index.md b/docs/_pages/index.md
index 0736809..6ee98c6 100644
--- a/docs/_pages/index.md
+++ b/docs/_pages/index.md
@@ -3,7 +3,7 @@ layout: home
 permalink: /
 title: Welcome to Apache Hudi !
 excerpt: >
-  Apache Hudi ingests & manages storage of large analytical datasets over DFS 
(hdfs or cloud stores).<br />
+  Apache Hudi is the Streaming Data Lake Platform.<br />
  <small><a href="https://github.com/apache/hudi/releases/tag/release-0.8.0" 
target="_blank">Latest release 0.8.0</a></small>
 power_items:
   - img_path: /assets/images/powers/aws.jpg
diff --git a/docs/_posts/2021-07-21-streaming-data-lake-platform.md 
b/docs/_posts/2021-07-21-streaming-data-lake-platform.md
new file mode 100644
index 0000000..705313d
--- /dev/null
+++ b/docs/_posts/2021-07-21-streaming-data-lake-platform.md
@@ -0,0 +1,150 @@
+---
+title: "Apache Hudi - The Streaming Data Lake Platform"
+excerpt: "It's been called many things. But we have always been building a data lake platform"
+author: vinoth
+category: blog
+---
+
+As early as 2016, we set out a [bold, new 
vision](https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/)
 reimagining batch data processing through a new “**incremental**” data 
processing stack - alongside the existing batch and streaming stacks. 
+While a stream processing pipeline does row-oriented processing, delivering a 
few seconds of processing latency, an incremental pipeline would apply the same 
principles to *columnar* data in the data lake, 
delivering orders of magnitude improvements in processing efficiency within a 
few minutes, on extremely scalable batch storage/compute infrastructure. This 
new stack would be able to effortlessly support regular batch processing for 
bulk reprocessing/backfilling as well.
+Hudi was built as the manifestation of this vision, rooted in real, hard 
problems faced at [Uber](https://eng.uber.com/uber-big-data-platform/) and 
later took a life of its own in the open source community. Together, we have 
been able to 
+usher in fully incremental data ingestion and moderately complex ETLs on data 
lakes already.
+
+![the different components that make up the stream and batch processing stack 
today, showing how an incremental stack blends the best of both the 
worlds.](/assets/images/blog/datalake-platform/hudi-data-lake-platform_-_Page_2_4.png)
+
+Today, this grand vision of being able to express almost any batch pipeline 
incrementally is more attainable than it ever was. Stream processing is 
[maturing rapidly](https://flink.apache.org/blog/) and gaining [tremendous 
momentum](https://www.confluent.io/blog/every-company-is-becoming-software/), 
+with 
[generalization](https://flink.apache.org/2021/03/11/batch-execution-mode.html) 
of stream processing APIs to work over a batch execution model. Hudi completes 
the missing pieces of the puzzle by providing streaming optimized lake storage, 
+much like how Kafka/Pulsar enable efficient storage for event streaming. [Many 
organizations](https://hudi.apache.org/docs/powered_by.html) have already 
reaped real benefits of adopting a streaming model for their data lakes, in 
terms of fresh data, simplified architecture and great cost reductions.
+
+But first, we needed to tackle the basics - transactions and mutability - on 
the data lake. In many ways, Apache Hudi pioneered the transactional data lake 
movement as we know it today. Specifically, during a time when more 
special-purpose systems were being born, Hudi introduced a serverless 
transaction layer, which worked over the general-purpose Hadoop FileSystem 
abstraction on Cloud Stores/HDFS. This model helped Hudi to scale 
writers/readers to 1000s of cores on day one, compared  [...]
+
+This is going to be a rather long post, but we will do our best to make it 
worth your time. Let’s roll.
+
+## Data Lake Platform
+
+We have noticed that Hudi is sometimes positioned as a “[table 
format](https://cloud.google.com/blog/products/data-analytics/getting-started-with-new-table-formats-on-dataproc)”
 or “transactional layer”. While this is not incorrect, it does not do full 
justice to all that Hudi has to offer. 
+
+### Is Hudi a “format”?
+
+Hudi was not designed as a general purpose table format, tracking 
files/folders for batch processing. Rather, the functionality provided by a 
table format is merely one layer in the Hudi software stack. Hudi was designed 
to play well with the Hive format (if you will), given how popular and 
widespread it is. Over time, to solve scaling challenges or bring in additional 
functionality, we have invested in our own native table format with an eye 
towards the incremental processing vision. For e.g., w [...]
+
+### Is Hudi a transactional layer?
+
+Of course, Hudi had to provide transactions for implementing deletes/updates, 
but Hudi’s transactional layer is designed around an [event 
log](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying)
 that is also well-integrated with an entire set of built-in table/data 
services. For e.g., compaction is aware of clustering actions already scheduled 
and optimizes by skipping over the files being clustered - while the u [...]
+
+Thus, the best way to describe Apache Hudi is as a **Streaming Data Lake 
Platform** built around a *database kernel*. The words carry significant 
meaning.
+
+![/assets/images/blog/datalake-platform/Screen_Shot_2021-07-20_at_5.35.47_PM.png](/assets/images/blog/datalake-platform/Screen_Shot_2021-07-20_at_5.35.47_PM.png)
+
+**Streaming**: At its core, by optimizing for fast upserts & change streams, 
Hudi provides the primitives to data lake workloads that are comparable to what 
[Apache Kafka](https://kafka.apache.org/) does for event-streaming (namely, 
incremental produce/consume of events and a state-store for interactive 
querying).
+
+**Data Lake**: Nonetheless, Hudi provides an optimized, self-managing data 
plane for large scale data processing on the lake (adhoc queries, ML pipelines, 
batch pipelines), powering arguably the [largest transactional 
lake](https://eng.uber.com/apache-hudi-graduation/) in the world. While Hudi 
can be used to build a 
[lakehouse](https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html),
 given its transactional capabilities, Hudi goes beyond and unlocks an 
end-to-end streaming  [...]
+
+**Platform**: Oftentimes in open source, there is great tech, but there are 
just too many of them - all differing ever so slightly in their opinionated 
ways, ultimately making the integration task onerous on the end user. Lake 
users deserve the same great usability that cloud warehouses provide, with the 
additional freedom and transparency of a true open source community. Hudi’s 
data and table services, tightly integrated with the Hudi “kernel”, give us 
the ability to deliver cross layer [...]
+
+## Hudi Stack
+
+The following stack captures layers of software components that make up Hudi, 
with each layer depending on and drawing strength from the layer below. 
Typically, data lake users write data out once using an open file format like 
Apache [Parquet](http://parquet.apache.org/)/[ORC](https://orc.apache.org/) 
stored on top of extremely scalable cloud storage or distributed file systems. 
Hudi provides a self-managing data plane to ingest, transform and manage this 
data, in a way that unlocks inc [...]
+
+![Figure showing the Hudi 
stack](/assets/images/blog/datalake-platform/hudi-data-lake-platform_-_Copy_of_Page_1_3.png)
+
+Furthermore, Hudi either already provides or plans to add components that make 
this data universally accessible to all the different query engines out there. 
The features annotated with `*` represent work in progress and dotted boxes 
represent planned future work, to complete our vision for the project. 
+While we have strawman designs outlined for the newer components in the blog, 
we welcome with open arms fresh perspectives from the community.
+The rest of the blog will delve into each layer in our stack - explaining what it 
does, how it's designed for incremental processing and how it will evolve in 
the future.
+
+## Lake Storage
+
+Hudi interacts with lake storage using the [Hadoop FileSystem 
API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html),
 which makes it compatible with all of its implementations ranging from HDFS to 
Cloud Stores to even in-memory filesystems like 
[Alluxio](https://www.alluxio.io/blog/building-high-performance-data-lake-using-apache-hudi-and-alluxio-at-t3go/)/Ignite.
 Hudi internally implements its own [wrapper 
filesystem](https://github.com/apache/hudi/blob/9d2 [...]
+
+## File Format
+
+Hudi is designed around the notion of base file and delta log files that store 
updates/deltas to a given base file (called a file slice). Their formats are 
pluggable, with Parquet (columnar access) and HFile (indexed access) being the 
supported base file formats today. The delta logs encode data in 
[Avro](http://avro.apache.org/) (row oriented) format for speedier logging 
(just like Kafka topics for e.g). Going forward, we plan to [inline any base 
file format](https://github.com/apache/h [...]
+
+Zooming one level up, Hudi's unique file layout scheme encodes all changes to 
a given base file, as a sequence of blocks (data blocks, delete blocks, 
rollback blocks) that are merged in order to derive newer base files. In 
essence, this makes up a self-contained redo log that lets us implement 
interesting features on top. For e.g, most of today's data privacy enforcement 
happens by masking data read off the lake storage on-the-fly, invoking 
hashing/encryption algorithms over and over [...]
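The merge-on-read mechanics above can be sketched as a toy simulation; the block types and function names below are illustrative, not Hudi's actual classes:

```python
# Toy simulation of merging a Hudi-style file slice: a base file plus an
# ordered sequence of log blocks (data, delete, rollback). All names here
# are illustrative, not Hudi's actual classes.

def merge_file_slice(base, blocks):
    """Derive the newest snapshot of a file slice from its base file and log."""
    # Rollback blocks invalidate an earlier block; collect those targets first.
    rolled_back = {b["target"] for b in blocks if b["type"] == "rollback"}
    records = dict(base)  # record key -> payload, as stored in the base file
    for b in blocks:
        if b["type"] == "rollback" or b["id"] in rolled_back:
            continue
        if b["type"] == "data":        # newer values overwrite older ones
            records.update(b["records"])
        elif b["type"] == "delete":    # delete blocks tombstone keys
            for key in b["keys"]:
                records.pop(key, None)
    return records

base = {"k1": "v1", "k2": "v2"}
log = [
    {"id": 1, "type": "data", "records": {"k2": "v2b", "k3": "v3"}},
    {"id": 2, "type": "delete", "keys": ["k1"]},
    {"id": 3, "type": "data", "records": {"k1": "bad-write"}},
    {"id": 4, "type": "rollback", "target": 3},   # undo the failed block 3
]
assert merge_file_slice(base, log) == {"k2": "v2b", "k3": "v3"}
```

Folding the blocks in order is what makes the log self-contained: the same sequence always reproduces the same snapshot.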
+
+![Hudi base and delta 
logs](/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_2_1.png)
+
+## Table Format
+
+The term “table format” is new and still means many things to many people. 
Drawing an analogy to file formats, a table format simply consists of: the 
file layout of the table, the table’s schema and metadata tracking changes to 
the table. Hudi is not just a table format; it implements one internally. Hudi uses Avro 
schemas to store, manage and evolve a table’s schema. Currently, Hudi enforces 
schema-on-write, which although stricter than schema-on-read, is adopted 
[widely](https://docs.confluent [...]
+
+Hudi consciously lays out files within a table/partition into groups and 
maintains a mapping between an incoming record’s key to an existing file group. 
All updates are recorded into delta log files specific to a given file group 
and this design ensures low merge overhead compared to approaches like Hive 
ACID, which have to merge all delta records against all base files to satisfy 
queries. For e.g, with uuid keys (used very widely) all base files are very 
likely to overlap with all delta [...]
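A minimal sketch of this key-to-file-group routing, assuming a simple stable hash (Hudi's real indexing options, discussed below, are richer):

```python
import hashlib

# Illustrative sketch (not Hudi's real index machinery): route each record key
# to a stable file group, so all updates for a key land in that group's delta
# log and merges never have to consult other groups.

NUM_FILE_GROUPS = 4
delta_logs = {g: [] for g in range(NUM_FILE_GROUPS)}

def file_group_for(key: str) -> int:
    # Stable hash: the same key always maps to the same file group.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_FILE_GROUPS

def upsert(key, value):
    delta_logs[file_group_for(key)].append((key, value))

upsert("uuid-123", {"fare": 10})
upsert("uuid-123", {"fare": 12})   # the update lands in the same group's log
assert len(delta_logs[file_group_for("uuid-123")]) == 2
```

Because the mapping is sticky, merging a file group only needs that group's base file and delta log, which is the source of the low merge overhead described above.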
+
+![Shows the Hudi table format 
components](/assets/images/blog/datalake-platform/hudi-design-diagrams-table-format.png)
+
+The *timeline* is the source-of-truth event log for all Hudi’s table metadata, 
stored under the `.hoodie` folder, that provides an ordered log of all actions 
performed on the table. Events are retained on the timeline up to a configured 
interval of time/activity. Each file group is also designed as its own 
self-contained log, which means that even if an action that affected a file 
group is archived from the timeline, the right state of the records in each 
file group can be reconstructed [...]
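The active/archived split of the timeline can be illustrated with a toy event log (the retention policy and names below are simplified assumptions):

```python
from dataclasses import dataclass, field

# Toy timeline: an ordered log of actions stamped with instant times, with
# events past a retention threshold moved to an archive. Illustrative only;
# Hudi keeps this state under the `.hoodie` folder.

@dataclass
class Timeline:
    retention: int = 10   # number of instants kept on the active timeline
    active: list = field(default_factory=list)
    archived: list = field(default_factory=list)

    def record(self, instant: str, action: str):
        self.active.append((instant, action))      # instants arrive in order
        while len(self.active) > self.retention:   # archive the oldest events
            self.archived.append(self.active.pop(0))

tl = Timeline(retention=2)
for i, action in enumerate(["commit", "clean", "commit"]):
    tl.record(f"2021072623000{i}", action)
assert len(tl.active) == 2 and len(tl.archived) == 1
```

Archival is safe precisely because, as noted above, each file group doubles as its own self-contained log.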
+
+Lastly, new events on the timeline are then consumed and reflected onto an 
internal metadata table, implemented as another merge-on-read table offering 
low write amplification. Hudi is able to absorb quick/rapid changes to a table’s 
metadata, unlike table formats designed for slow-moving data. Additionally, the 
metadata table uses the 
[HFile](https://hbase.apache.org/2.0/devapidocs/org/apache/hadoop/hbase/io/hfile/HFile.html)
 base file format, which provides indexed lookups of keys avoidin [...]
+
+A key challenge faced by all the table formats out there today is the need 
for expiring snapshots/controlling retention for time travel queries such that 
it does not interfere with query planning/performance. In the future, we plan 
to build an indexed timeline in Hudi, which can span the entire history of the 
table, supporting a time travel look back window of several months/years.
+
+## Indexes
+
+Indexes help databases plan better queries that reduce the overall amount of 
I/O and deliver faster response times. Table metadata about file listings and 
column statistics are often enough for lake query engines to generate 
optimized, engine specific query plans quickly. This is however not sufficient 
for Hudi to realize fast upserts. Hudi already supports different key based 
indexing schemes to quickly map incoming record keys into the file group they 
reside in. For this purpose, Hudi [...]
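As one illustration of key-based pruning, here is a minimal bloom filter sketch; Hudi's actual bloom index machinery (stored in base file footers) is far more elaborate:

```python
import hashlib

# Minimal bloom filter sketch of key-based index pruning: before probing a
# file group for a record key, check its filter and skip groups that cannot
# contain the key. Illustrative only.

class BloomFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.bitset = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.bitset |= 1 << p

    def might_contain(self, key):
        # False means definitely absent; True means "probe the file group".
        return all(self.bitset >> p & 1 for p in self._positions(key))

bf = BloomFilter()
bf.add("uuid-1")
assert bf.might_contain("uuid-1")   # bloom filters have no false negatives
```

Skipping file groups whose filters return `False` is what keeps upsert-time index lookups far cheaper than scanning every base file.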
+
+![/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_5.png](/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_5.png)
+
+In the future, we intend to add additional forms of indexing as new partitions 
on the metadata table. Let’s briefly discuss the role each one has to play. 
Query engines typically rely on partitioning to cut down the number of files 
read for a given query. In database terms, a Hive partition is nothing but a 
coarse range index that maps a set of columns to a list of files. Table 
formats born in the cloud, like Iceberg/Delta Lake, have built-in tracking of 
column ranges per file in a sing [...]
+
+While Hudi already supports external indexes for random write workloads, we 
would like to support [point-lookup-ish 
queries](https://github.com/apache/hudi/pull/2487) right on top of lake 
storage, which helps avoid the overhead of an additional database for many 
classes of data applications. We also anticipate that uuid/key based joins will 
be sped up a lot by leveraging the record level indexing schemes we build out for 
fast upsert performance. We also plan to move our tracking of bloom f [...]
+
+## Concurrency Control
+
+Concurrency control defines how different writers/readers coordinate access to 
the table. Hudi ensures atomic writes, by way of publishing commits atomically 
to the timeline, stamped with an instant time that denotes the time at which 
the action is deemed to have occurred. Unlike general purpose file version 
control, Hudi draws clear distinction between writer processes (that issue 
user’s upserts/deletes), table services (that write data/metadata to 
optimize/perform bookkeeping) and read [...]
+
+Projects that solely rely on OCC deal with competing operations by either 
implementing a lock or relying on atomic renames. Such approaches are 
optimistic that real contention never happens and resort to failing one of the 
writer operations if conflicts occur, which can cause significant resource 
wastage or operational overhead. Imagine a scenario of two writer processes : 
an ingest writer job producing new data every 30 minutes and a deletion writer 
job that is enforcing GDPR taking 2  [...]
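The starvation scenario above boils down to a simple conflict check at commit time. A toy sketch of OCC, with illustrative instants and file group ids (not Hudi's implementation):

```python
# Toy sketch of optimistic concurrency at commit time: a writer's changed file
# groups are checked against commits that completed after it began; any overlap
# is a conflict. Instants and group ids are illustrative, not Hudi internals.

completed = []   # (instant, file_groups) pairs already published to the timeline

def try_commit(begin_instant, instant, file_groups):
    for other_instant, other_groups in completed:
        # Conflict: someone committed to the same file groups after we began.
        if other_instant > begin_instant and file_groups & other_groups:
            return False
    completed.append((instant, set(file_groups)))
    return True

assert try_commit(0, 1, {"fg-1", "fg-2"})       # ingest writer commits fine
assert not try_commit(0, 2, {"fg-2", "fg-3"})   # long-running deleter conflicts
assert try_commit(2, 3, {"fg-3"})               # succeeds after refresh/retry
```

The long-running deletion job in the example keeps paying the retry cost whenever the faster ingest writer beats it to the same file groups, which is the resource wastage the paragraph describes.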
+
+![Figure showing competing transactions leading to starvation with just 
OCC](/assets/images/blog/datalake-platform/Hudi_design_diagram_-_Page_2_1.png)
+
+We are hard at work improving our OCC based implementation around early 
detection of conflicts, so concurrent writers can terminate early without 
burning up CPU resources. We are also working on [adding fully log 
based](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+22+%3A+Snapshot+Isolation+using+Optimistic+Concurrency+Control+for+multi-writers#RFC22:SnapshotIsolationusingOptimisticConcurrencyControlformultiwriters-FutureWork(LockFree-ishConcurrencyControl)),
 non-blocking concu [...]
+
+## Writers
+
+Hudi tables can be used as sinks for Spark/Flink pipelines and the Hudi 
writing path provides several enhanced capabilities over file writing done by 
vanilla parquet/avro sinks. Hudi classifies write operations carefully into 
incremental (`insert`, `upsert`, `delete`) and batch/bulk operations 
(`insert_overwrite`, `insert_overwrite_table`, `delete_partition`, 
`bulk_insert`) and provides relevant functionality for each operation in a 
performant and cohesive way. Both upsert and delete ope [...]
+
+Keys are first class citizens inside Hudi and the pre-combining/index lookups 
done before upsert/deletes ensure a key is unique across partitions or within 
partitions, as desired. In contrast with other approaches where this is left to 
the data engineer to coordinate using `MERGE INTO` statements, this approach 
ensures quality data especially for critical use-cases. Hudi also ships with 
several [built-in key 
generators](http://hudi.apache.org/blog/hudi-key-generators/) that can parse 
all co [...]
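A hedged sketch of what a composite key generator does, loosely modeled on the built-in generators linked above (the field names are made up for illustration):

```python
# Hedged sketch of a composite key generator: build a record key and a
# partition path from configured field names. Loosely modeled on Hudi's
# built-in key generators; field names below are made up for illustration.

def generate_key(record, key_fields, partition_fields):
    record_key = ",".join(f"{f}:{record[f]}" for f in key_fields)
    partition_path = "/".join(str(record[f]) for f in partition_fields)
    return record_key, partition_path

row = {"uuid": "abc", "ts": "2021-07-26", "city": "sf"}
assert generate_key(row, ["uuid"], ["city"]) == ("uuid:abc", "sf")
# Composite keys and nested partition paths fall out of the same scheme:
assert generate_key(row, ["uuid", "ts"], ["city", "ts"]) == (
    "uuid:abc,ts:2021-07-26", "sf/2021-07-26")
```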
+
+Hudi writers add metadata to each record that codifies the commit time and a 
sequence number for each record within that commit (comparable to a Kafka 
offset), making it possible to derive record level change streams. Hudi also 
provides users the ability to specify event time fields in incoming data 
streams and track them in the timeline. Mapping these to stream processing 
concepts, Hudi contains both [arrival and event 
time](https://www.oreilly.com/radar/the-world-beyond-batch-streami [...]
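The role of these per-record meta fields can be sketched as follows; `_hoodie_commit_time` and `_hoodie_commit_seqno` mirror Hudi's real meta columns, while the rest of the simulation is illustrative:

```python
# Sketch of how per-record metadata enables change streams: every record is
# stamped with its commit instant and a sequence number within that commit.
# The `_hoodie_commit_time`/`_hoodie_commit_seqno` names mirror Hudi's real
# meta columns; the rest of this simulation is illustrative.

table = []

def write_commit(instant, rows):
    for seq, row in enumerate(rows):
        table.append({**row,
                      "_hoodie_commit_time": instant,
                      "_hoodie_commit_seqno": f"{instant}_{seq}"})

def changes_since(instant):
    """Record-level change stream: everything committed after `instant`."""
    return [r for r in table if r["_hoodie_commit_time"] > instant]

write_commit("20210726230001", [{"key": "a", "fare": 10}])
write_commit("20210726230107", [{"key": "a", "fare": 12}, {"key": "b", "fare": 7}])
assert len(changes_since("20210726230001")) == 2
```

Because instants sort lexicographically by time, a single comparison against a consumer's last-seen instant yields the record-level changes, much like resuming from a Kafka offset.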
+
+## Readers
+
+Hudi provides snapshot isolation between writers and readers and allows for 
any table snapshot to be queried consistently from all major lake query engines 
(Spark, Hive, Flink, Presto, Trino, Impala) and even cloud warehouses like 
Redshift. In fact, we would love to bring Hudi tables as external tables with 
BigQuery/Snowflake as well, once they also embrace the lake table formats more 
natively. Our design philosophy around query performance has been to make Hudi 
as lightweight as possibl [...]
+
+![Log merging done for incremental 
queries](/assets/images/blog/datalake-platform/hudi-design-diagram_-incr-read.png)
+
+True to its design goals, Hudi provides some very powerful incremental 
querying capabilities that tie together the meta fields added during writing 
and the file group based storage layout. While table formats that merely track 
files are only able to provide information about files that changed during 
each snapshot or commit, Hudi generates the exact set of records that changed 
after a given point in the timeline, thanks to tracking of record level event 
and arrival times. Furthermore, this de [...]
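A record-level incremental pull can be simulated over rows carrying a commit-time meta field (a toy model, not Hudi's actual query path):

```python
# Toy model of an incremental query: given begin/end instants, return the
# latest state of exactly the records that changed in that window (one row
# per key), rather than whole files. Illustrative only.

def incremental_read(table, begin, end):
    latest = {}
    for row in table:              # rows carry a commit-time meta field
        t = row["_hoodie_commit_time"]
        if begin < t <= end:
            key = row["key"]
            if key not in latest or t > latest[key]["_hoodie_commit_time"]:
                latest[key] = row  # keep only the newest version per key
    return list(latest.values())

rows = [
    {"key": "a", "fare": 10, "_hoodie_commit_time": "001"},
    {"key": "a", "fare": 12, "_hoodie_commit_time": "002"},
    {"key": "b", "fare": 7,  "_hoodie_commit_time": "003"},
]
out = incremental_read(rows, begin="001", end="003")
assert sorted(r["fare"] for r in out) == [7, 12]
```

Note how the first version of `a` is excluded (it predates `begin`) while its later update is returned exactly once, which is the record-level precision a file-tracking format cannot offer.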
+
+## Table Services
+
+What defines and sustains a project’s value over years are its fundamental 
design principles and the subtle trade offs. Databases often consist of several 
internal components, working in tandem to deliver efficiency, performance and 
great operability to its users. True to its intent to act as a state store for 
incremental data pipelines, we designed Hudi with built-in table services and 
self-managing runtime that can orchestrate/trigger these services to optimize 
everything internally. In fact [...]
+
+![/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_4.png](/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_4.png)
+
+There are several built-in table services, all with the goal of ensuring 
performant table storage layout and metadata management, which are 
automatically invoked either synchronously after each write operation, or 
asynchronously as a separate background job. Furthermore, Spark (and Flink) 
streaming writers can run in continuous mode and invoke table services 
asynchronously, sharing the underlying executors intelligently with writers. 
The archival service ensures that the timeline holds suffi [...]
+
+We are always looking for ways to improve and enhance our table services in 
meaningful ways. In the coming releases, we are working towards a much more 
[scalable model](https://github.com/apache/hudi/pull/3233) of cleaning up 
partial writes, by consolidating marker file creation using our timeline 
metaserver, which avoids expensive full table scans to seek out and remove 
uncommitted files. We also have [various 
proposals](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=1 
[...]
+
+## Data Services
+
+As noted at the start, we wanted to make Hudi immediately usable for common 
end-to-end use-cases, and thus invested deeply into a set of data services that 
provide data/workload specific functionality, sitting on top of the table 
services and writers/readers directly. Foremost in that list is the Hudi 
DeltaStreamer utility, which has been an extremely popular choice for 
painlessly building a data lake out of Kafka streams and files landing in 
different formats on top of lake storage. [...]
+
+![/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_8.png](/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_8.png)
+
+Going forward, we would love contributions to enhance our [multi delta 
streamer 
utility](http://hudi.apache.org/blog/ingest-multiple-tables-using-hudi/), which 
can ingest entire Kafka clusters in a single large Spark application, bringing 
it on par and hardening it. To further our progress towards end-to-end complex incremental 
pipelines, we plan to work towards enhancing the delta streamer utility and its 
SQL transformers to be triggered by multiple source streams (as opposed to just 
the one today)  [...]
+
+## Timeline Metaserver
+
+Storing and serving table metadata right on the lake storage is scalable, but 
can be much less performant compared to RPCs against a scalable meta server. 
Most cloud warehouses internally are built on a metadata layer that leverages 
an external database (e.g. [Snowflake uses 
foundationDB](https://www.snowflake.com/blog/how-foundationdb-powers-snowflake-metadata-forward/)).
 Hudi also provides a metadata server, called the “Timeline server”, which 
offers an alternative backing store for Hud [...]
+
+![/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_6.png](/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_6.png)
+
+## Lake Cache
+
+There is a fundamental tradeoff today in data lakes between faster writing and 
great query performance. Faster writing typically involves writing smaller 
files (and later clustering them) or logging deltas (and later merging on 
read). While this provides good performance already, the pursuit of great query 
performance often warrants opening fewer files/objects on lake storage and 
possibly pre-materializing the merges between base and delta logs. 
After all, most databases employ a [...]
+
+![/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_7.png](/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_7.png)
+
+## Onwards
+
+We hope that this blog painted a complete picture of Apache Hudi, staying true 
to its founding principles. Interested users and readers can expect blogs 
delving into each layer of the stack and an overhaul of our docs along these 
lines in the coming weeks/months. We view the current efforts around table 
formats as merely removing decade-old bottlenecks in data lake storage/query 
planes, problems that have already been solved very well in cloud warehouses 
like Big Query/Snowflake. We wou [...]
\ No newline at end of file
diff --git a/docs/_sass/hudi_style/_variables.scss 
b/docs/_sass/hudi_style/_variables.scss
index 5194fad..eae3926 100644
--- a/docs/_sass/hudi_style/_variables.scss
+++ b/docs/_sass/hudi_style/_variables.scss
@@ -14,8 +14,7 @@ $indent-var: 1.3em !default;
 
 /* system typefaces */
 $serif: Georgia, Times, serif !default;
-$sans-serif: -apple-system, BlinkMacSystemFont, "Roboto", "Segoe UI",
-  "Helvetica Neue", "Lucida Grande", Arial, sans-serif !default;
+$sans-serif: "Open Sans","Helvetica Neue",Helvetica, Spectral, sans-serif 
!default;
 $monospace: Monaco, Consolas, "Lucida Console", monospace !default;
 
 /* sans serif typefaces */
@@ -135,7 +134,7 @@ $small: 600px !default;
 $medium: 768px !default;
 $medium-wide: 900px !default;
 $large: 1024px !default;
-$x-large: 1280px !default;
+$x-large: 1920px !default;
 $max-width: $x-large !default;
 
 /*
diff --git 
a/docs/assets/images/blog/datalake-platform/Hudi_design_diagram_-_Page_2_1.png 
b/docs/assets/images/blog/datalake-platform/Hudi_design_diagram_-_Page_2_1.png
new file mode 100644
index 0000000..9d4a923
Binary files /dev/null and 
b/docs/assets/images/blog/datalake-platform/Hudi_design_diagram_-_Page_2_1.png 
differ
diff --git 
a/docs/assets/images/blog/datalake-platform/Screen_Shot_2021-07-20_at_5.35.47_PM.png
 
b/docs/assets/images/blog/datalake-platform/Screen_Shot_2021-07-20_at_5.35.47_PM.png
new file mode 100644
index 0000000..69272ca
Binary files /dev/null and 
b/docs/assets/images/blog/datalake-platform/Screen_Shot_2021-07-20_at_5.35.47_PM.png
 differ
diff --git a/docs/assets/images/blog/datalake-platform/hudi-comic.png 
b/docs/assets/images/blog/datalake-platform/hudi-comic.png
new file mode 100644
index 0000000..7b5521f
Binary files /dev/null and 
b/docs/assets/images/blog/datalake-platform/hudi-comic.png differ
diff --git 
a/docs/assets/images/blog/datalake-platform/hudi-data-lake-platform.png 
b/docs/assets/images/blog/datalake-platform/hudi-data-lake-platform.png
new file mode 100644
index 0000000..b279316
Binary files /dev/null and 
b/docs/assets/images/blog/datalake-platform/hudi-data-lake-platform.png differ
diff --git 
a/docs/assets/images/blog/datalake-platform/hudi-data-lake-platform_-_Copy_of_Page_1_3.png
 
b/docs/assets/images/blog/datalake-platform/hudi-data-lake-platform_-_Copy_of_Page_1_3.png
new file mode 100644
index 0000000..a291ecd
Binary files /dev/null and 
b/docs/assets/images/blog/datalake-platform/hudi-data-lake-platform_-_Copy_of_Page_1_3.png
 differ
diff --git 
a/docs/assets/images/blog/datalake-platform/hudi-data-lake-platform_-_Page_2_4.png
 
b/docs/assets/images/blog/datalake-platform/hudi-data-lake-platform_-_Page_2_4.png
new file mode 100644
index 0000000..8cd3e7d
Binary files /dev/null and 
b/docs/assets/images/blog/datalake-platform/hudi-data-lake-platform_-_Page_2_4.png
 differ
diff --git 
a/docs/assets/images/blog/datalake-platform/hudi-design-diagram_-incr-read.png 
b/docs/assets/images/blog/datalake-platform/hudi-design-diagram_-incr-read.png
new file mode 100644
index 0000000..87ea844
Binary files /dev/null and 
b/docs/assets/images/blog/datalake-platform/hudi-design-diagram_-incr-read.png 
differ
diff --git 
a/docs/assets/images/blog/datalake-platform/hudi-design-diagrams-table-format.png
 
b/docs/assets/images/blog/datalake-platform/hudi-design-diagrams-table-format.png
new file mode 100644
index 0000000..e00bcec
Binary files /dev/null and 
b/docs/assets/images/blog/datalake-platform/hudi-design-diagrams-table-format.png
 differ
diff --git 
a/docs/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_2_1.png 
b/docs/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_2_1.png
new file mode 100644
index 0000000..a684c5e
Binary files /dev/null and 
b/docs/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_2_1.png 
differ
diff --git 
a/docs/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_4.png 
b/docs/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_4.png
new file mode 100644
index 0000000..8747577
Binary files /dev/null and 
b/docs/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_4.png 
differ
diff --git 
a/docs/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_5.png 
b/docs/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_5.png
new file mode 100644
index 0000000..6dbd70e
Binary files /dev/null and 
b/docs/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_5.png 
differ
diff --git 
a/docs/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_6.png 
b/docs/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_6.png
new file mode 100644
index 0000000..e1decf0
Binary files /dev/null and 
b/docs/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_6.png 
differ
diff --git 
a/docs/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_7.png 
b/docs/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_7.png
new file mode 100644
index 0000000..a0f8c25
Binary files /dev/null and 
b/docs/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_7.png 
differ
diff --git 
a/docs/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_8.png 
b/docs/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_8.png
new file mode 100644
index 0000000..8b1e55e
Binary files /dev/null and 
b/docs/assets/images/blog/datalake-platform/hudi-design-diagrams_-_Page_8.png 
differ
