This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/master by this push:
new aa0f57c6036 [MINOR] Change README (#12348)
aa0f57c6036 is described below
commit aa0f57c60361a7c03ec99da56c2264452f89927d
Author: vinoth chandar <[email protected]>
AuthorDate: Wed Nov 27 09:12:35 2024 -0800
[MINOR] Change README (#12348)
- Better, accurate description for the project based on current state.
- break down features by areas.
---
README.md | 69 ++++++++++++++++++++++++++++++++++++++++++++---------------
doap_HUDI.rdf | 4 ++--
2 files changed, 54 insertions(+), 19 deletions(-)
diff --git a/README.md b/README.md
index e79a294c034..fecbed44b0b 100644
--- a/README.md
+++ b/README.md
@@ -18,8 +18,8 @@
# Apache Hudi
-Apache Hudi (pronounced Hoodie) stands for `Hadoop Upserts Deletes and Incrementals`. Hudi manages the storage of large
-analytical datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage).
+Apache Hudi is an open data lakehouse platform, built on a high-performance open table format
+to ingest, index, store, serve, transform and manage your data across multiple cloud data environments.
<img src="https://hudi.apache.org/assets/images/hudi-logo-medium.png" alt="Hudi logo" height="80px" align="right" />
@@ -34,23 +34,58 @@ analytical datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatib
[](https://twitter.com/apachehudi)
[](https://www.linkedin.com/company/apache-hudi/?viewAsMember=true)
+## Features
+Hudi stores all data and metadata on cloud storage in open formats, providing the following features across different aspects.
-## Features
+### Ingestion
+
+* Built-in ingestion tools for Apache Spark/Apache Flink users.
+* Supports a half-dozen file formats, database change logs and streaming data systems.
+* Kafka Connect sink for Apache Kafka, to bring in external data sources.
+
+### Storage
+
+* Optimized storage format, supporting row & columnar data.
+* Timeline metadata to track history of changes.
+* Automatically manages file sizes and layout using statistics.
+* Savepoints for data versioning and recovery.
+* Schema tracking and evolution.
+
+### Indexing
+
+* Scalable indexing subsystem to speed up snapshot queries, maintained automatically by writes.
+* Tracks file listings, column-level and partition-level statistics to help plan queries efficiently.
+* Record-level indexing mechanisms built on row-oriented file formats and bloom filters.
+* Logical partitioning on tables, using expression indexes to decouple from physical partitioning on storage.
+
+### Writing
+
+* Atomically commit data with rollback/restore support.
+* Fast upsert/delete support leveraging record-level indexes.
+* Snapshot isolation between writer & queries.
+* Optimistic concurrency control to implement a relational data model, with Read-Modify-Write style consistent writes.
+* Non-blocking concurrency control to implement a streaming data model, with support for out-of-order, late data handling.
+
+### Queries
+
+Hudi supports different types of queries, on top of a single table.
+
+* **Snapshot Query** - Provides a view of the table, as of the latest committed state, accelerated with indexes as applicable.
+* **Incremental Query** - Provides the latest values of records inserted/updated since a given point in time of the table. Can be used to "diff" table states between two points in time.
+* **Change-Data-Capture Query** - Provides a change stream with records inserted, updated or deleted since a point in time or between two points in time. Provides both before and after images for each change record.
+* **Time-Travel Query** - Provides a view of the table, as of a given point in time.
+* **Read Optimized Query** - Provides excellent snapshot query performance via purely columnar storage (e.g. [Parquet](https://parquet.apache.org/)), when used with a compaction policy to provide a transaction boundary.
+
+### Table Management
-* Upsert support with fast, pluggable indexing
-* Atomically publish data with rollback support
-* Snapshot isolation between writer & queries
-* Savepoints for data recovery
-* Manages file sizes, layout using statistics
-* Async compaction of row & columnar data
-* Timeline metadata to track lineage
-* Optimize data lake layout with clustering
-
-Hudi supports three types of queries:
- * **Snapshot Query** - Provides snapshot queries on real-time data, using a combination of columnar & row-based storage (e.g [Parquet](https://parquet.apache.org/) + [Avro](https://avro.apache.org/docs/)).
- * **Incremental Query** - Provides a change stream with records inserted or updated after a point in time.
- * **Read Optimized Query** - Provides excellent snapshot query performance via purely columnar storage (e.g. [Parquet](https://parquet.apache.org/)).
+* Automatic, hands-free table services runtime, integrated into Spark/Flink writers or operated independently.
+* Configurable scheduling strategies with built-in failure handling, for all table services.
+* Cleaning of older versions and time-to-live management to expire older data and reclaim storage space.
+* Clustering and space-filling curve algorithms to optimize data layout, with pluggable scheduling strategies.
+* Asynchronous compaction of row-oriented data into columnar formats, for efficient streaming writers.
+* Consistent index building in the face of ongoing queries or writers.
+* Catalog sync with Apache Hive Metastore, AWS Glue, Google BigQuery, Apache XTable and more.
Learn more about Hudi at [https://hudi.apache.org](https://hudi.apache.org)
@@ -59,7 +94,7 @@ Learn more about Hudi at [https://hudi.apache.org](https://hudi.apache.org)
Prerequisites for building Apache Hudi:
* Unix-like system (like Linux, Mac OS X)
-* Java 8 (Java 9 or 10 may work)
+* Java 8 (Java 9 or 11 may work)
* Git
* Maven (>=3.3.1)
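As an aside on the query types the new README section describes, the relationship between snapshot, time-travel and incremental reads over a commit timeline can be illustrated with a minimal, hypothetical sketch. This is plain Python with no Hudi dependency; the `Commit` class and function names are illustrative only, not Hudi APIs.

```python
from dataclasses import dataclass, field

@dataclass
class Commit:
    ts: int                                        # commit time on the timeline
    upserts: dict = field(default_factory=dict)    # record key -> new value
    deletes: set = field(default_factory=set)      # record keys deleted

def snapshot(timeline, as_of=None):
    """Replay the timeline (optionally only up to `as_of`) into the latest state."""
    state = {}
    for c in timeline:
        if as_of is not None and c.ts > as_of:
            break                                  # time travel: ignore later commits
        state.update(c.upserts)
        for k in c.deletes:
            state.pop(k, None)
    return state

def incremental(timeline, since):
    """Latest values of records inserted/updated after commit time `since`."""
    changed = {}
    for c in timeline:
        if c.ts > since:
            changed.update(c.upserts)
    return changed

timeline = [
    Commit(1, upserts={"a": 1, "b": 2}),
    Commit(2, upserts={"b": 3}),
    Commit(3, upserts={"c": 4}, deletes={"a"}),
]

print(snapshot(timeline))              # {'b': 3, 'c': 4}
print(snapshot(timeline, as_of=2))     # {'a': 1, 'b': 3}
print(incremental(timeline, since=1))  # {'b': 3, 'c': 4}
```

A real Hudi table implements the same semantics with indexed file groups and timeline metadata rather than a full replay, which is what makes the incremental "diff" between two points in time cheap.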
diff --git a/doap_HUDI.rdf b/doap_HUDI.rdf
index 1f2b45a4899..c6bf37ad003 100644
--- a/doap_HUDI.rdf
+++ b/doap_HUDI.rdf
@@ -27,8 +27,8 @@
<name>Apache Hudi</name>
<homepage rdf:resource="https://hudi.apache.org" />
<asfext:pmc rdf:resource="https://hudi.apache.org" />
- <shortdesc>Ingests and Manages storage of large analytical datasets</shortdesc>
- <description>Hudi (pronounced “Hoodie”) brings stream processing to big data, providing upserts, deletes and incremental data streams.</description>
+ <shortdesc>high-performance open data lakehouse platform</shortdesc>
+ <description>Hudi brings transactions, stream processing, indexes, mutability and incremental processing to data lakes.</description>
<bug-database rdf:resource="https://issues.apache.org/jira/browse/HUDI" />
<mailing-list rdf:resource="https://hudi.apache.org/community.html" />
<download-page rdf:resource="https://hudi.apache.org/community.html" />
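The "fast upsert/delete support leveraging record-level indexes" feature in the README above boils down to one idea: a record-level index maps each record key to the file that holds it, so a write only touches the affected files. A toy sketch, again plain Python with made-up names (`files`, `record_index`, `upsert`), not Hudi internals:

```python
from collections import defaultdict

files = defaultdict(dict)   # file_id -> {record_key: value}
record_index = {}           # record_key -> file_id (the record-level index)

def upsert(batch, new_file_id):
    """Apply a batch of writes, returning only the files that need rewriting."""
    touched = set()
    for key, value in batch.items():
        file_id = record_index.get(key)
        if file_id is None:             # insert: route the key to the new file
            file_id = new_file_id
            record_index[key] = file_id
        files[file_id][key] = value     # update the record in its existing file
        touched.add(file_id)
    return touched

upsert({"r1": "a", "r2": "b"}, "f0")
touched = upsert({"r2": "B", "r3": "c"}, "f1")
print(sorted(touched))   # ['f0', 'f1']: r2 stays in f0, only r3 lands in f1
```

Without the index, every upsert would have to scan all files to find existing copies of each key; with it, the write path rewrites only the file groups it actually hits.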