This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/master by this push:
new aa0f57c6036 [MINOR] Change README (#12348)
aa0f57c6036 is described below
commit aa0f57c60361a7c03ec99da56c2264452f89927d
Author: vinoth chandar <[email protected]>
AuthorDate: Wed Nov 27 09:12:35 2024 -0800
[MINOR] Change README (#12348)
- Better, accurate description for the project based on current state.
- break down features by areas.
---
README.md | 69 ++++++++++++++++++++++++++++++++++++++++++++---------------
doap_HUDI.rdf | 4 ++--
2 files changed, 54 insertions(+), 19 deletions(-)
diff --git a/README.md b/README.md
index e79a294c034..fecbed44b0b 100644
--- a/README.md
+++ b/README.md
@@ -18,8 +18,8 @@
# Apache Hudi
-Apache Hudi (pronounced Hoodie) stands for `Hadoop Upserts Deletes and Incrementals`. Hudi manages the storage of large
-analytical datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage).
+Apache Hudi is an open data lakehouse platform, built on a high-performance open table format
+to ingest, index, store, serve, transform and manage your data across multiple cloud data environments.
<img src="https://hudi.apache.org/assets/images/hudi-logo-medium.png" alt="Hudi logo" height="80px" align="right" />
@@ -34,23 +34,58 @@ analytical datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatib
[](https://twitter.com/apachehudi)
[](https://www.linkedin.com/company/apache-hudi/?viewAsMember=true)
+## Features
+Hudi stores all data and metadata on cloud storage in open formats, providing the following features across different aspects.
-## Features
+### Ingestion
+
+* Built-in ingestion tools for Apache Spark/Apache Flink users.
+* Supports a half-dozen file formats, database change logs and streaming data systems.
+* Kafka Connect sink for Apache Kafka, to bring in external data sources.
+
+### Storage
+
+* Optimized storage format, supporting row & columnar data.
+* Timeline metadata to track history of changes.
+* Automatically manages file sizes and layout using statistics.
+* Savepoints for data versioning and recovery.
+* Schema tracking and evolution.
+
+### Indexing
+
+* Scalable indexing subsystem to speed up snapshot queries, maintained automatically by writes.
+* Tracks file listings, column-level and partition-level statistics to help plan queries efficiently.
+* Record-level indexing mechanisms built on row-oriented file formats and bloom filters.
+* Logical partitioning on tables, using expression indexes to decouple from physical partitioning on storage.
+
+### Writing
+
+* Atomically commit data with rollback/restore support.
+* Fast upsert/delete support leveraging record-level indexes.
+* Snapshot isolation between writer & queries.
+* Optimistic concurrency control to implement a relational data model, with Read-Modify-Write style consistent writes.
+* Non-blocking concurrency control to implement a streaming data model, with support for out-of-order, late data handling.
+
+### Queries
+
+Hudi supports different types of queries, on top of a single table.
+
+* **Snapshot Query** - Provides a view of the table, as of the latest committed state, accelerated with indexes as applicable.
+* **Incremental Query** - Provides the latest values of records inserted/updated since a given point in time of the table. Can be used to "diff" table states between two points in time.
+* **Change-Data-Capture Query** - Provides a change stream with records inserted, updated or deleted since a point in time or between two points in time. Provides both before and after images for each change record.
+* **Time-Travel Query** - Provides a view of the table, as of a given point in time.
+* **Read Optimized Query** - Provides excellent snapshot query performance via purely columnar storage (e.g. [Parquet](https://parquet.apache.org/)), when used with a compaction policy to provide a transaction boundary.
+
+### Table Management
-* Upsert support with fast, pluggable indexing
-* Atomically publish data with rollback support
-* Snapshot isolation between writer & queries
-* Savepoints for data recovery
-* Manages file sizes, layout using statistics
-* Async compaction of row & columnar data
-* Timeline metadata to track lineage
-* Optimize data lake layout with clustering
-
-Hudi supports three types of queries:
- * **Snapshot Query** - Provides snapshot queries on real-time data, using a combination of columnar & row-based storage (e.g [Parquet](https://parquet.apache.org/) + [Avro](https://avro.apache.org/docs/)).
- * **Incremental Query** - Provides a change stream with records inserted or updated after a point in time.
- * **Read Optimized Query** - Provides excellent snapshot query performance via purely columnar storage (e.g. [Parquet](https://parquet.apache.org/)).
+* Automatic, hands-free table services runtime, integrated into Spark/Flink writers or operated independently.
+* Configurable scheduling strategies with built-in failure handling, for all table services.
+* Cleaning of older versions and time-to-live management to expire older data and reclaim storage space.
+* Clustering and space-filling curve algorithms to optimize data layout, with pluggable scheduling strategies.
+* Asynchronous compaction of row-oriented data into columnar formats, for efficient streaming writers.
+* Consistent index building in the face of ongoing queries or writers.
+* Catalog sync with Apache Hive Metastore, AWS Glue, Google BigQuery, Apache XTable and more.
Learn more about Hudi at [https://hudi.apache.org](https://hudi.apache.org)
@@ -59,7 +94,7 @@ Learn more about Hudi at [https://hudi.apache.org](https://hudi.apache.org)
Prerequisites for building Apache Hudi:
* Unix-like system (like Linux, Mac OS X)
-* Java 8 (Java 9 or 10 may work)
+* Java 8 (Java 9 or 11 may work)
* Git
* Maven (>=3.3.1)
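As an aside on the query types the new README section describes, the relationship between snapshot, time-travel and incremental reads over a commit timeline can be illustrated with a minimal, hypothetical sketch. This is plain Python with no Hudi dependency; the `Commit` class and function names are illustrative only, not Hudi APIs.

```python
from dataclasses import dataclass, field

@dataclass
class Commit:
    ts: int                                        # commit time on the timeline
    upserts: dict = field(default_factory=dict)    # record key -> new value
    deletes: set = field(default_factory=set)      # record keys deleted

def snapshot(timeline, as_of=None):
    """Replay the timeline (optionally only up to `as_of`) into the latest state."""
    state = {}
    for c in timeline:
        if as_of is not None and c.ts > as_of:
            break                                  # time travel: ignore later commits
        state.update(c.upserts)
        for k in c.deletes:
            state.pop(k, None)
    return state

def incremental(timeline, since):
    """Latest values of records inserted/updated after commit time `since`."""
    changed = {}
    for c in timeline:
        if c.ts > since:
            changed.update(c.upserts)
    return changed

timeline = [
    Commit(1, upserts={"a": 1, "b": 2}),
    Commit(2, upserts={"b": 3}),
    Commit(3, upserts={"c": 4}, deletes={"a"}),
]

print(snapshot(timeline))              # {'b': 3, 'c': 4}
print(snapshot(timeline, as_of=2))     # {'a': 1, 'b': 3}
print(incremental(timeline, since=1))  # {'b': 3, 'c': 4}
```

A real Hudi table implements the same semantics with indexed file groups and timeline metadata rather than a full replay, which is what makes the incremental "diff" between two points in time cheap.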
diff --git a/doap_HUDI.rdf b/doap_HUDI.rdf
index 1f2b45a4899..c6bf37ad003 100644
--- a/doap_HUDI.rdf
+++ b/doap_HUDI.rdf
@@ -27,8 +27,8 @@
<name>Apache Hudi</name>
<homepage rdf:resource="https://hudi.apache.org" />
<asfext:pmc rdf:resource="https://hudi.apache.org" />
- <shortdesc>Ingests and Manages storage of large analytical datasets</shortdesc>
- <description>Hudi (pronounced “Hoodie”) brings stream processing to big data, providing upserts, deletes and incremental data streams.</description>
+ <shortdesc>high-performance open data lakehouse platform</shortdesc>
+ <description>Hudi brings transactions, stream processing, indexes, mutability and incremental processing to data lakes.</description>
<bug-database rdf:resource="https://issues.apache.org/jira/browse/HUDI" />
<mailing-list rdf:resource="https://hudi.apache.org/community.html" />
<download-page rdf:resource="https://hudi.apache.org/community.html" />
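The "fast upsert/delete support leveraging record-level indexes" feature in the README above boils down to one idea: a record-level index maps each record key to the file that holds it, so a write only touches the affected files. A toy sketch, again plain Python with made-up names (`files`, `record_index`, `upsert`), not Hudi internals:

```python
from collections import defaultdict

files = defaultdict(dict)   # file_id -> {record_key: value}
record_index = {}           # record_key -> file_id (the record-level index)

def upsert(batch, new_file_id):
    """Apply a batch of writes, returning only the files that need rewriting."""
    touched = set()
    for key, value in batch.items():
        file_id = record_index.get(key)
        if file_id is None:             # insert: route the key to the new file
            file_id = new_file_id
            record_index[key] = file_id
        files[file_id][key] = value     # update the record in its existing file
        touched.add(file_id)
    return touched

upsert({"r1": "a", "r2": "b"}, "f0")
touched = upsert({"r2": "B", "r3": "c"}, "f1")
print(sorted(touched))   # ['f0', 'f1']: r2 stays in f0, only r3 lands in f1
```

Without the index, every upsert would have to scan all files to find existing copies of each key; with it, the write path rewrites only the file groups it actually hits.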