This is an automated email from the ASF dual-hosted git repository.
shirshanka pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-gobblin.git
The following commit(s) were added to refs/heads/master by this push:
new f6e0632 [GOBBLIN-1371] Improve README to reflect current capabilities
f6e0632 is described below
commit f6e063245df709c6981226a98129087cef0f9d4f
Author: Shirshanka Das <[email protected]>
AuthorDate: Thu Jan 28 10:03:28 2021 -0800
[GOBBLIN-1371] Improve README to reflect current capabilities
Closes #3212 from shirshanka/better-readme
---
README.md | 28 +++++++++++++++++++++++++---
1 file changed, 25 insertions(+), 3 deletions(-)
diff --git a/README.md b/README.md
index 5d608d2..f985273 100644
--- a/README.md
+++ b/README.md
@@ -6,11 +6,33 @@
[](https://communityinviter.com/apps/apache-gobblin/apache-gobblin)
[](https://codecov.io/github/apache/incubator-gobblin)
-Apache Gobblin is a universal data ingestion framework for extracting, transforming, and loading large volume of data from a variety of data sources: databases, rest APIs, FTP/SFTP servers, filers, etc., onto Hadoop.
+Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.
-Apache Gobblin handles the common routine tasks required for all data ingestion ETLs, including job/task scheduling, task partitioning, error handling, state management, data quality checking, data publishing, etc.
+### Capabilities
+- Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is optimized and designed for ELT patterns with inline transformations on ingest (small t).
+- Data Organization within the lake (e.g. compaction, partitioning, deduplication)
+- Lifecycle Management of data within the lake (e.g. data retention)
+- Compliance Management of data across the ecosystem (e.g. fine-grain data deletions)
+
+### Highlights
+- Battle tested at scale: Runs in production at petabyte-scale at companies like LinkedIn, PayPal, Verizon etc.
+- Feature rich: Supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance etc.
+- Supports stream and batch execution modes
+- Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.
+
+### Common Patterns used in production
+- Stream / Batch ingestion of Kafka to Data Lake (HDFS, S3, ADLS)
+- Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
+- Support for data sync across Federated Data Lake (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
+- Integrate external vendor API-s (e.g. Salesforce, Dynamics etc.) with data store (HDFS, Couchbase etc)
+- Enforcing Data retention policies and GDPR deletion on HDFS / ADLS
+
+
+### Apache Gobblin is NOT
+- A general purpose data transformation engine like Spark or Flink. Gobblin can delegate complex-data processing tasks to Spark, Hive etc.
+- A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
+- A general-purpose workflow execution system like Airflow, Azkaban, Dagster, Luigi.
-Gobblin ingests data from different data sources in the same execution framework, and manages metadata of different sources all in one place. This, combined with other features such as auto scalability, fault tolerance, data quality assurance, extensibility, and the ability of handling data model evolution, makes Gobblin an easy-to-use, self-serving, and efficient data ingestion framework.
# Requirements
* Java >= 1.8