[GitHub] [incubator-gobblin] sv2000 commented on a change in pull request #3212: [GOBBLIN-1371] Improve README to reflect current capabilities

GitBox Tue, 26 Jan 2021 09:53:07 -0800


sv2000 commented on a change in pull request #3212:
URL: https://github.com/apache/incubator-gobblin/pull/3212#discussion_r564710086




##########
File path: README.md
##########
@@ -6,11 +6,32 @@
 [![Join us on Slack](https://img.shields.io/badge/slack-apache--gobblin-brightgreen.svg)](https://communityinviter.com/apps/apache-gobblin/apache-gobblin)
 [![codecov.io](https://codecov.io/github/apache/incubator-gobblin/branch/master/graph/badge.svg)](https://codecov.io/github/apache/incubator-gobblin)
 
-Apache Gobblin is a universal data ingestion framework for extracting, transforming, and loading large volume of data from a variety of data sources: databases, rest APIs, FTP/SFTP servers, filers, etc., onto Hadoop.
+Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.
 
-Apache Gobblin handles the common routine tasks required for all data ingestion ETLs, including job/task scheduling, task partitioning, error handling, state management, data quality checking, data publishing, etc.
+It offers the following capabilities:
+- Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is optimized and designed for ELT patterns with inline transformations on ingest (small t).

Review comment:
       Do you think we can move this sentence to the top paragraph, after "Apache Gobblin is a highly scalable..."?

##########
File path: README.md
##########
@@ -6,11 +6,32 @@
 [![Join us on Slack](https://img.shields.io/badge/slack-apache--gobblin-brightgreen.svg)](https://communityinviter.com/apps/apache-gobblin/apache-gobblin)
 [![codecov.io](https://codecov.io/github/apache/incubator-gobblin/branch/master/graph/badge.svg)](https://codecov.io/github/apache/incubator-gobblin)
 
-Apache Gobblin is a universal data ingestion framework for extracting, transforming, and loading large volume of data from a variety of data sources: databases, rest APIs, FTP/SFTP servers, filers, etc., onto Hadoop.
+Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.
 
-Apache Gobblin handles the common routine tasks required for all data ingestion ETLs, including job/task scheduling, task partitioning, error handling, state management, data quality checking, data publishing, etc.
+It offers the following capabilities:
+- Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is optimized and designed for ELT patterns with inline transformations on ingest (small t).
+- Data Organization within the lake (e.g. compaction, partitioning, deduplication)
+- Lifecycle Management of data within the lake (e.g. data retention)
+- Compliance Management of data across the ecosystem (e.g. fine-grain data deletions)
+
+Common Patterns used in production
+- Stream / Batch ingestion of Kafka to Data Lake (HDFS, S3, ADLS)
+- Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
+- Support for data sync across Federated Data Lake (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
+- Integrate external vendor API-s (e.g. Salesforce, Dynamics etc.) with data store (HDFS, Couchbase etc)
+- Enforcing Data retention policies and GDPR deletion on HDFS / ADLS
+
+Highlights
+- Battle tested at scale: Runs in production at petabyte-scale at companies like LinkedIn, PayPal, Verizon etc.
+- Feature rich: Supports job/task scheduling, task partitioning, fault tolerance, error handling, state management for incremental processing, data quality checking, atomic data publishing etc.
+- Supports stream and batch execution modes
+- Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.
+
+Apache Gobblin is NOT
+- A general purpose data transformation engine like Spark or Flink. Gobblin can delegate complex-data processing tasks to Spark, Hive etc.
+- A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
+- A general-purpose workflow execution system like Airflow, Azkaban, Dagster, Luigi. You can use these systems to kick-off Gobblin jobs or use Gobblin’s in-built Quartz based scheduler.

Review comment:
       We can mention here that Gobblin-as-a-Service has a workflow execution system for scheduling and orchestrating Gobblin jobs.




