Repository: incubator-griffin
Updated Branches:
  refs/heads/master 7f4292273 -> 10afa997e
Fix doc bug and improve readability of Griffin project introduction

Author: Eugene <[email protected]>

Closes #368 from toyboxman/doc/fix2.

Project: http://git-wip-us.apache.org/repos/asf/incubator-griffin/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-griffin/commit/10afa997
Tree: http://git-wip-us.apache.org/repos/asf/incubator-griffin/tree/10afa997
Diff: http://git-wip-us.apache.org/repos/asf/incubator-griffin/diff/10afa997

Branch: refs/heads/master
Commit: 10afa997e26cbb9a4291bc3f2fcf8a96b29968ca
Parents: 7f42922
Author: Eugene <[email protected]>
Authored: Tue Jul 24 08:57:14 2018 +0800
Committer: William Guo <[email protected]>
Committed: Tue Jul 24 08:57:14 2018 +0800

----------------------------------------------------------------------
 griffin-doc/intro.md | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-griffin/blob/10afa997/griffin-doc/intro.md
----------------------------------------------------------------------
diff --git a/griffin-doc/intro.md b/griffin-doc/intro.md
index 9949956..8e8c93c 100644
--- a/griffin-doc/intro.md
+++ b/griffin-doc/intro.md
@@ -17,32 +17,32 @@ specific language governing permissions and limitations under the License. -->
 ## Abstract
-Apache Griffin is a Data Quality Service platform built on Apache Hadoop and Apache Spark. It provides a framework process for defining data quality model, executing data quality measurement, automating data profiling and validation, as well as a unified data quality visualization across multiple data systems. It tries to address the data quality challenges in big data and streaming context.
+Apache Griffin is a Data Quality Service Platform (**DQSP**) built on top of Apache Hadoop and Apache Spark.
It provides a comprehensive framework that handles tasks such as defining data quality models, executing data quality measurements, and automating data profiling and validation, as well as unified data quality visualization across multiple data systems. It aims to address data quality challenges in big data applications.
 
 ## Overview of Apache Griffin
-When people use big data (Hadoop or other streaming systems), measurement of data quality is a big challenge. Different teams have built customized tools to detect and analyze data quality issues within their own domains. As a platform organization, we think of taking a platform approach to commonly occurring patterns. As such, we are building a platform to provide shared Infrastructure and generic features to solve common data quality pain points. This would enable us to build trusted data assets.
+When people work with big data (Hadoop or other streaming systems), they hit a big challenge: measuring data quality. Different teams have built customized tools to detect and analyze data quality issues within their own domains. We believe, however, that a platform approach fits these commonly occurring patterns. As such, we are building a platform to provide shared infrastructure and generic features to solve common data quality pain points. This would help us build trusted data assets.
 
-Currently it is very difficult and costly to do data quality validation when we have large volumes of related data flowing across multi-platforms (streaming and batch). Take eBay's Real-time Personalization Platform as a sample; Everyday we have to validate the data quality for ~600M records. Data quality often becomes one big challenge in this complex environment and massive scale.
+Currently it is very difficult and costly to validate data quality when we have large volumes of related data flowing across multiple platforms (streaming and batch).
Taking eBay's Real-time Personalization Platform as an example, every day we have to validate data quality for ~600M records. Data quality often becomes one big challenge in this complex, massive-scale environment.
 
-We detect the following at eBay:
+We hit the following problems at eBay:
 
-1. Lack of an end-to-end, unified view of data quality from multiple data sources to target applications that takes into account the lineage of the data. This results in a long time to identify and fix data quality issues.
-2. Lack of a system to measure data quality in streaming mode through self-service. The need is for a system where datasets can be registered, data quality models can be defined, data quality can be visualized and monitored using a simple tool and teams alerted when an issue is detected.
-3. Lack of a Shared platform and API Service. Every team should not have to apply and manage own hardware and software infrastructure to solve this common problem.
+1. Lack of an end-to-end, unified view of data quality from multiple data sources to target applications that takes into account the lineage of the data. This makes it take a long time to identify and fix data quality issues.
+2. Lack of a unified system to measure data quality in streaming mode through self-service. The system should be one where datasets can be registered, data quality models can be defined, data quality can be visualized and monitored using a simple tool, and teams alerted when an issue is detected.
+3. Lack of a shared platform and exposed API services. Teams should not have to reinvent the wheel, or apply for and manage their own hardware and software infrastructure, to solve this common problem.
 
-With these in mind, we decided to build Apache Griffin - A data quality service that aims to solve the above short-comings.
+Considering these problems, we decided to build Apache Griffin - a data quality service that aims to solve the above shortcomings.
Apache Griffin includes:
 
-**Data Quality Model Engine**: Apache Griffin is model driven solution, user can choose various data quality dimension to execute his/her data quality validation based on selected target data-set or source data-set ( as the golden reference data). It has corresponding library supporting it in back-end for the following measurement:
+**Data Quality Model Engine**: Apache Griffin is a model-driven solution; users can choose various data quality dimensions to execute their data quality validation based on a selected target data-set or source data-set (as the golden reference data). It has a corresponding library supporting it in the back-end for the following measurements:
 
- - Accuracy - Does data reflect the real-world objects or a verifiable source
- - Completeness - Is all necessary data present
- - Validity - Are all data values within the data domains specified by the business
- - Timeliness - Is the data available at the time needed
- - Anomaly detection - Pre-built algorithm functions for the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset
- - Data Profiling - Apply statistical analysis and assessment of data values within a dataset for consistency, uniqueness and logic.
+ - Accuracy - measures whether data reflects the real-world objects or a verifiable source
+ - Completeness - checks that all necessary data is present
+ - Validity - checks that all data values are within the data domains specified by the business
+ - Timeliness - checks that the data is available at the time needed
+ - Anomaly detection - pre-built algorithm functions for the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset
+ - Data Profiling - applies statistical analysis and assessment of data values within a dataset for consistency, uniqueness and logic.
 
 **Data Collection Layer**:

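As a side note on the accuracy dimension described in the diff above: the idea of validating a target data-set against a source data-set used as the golden reference can be sketched in a few lines. This is only an illustrative toy, not Griffin's actual engine (Griffin runs its measures on Spark); the `accuracy` function, the `key` extractor, and the sample records are all hypothetical.

```python
# Toy sketch of an "accuracy" measure: the fraction of target records
# whose key also appears in the golden source data-set.
# Hypothetical example, not Griffin's real implementation.

def accuracy(source, target, key=lambda record: record):
    """Return matched/total for target records against the golden source."""
    golden = {key(r) for r in source}          # keys of the reference data
    if not target:
        return 1.0                             # nothing to validate
    matched = sum(1 for r in target if key(r) in golden)
    return matched / len(target)

# Sample (made-up) records: (user_id, region)
source = [("u1", "NY"), ("u2", "CA"), ("u3", "TX")]   # golden reference
target = [("u1", "NY"), ("u2", "CA"), ("u4", "WA")]   # data-set to validate

print(accuracy(source, target))  # -> 0.6666666666666666 (2 of 3 records match)
```

At eBay's quoted scale (~600M records/day) this naive in-memory set comparison would not hold up, which is exactly why Griffin delegates the measurement to a distributed back-end.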