Repository: incubator-griffin
Updated Branches:
  refs/heads/master 7f4292273 -> 10afa997e
Fix doc bug and improve readability of Griffin project introduction

Author: Eugene <[email protected]>

Closes #368 from toyboxman/doc/fix2.

Project: http://git-wip-us.apache.org/repos/asf/incubator-griffin/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-griffin/commit/10afa997
Tree: http://git-wip-us.apache.org/repos/asf/incubator-griffin/tree/10afa997
Diff: http://git-wip-us.apache.org/repos/asf/incubator-griffin/diff/10afa997

Branch: refs/heads/master
Commit: 10afa997e26cbb9a4291bc3f2fcf8a96b29968ca
Parents: 7f42922
Author: Eugene <[email protected]>
Authored: Tue Jul 24 08:57:14 2018 +0800
Committer: William Guo <[email protected]>
Committed: Tue Jul 24 08:57:14 2018 +0800

----------------------------------------------------------------------
 griffin-doc/intro.md | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-griffin/blob/10afa997/griffin-doc/intro.md
----------------------------------------------------------------------
diff --git a/griffin-doc/intro.md b/griffin-doc/intro.md
index 9949956..8e8c93c 100644
--- a/griffin-doc/intro.md
+++ b/griffin-doc/intro.md
@@ -17,32 +17,32 @@ specific language governing permissions and limitations under the License. -->
 ## Abstract
-Apache Griffin is a Data Quality Service platform built on Apache Hadoop and Apache Spark. It provides a framework process for defining data quality model, executing data quality measurement, automating data profiling and validation, as well as a unified data quality visualization across multiple data systems. It tries to address the data quality challenges in big data and streaming context.
+Apache Griffin is a Data Quality Service Platform (**DQSP**) built on top of Apache Hadoop and Apache Spark.
It provides a comprehensive framework that handles tasks such as defining data quality models, executing data quality measurements, and automating data profiling and validation, as well as unified data quality visualization across multiple data systems. It aims to address data quality challenges in big data applications.
 
 ## Overview of Apache Griffin
-When people use big data (Hadoop or other streaming systems), measurement of data quality is a big challenge. Different teams have built customized tools to detect and analyze data quality issues within their own domains. As a platform organization, we think of taking a platform approach to commonly occurring patterns. As such, we are building a platform to provide shared Infrastructure and generic features to solve common data quality pain points. This would enable us to build trusted data assets.
+When people work with big data (Hadoop or other streaming systems), they hit a big challenge: measuring data quality. Different teams have built customized tools to detect and analyze data quality issues within their own domains. We believe, however, that a platform approach fits these commonly occurring patterns. As such, we are building a platform to provide shared infrastructure and generic features to solve common data quality pain points. This would help us build trusted data assets.
 
-Currently it is very difficult and costly to do data quality validation when we have large volumes of related data flowing across multi-platforms (streaming and batch). Take eBay's Real-time Personalization Platform as a sample; Everyday we have to validate the data quality for ~600M records. Data quality often becomes one big challenge in this complex environment and massive scale.
+Currently it is very difficult and costly to validate data quality when we have large volumes of related data flowing across multiple platforms (streaming and batch).
Taking eBay's Real-time Personalization Platform as an example, every day we have to validate data quality for ~600M records. Data quality often becomes one big challenge in this complex, massive-scale environment.
 
-We detect the following at eBay:
+We hit the following problems at eBay:
 
-1. Lack of an end-to-end, unified view of data quality from multiple data sources to target applications that takes into account the lineage of the data. This results in a long time to identify and fix data quality issues.
-2. Lack of a system to measure data quality in streaming mode through self-service. The need is for a system where datasets can be registered, data quality models can be defined, data quality can be visualized and monitored using a simple tool and teams alerted when an issue is detected.
-3. Lack of a Shared platform and API Service. Every team should not have to apply and manage own hardware and software infrastructure to solve this common problem.
+1. Lack of an end-to-end, unified view of data quality from multiple data sources to target applications that takes into account the lineage of the data. This makes it take a long time to identify and fix data quality issues.
+2. Lack of a unified system to measure data quality in streaming mode through self-service. The system should be one where datasets can be registered, data quality models can be defined, data quality can be visualized and monitored using a simple tool, and teams alerted when an issue is detected.
+3. Lack of a shared platform and exposed API services. Teams should not have to reinvent the wheel, or apply for and manage their own hardware and software infrastructure, to solve this common problem.
 
-With these in mind, we decided to build Apache Griffin - A data quality service that aims to solve the above short-comings.
+Considering these problems, we decided to build Apache Griffin - a data quality service that aims to solve the above shortcomings.
Apache Griffin includes:
 
-**Data Quality Model Engine**: Apache Griffin is model driven solution, user can choose various data quality dimension to execute his/her data quality validation based on selected target data-set or source data-set ( as the golden reference data). It has corresponding library supporting it in back-end for the following measurement:
+**Data Quality Model Engine**: Apache Griffin is a model-driven solution; users can choose various data quality dimensions to execute their data quality validation based on a selected target data-set or source data-set (as the golden reference data). It has a corresponding library supporting it in the back-end for the following measurements:
 
- - Accuracy - Does data reflect the real-world objects or a verifiable source
- - Completeness - Is all necessary data present
- - Validity - Are all data values within the data domains specified by the business
- - Timeliness - Is the data available at the time needed
- - Anomaly detection - Pre-built algorithm functions for the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset
- - Data Profiling - Apply statistical analysis and assessment of data values within a dataset for consistency, uniqueness and logic.
+ - Accuracy - measures whether data reflects the real-world objects or a verifiable source
+ - Completeness - checks that all necessary data is present
+ - Validity - checks that all data values are within the data domains specified by the business
+ - Timeliness - checks that the data is available at the time needed
+ - Anomaly detection - pre-built algorithm functions for the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset
+ - Data Profiling - applies statistical analysis and assessment of data values within a dataset for consistency, uniqueness and logic.
 
 **Data Collection Layer**:

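As a side note on the accuracy dimension described in the diff above: the idea of validating a target data-set against a source data-set used as the golden reference can be sketched in a few lines. This is only an illustrative toy, not Griffin's actual engine (Griffin runs its measures on Spark); the `accuracy` function, the `key` extractor, and the sample records are all hypothetical.

```python
# Toy sketch of an "accuracy" measure: the fraction of target records
# whose key also appears in the golden source data-set.
# Hypothetical example, not Griffin's real implementation.

def accuracy(source, target, key=lambda record: record):
    """Return matched/total for target records against the golden source."""
    golden = {key(r) for r in source}          # keys of the reference data
    if not target:
        return 1.0                             # nothing to validate
    matched = sum(1 for r in target if key(r) in golden)
    return matched / len(target)

# Sample (made-up) records: (user_id, region)
source = [("u1", "NY"), ("u2", "CA"), ("u3", "TX")]   # golden reference
target = [("u1", "NY"), ("u2", "CA"), ("u4", "WA")]   # data-set to validate

print(accuracy(source, target))  # -> 0.6666666666666666 (2 of 3 records match)
```

At eBay's quoted scale (~600M records/day) this naive in-memory set comparison would not hold up, which is exactly why Griffin delegates the measurement to a distributed back-end.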