+1. Data quality is critical to the success of data-driven businesses. This is a
good initiative to draw people's attention to ensuring data quality.

________________________________
From: Hao Chen <h...@apache.org>
Sent: December 1, 2016 6:01:15 AM
To: general@incubator.apache.org
Subject: Re: [VOTE] Bring Griffin to Apache Incubator

+1 (non-binding). Griffin looks like a very valuable project to fill the
challenging gap in data quality solutions in the big data domain.

- Hao

On Thu, Dec 1, 2016 at 2:40 PM, Henry Saputra <henry.sapu...@gmail.com>
wrote:

> Hi All,
>
> As the champion for Griffin, I would like to start a VOTE to bring the
> project in as an Apache Incubator podling.
>
> Here is the direct quote from the abstract:
>
> "
> Griffin is a Data Quality Service platform built on Apache Hadoop and
> Apache Spark. It provides a framework process for defining data
> quality model, executing data quality measurement, automating data
> profiling and validation, as well as a unified data quality
> visualization across multiple data systems. It tries to address the
> data quality challenges in big data and streaming context.
> "
>
> Please cast your vote:
>
> [ ] +1, bring Griffin into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Griffin into Incubator, because...
>
> This vote will be open for at least 72 hours and only votes from the
> Incubator PMC are binding.
>
> The VOTE will end on 12/5 at 9am PST to allow for the weekend.
>
>
> Here is the link to the proposal:
>
> https://wiki.apache.org/incubator/GriffinProposal
>
> I have copied the proposal below for easy access.
>
>
> Thanks,
>
> - Henry
>
>
>
> Griffin Proposal
>
> Abstract
>
> Griffin is a Data Quality Service platform built on Apache Hadoop and
> Apache Spark. It provides a framework process for defining data
> quality model, executing data quality measurement, automating data
> profiling and validation, as well as a unified data quality
> visualization across multiple data systems. It tries to address the
> data quality challenges in big data and streaming context.
>
> Proposal
>
> Griffin is an open source Data Quality solution for distributed data
> systems at any scale, in both streaming and batch data contexts. When
> people use open source products (e.g. Apache Hadoop, Apache Spark,
> Apache Kafka, Apache Storm), they need a data quality service to
> build their confidence in the quality of the data processed by those
> platforms. Griffin creates a unified process to define and construct
> data quality measurement pipelines across multiple data systems to
> provide:
>
> Automatic quality validation of the data
> Data profiling and anomaly detection
> Data quality lineage from upstream to downstream data systems
> Data quality health monitoring visualization
> Shared infrastructure resource management
>
> Overview of Griffin
>
> Griffin has been deployed in production at eBay, serving major data
> systems. It takes a platform approach to provide generic features that
> solve common data quality validation pain points. First, the user
> registers the data asset to be checked for data quality. The data
> asset can be batch data in an RDBMS (e.g. Teradata) or Apache Hadoop
> system, or near real-time streaming data from Apache Kafka, Apache
> Storm and other real-time data platforms. Second, the user creates a
> data quality model to define the data quality rules and metadata.
> Third, the model or rule is executed automatically (by the model
> engine) to get sample data quality validation results within a few
> seconds for streaming data. Finally, the user can analyze the data
> quality results through the built-in visualization tool and take
> action.
>
> Griffin includes:
>
> Data Quality Model Engine
>
> Griffin is a model-driven solution: the user can choose various data
> quality dimensions to run data quality validation against a selected
> target data set or source data set (as the golden reference data). A
> corresponding back-end library supports the following measurements:
>
> Accuracy - Does data reflect the real-world objects or a verifiable source
> Completeness - Is all necessary data present
> Validity - Are all data values within the data domains specified by the
> business
> Timeliness - Is the data available at the time needed
> Anomaly detection - Pre-built algorithm functions for the
> identification of items, events or observations which do not conform
> to an expected pattern or other items in a dataset
> Data Profiling - Apply statistical analysis and assessment of data
> values within a dataset for consistency, uniqueness and logic.
>
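
The accuracy, completeness and validity dimensions listed above can be
illustrated with a minimal sketch. This is not Griffin's implementation or
API: the function names, the record layout and the sample data are all
assumptions made for illustration, and plain Python lists stand in for the
Hadoop/Spark data sets the proposal describes.

```python
from typing import Any, Dict, Iterable, List, Set

# Hypothetical record shape: one dict per row (not a Griffin type).
Record = Dict[str, Any]


def accuracy(target: List[Record], source: List[Record], key: str) -> float:
    """Share of target records that exactly match the source (golden
    reference) record carrying the same key value."""
    if not target:
        return 1.0
    reference = {rec[key]: rec for rec in source}
    matched = sum(1 for rec in target if reference.get(rec.get(key)) == rec)
    return matched / len(target)


def completeness(records: List[Record], required: Iterable[str]) -> float:
    """Share of required fields that are present and not None."""
    fields = list(required)
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    present = sum(1 for rec in records for f in fields if rec.get(f) is not None)
    return present / total


def validity(records: List[Record], field: str, allowed: Set[Any]) -> float:
    """Share of records whose field value lies in the allowed domain."""
    if not records:
        return 1.0
    return sum(1 for rec in records if rec.get(field) in allowed) / len(records)


# Illustrative data only: source is the golden reference, target is checked.
source = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "DE"},
    {"id": 3, "country": "CN"},
]
target = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": None},   # missing value
    {"id": 3, "country": "XX"},   # value outside the business domain
    {"id": 4, "country": "US"},   # no counterpart in the source
]

print(accuracy(target, source, key="id"))
print(completeness(target, ["id", "country"]))
print(validity(target, "country", {"US", "DE", "CN"}))
```

With this sample data the three calls give 0.25, 0.875 and 0.5. Each
measure is a simple ratio in [0, 1], which is one plausible way to make
results comparable across data sets; the proposal itself does not specify
how its metrics are scaled.
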
> Data Collection Layer
>
> We support two kinds of data sources: batch data and real-time data.
>
> For batch mode, we can collect source data from Apache Hadoop-based
> platforms through various data connectors.
>
> For real-time mode, we can connect to a messaging system such as Kafka
> for near real-time analysis.
>
> Data Process and Storage Layer
>
> For batch analysis, our data quality model computes data quality
> metrics in our Spark cluster, based on source data in Apache Hadoop.
>
> For near real-time analysis, we consume data from the messaging
> system, then our data quality model computes real-time data quality
> metrics in our Spark cluster. For data storage, we use a time series
> database in our back end to serve front-end requests.
>
> Griffin Service
>
> RESTful web services expose all of Griffin's functionality, such as
> registering data assets, creating data quality models, publishing
> metrics, retrieving metrics and adding subscriptions, so developers
> can build their own user interfaces on top of these web services.
>
> Background
>
> At eBay, when people work with big data in Apache Hadoop (or other
> streaming data), data quality often becomes a big challenge.
> Different teams have built customized data quality tools to detect and
> analyze data quality issues within their own domains. We want to take
> a platform approach that provides shared infrastructure and generic
> features to solve common data quality pain points. This would enable
> us to build trusted data assets.
>
> Currently it's very difficult and costly to do data quality validation
> when big data flows across multiple platforms at eBay (e.g. Oracle,
> Apache Hadoop, Couchbase, Apache Cassandra, Apache Kafka, MongoDB).
> Take eBay's real-time personalization platform as an example. Every
> day we have to validate data quality status for ~600M records
> (imagine we have 150M active users on our website). Data quality
> often becomes a big challenge in both its streaming and batch
> pipelines.
>
> We have identified three data quality problems at eBay:
>
> Lack of an end-to-end unified view of data quality measurement from
> multiple data sources to target applications; it usually takes a long
> time to identify and fix poor data quality.
> No way to measure data quality in streaming mode; we need a process
> and tool to visualize data quality insights by registering the
> dataset to be checked, creating a data quality measurement model,
> executing the data quality validation job and getting metrics
> insights to act on.
> No shared platform and API service; each team has to apply for and
> manage its own hardware and software infrastructure.
>
> Rationale
>
> The challenge we face at eBay is that our data volume keeps growing
> and our system processes are becoming more complex, while we do not
> have a unified data quality solution to ensure trusted data sets that
> give our data consumers confidence in data quality. The key
> challenges in data quality include:
>
> Existing commercial data quality solutions cannot address data quality
> lineage among systems and cannot scale out to support fast-growing
> data at eBay.
> Existing eBay domain-specific tools take a long time to identify and
> fix poor data quality when data flows through multiple systems.
> Business logic is becoming complex and requires a much more flexible
> data quality system.
>
> Some data quality issues do have business impact on user experiences,
> revenue, efficiency & compliance.
>
> Communication overhead around data quality metrics, typically in a
> big organization where different teams are involved.
>
> The idea of Griffin is to provide Data Quality validation as a
> Service, to allow data engineers and data consumers to have:
>
> Near real-time understanding of the data quality health of your data
> pipelines with end-to-end monitoring, all in one place.
> Profiling, detecting and correlating issues and providing
> recommendations that drive rapid and focused troubleshooting
> A centralized data quality model management system including rule,
> metadata, scheduler etc.
> Native code generation to run everywhere, including Hadoop, Kafka, Spark,
> etc.
> One set of tools to build data quality pipelines across all eBay data
> platforms.
>
> Current Status
>
> Meritocracy
>
> Griffin has been deployed in production at eBay and provides a
> centralized data quality service for several eBay systems (for
> example, the real-time personalization platform, the eBay real-time ID
> linking platform, Hadoop datasets and the site speed analytics
> platform). Our aim is
> to build a diverse developer and user community following the Apache
> meritocracy model. We will encourage contributions and participation
> of all types of work, and ensure that contributors are appropriately
> recognized.
>
> Community
>
> Currently the project is being developed at eBay and is used only by
> the eBay internal community. Griffin seeks to develop its developer
> and user communities during incubation. We believe it will grow
> substantially by becoming an Apache project.
>
> Core Developers
>
> Griffin is currently being designed and developed by engineers from
> eBay Inc. – William Guo, Alex Lv, Shawn Sha, Vincent Zhao, John Liu.
> All of these core developers have deep expertise in Apache Hadoop and
> the Hadoop Ecosystem in general.
>
> Alignment
>
> The ASF is a natural host for Griffin given that it is already the
> home of Hadoop, Beam, HBase, Hive, Storm, Kafka, Spark and other
> emerging big data products. By nature, these products need a data
> quality solution to ensure the quality of the data they process. When
> people use open source data technology, the big question for them is
> how to ensure the quality of the data in it. Griffin leverages a lot
> of Apache open source products. Griffin was designed to enable
> real-time insights into data quality validation through shared
> infrastructure and generic features that solve common data quality
> pain points.
>
> Known Risks
>
> Orphaned Products
>
> The core developers of the Griffin team work full time on this
> project. There is no risk of Griffin getting orphaned since at least
> one large company (eBay) is extensively using it in its production
> Hadoop and Spark clusters for multiple data systems. For example,
> four data systems at eBay (the real-time personalization platform,
> the eBay real-time ID linking platform, Hadoop and the site speed
> analytics platform) currently leverage Griffin, with more than 600M
> records validated for data quality status every day, 35 data sets
> monitored and 50+ data quality models created.
>
> As Griffin is designed to connect to many types of data sources, we
> are confident that others will use Griffin as a service for ensuring
> data quality in open source data ecosystems. We plan to extend and
> diversify this community further through Apache.
>
> Inexperience with Open Source
>
> Griffin's core engineers are all active users and followers of open
> source projects. They are already committers and contributors to the
> Griffin GitHub project. All have worked on source code that has been
> released under an open source license, and several of them also have
> experience developing code in an open source environment. Though the
> core set of developers does not have Apache open source experience,
> there are plans to onboard individuals with Apache open source
> experience onto the project.
>
> Homogenous Developers
>
> The core developers are from eBay. The Apache Incubation process
> encourages an open and diverse meritocratic community. Griffin intends
> to make every possible effort to build a diverse, vibrant and involved
> community. We are committed to recruiting additional committers from
> other companies based on their contribution to the project.
>
> Reliance on Salaried Developers
>
> eBay has invested in Griffin as a company-wide data quality service
> platform and some of its key engineers are working full time on the
> project. They are all paid by eBay. We look forward to other Apache
> developers and researchers contributing to the project.
>
> Relationships with Other Apache Products
>
> Griffin has a strong relationship with, and dependency on, Apache
> Hadoop, Apache HBase, Apache Spark, Apache Kafka, Apache Storm and
> Apache Hive. In addition, since there is a growing need for data
> quality solutions for open source platforms (e.g. Hadoop, Kafka,
> Spark etc.), being part of Apache's Incubation community could help
> with closer collaboration among these projects as well as others.
>
> Documentation
>
> Information about Griffin can be found at https://github.com/eBay/griffin
>
> Initial Source
>
> Griffin has been under development since early 2016 by a team of
> engineers at eBay Inc. It is currently hosted on GitHub.com under the
> Apache License 2.0 at https://github.com/eBay/griffin . Once in
> incubation we will move the code base to an Apache Git repository.
>
> External Dependencies
>
> Griffin has the following external dependencies.
>
> Basic
>
> JDK 1.7+
> Scala
> Apache Maven
> JUnit
> Log4j
> Slf4j
> Apache Commons
>
> Hadoop
>
> Apache Hadoop
> Apache HBase
> Apache Hive
>
> DB
>
> InfluxData
>
> Apache Spark
>
> Spark Core Library
>
> REST Service
>
> Jersey
> Spring MVC
>
> Web frontend
>
> AngularJS
> jQuery
> Bootstrap
> RequireJS
> eCharts
> Font Awesome
>
> Cryptography
>
> Currently there's no cryptography in Griffin.
>
> Required Resources
>
> Mailing List
>
> We currently use an eBay mailbox to communicate, but we'd like to move
> that to ASF-maintained mailing lists.
>
> Current mailing list: ebay-griffin-d...@googlegroups.com
>
> Proposed ASF maintained lists:
>
> priv...@griffin.incubator.apache.org
>
> d...@griffin.incubator.apache.org
>
> comm...@griffin.incubator.apache.org
>
> Subversion Directory
>
> Git is the preferred source control system.
>
> Issue Tracking
>
> JIRA
>
> Other Resources
>
> The existing code already has unit tests so we will make use of
> existing Apache continuous testing infrastructure. The resulting load
> should not be very large.
>
> Initial Committers
>
> William Guo
> Alex Lv
> Vincent Zhao
> Shawn Sha
> John Liu
> Liang Shao
>
> Affiliations
>
> The initial committers are employees of eBay Inc.
>
> Sponsors
>
> Champion
>
> Henry Saputra (hsapu...@apache.org)
>
> Nominated Mentors
>
> Kasper Sørensen (kasper...@apache.org)
>
> Uma Maheswara Rao Gangumalla (umamah...@apache.org)
>
> Luciano Resende (luckbr1...@gmail.com)
>
> Sponsoring Entity
>
> We are requesting the Incubator to sponsor this project.
>
