Re: [VOTE] Bring Griffin to Apache Incubator

2016-12-01 Thread Lv Alex
+1(non-binding)

Sent from my iPhone

> 在 2016年12月2日,上午9:00,Kasper Sørensen  写道:
> 
> +1 (binding)
> 
> 2016-12-01 17:58 GMT-08:00 Julian Hyde :
> 
>> +1 (binding)
>> 
>>> On Dec 1, 2016, at 3:30 PM, Liang Chen  wrote:
>>> 
>>> Hi
>>> 
>>> +1(non-binding)
>>> 
>>> Regards
>>> Liang
>>> 
>>> Henry Saputra wrote
 Hi All,
 
 As the champion for Griffin, I would like to start a VOTE to bring the
 project into the Apache Incubator as a podling.
 
 Here is the direct quote from the abstract:
 
 "
 Griffin is a Data Quality Service platform built on Apache Hadoop and
 Apache Spark. It provides a framework process for defining data
 quality model, executing data quality measurement, automating data
 profiling and validation, as well as a unified data quality
 visualization across multiple data systems. It tries to address the
 data quality challenges in big data and streaming context.
 "
 
 Please cast your vote:
 
 [ ] +1, bring Griffin into Incubator
 [ ] +0, I don't care either way,
 [ ] -1, do not bring Griffin into Incubator, because...
 
 This vote will be open for at least 72 hours, and only votes from the
 Incubator PMC are binding.
 
 The VOTE will end 12/5 at 9am PST, letting the vote run through the
 weekend.
 
 
 Here is the link to the proposal:
 
 https://wiki.apache.org/incubator/GriffinProposal
 
 I have copied the proposal below for easy access
 
 
 Thanks,
 
 - Henry
 
 
 
 Griffin Proposal
 
 Abstract
 
 Griffin is a Data Quality Service platform built on Apache Hadoop and
 Apache Spark. It provides a framework for defining data quality
 models, executing data quality measurements, and automating data
 profiling and validation, as well as unified data quality
 visualization across multiple data systems. It addresses data
 quality challenges in big data and streaming contexts.
 
 Proposal
 
 Griffin is an open source Data Quality solution for distributed data
 systems at any scale, in both streaming and batch contexts. When
 people use open source products (e.g. Apache Hadoop, Apache Spark,
 Apache Kafka, Apache Storm), they often need a data quality service
 to build confidence in the data processed by those platforms.
 Griffin creates a unified process to define and construct data
 quality measurement pipelines across multiple data systems to
 provide:
 
 Automatic quality validation of the data
 Data profiling and anomaly detection
 Data quality lineage from upstream to downstream data systems
 Data quality health monitoring visualization
 Shared infrastructure resource management
 
 Overview of Griffin
 
 Griffin has been deployed in production at eBay serving major data
 systems. It takes a platform approach, providing generic features to
 solve common data quality validation pain points. First, users
 register the data assets they want to run data quality checks on. A
 data asset can be batch data in an RDBMS (e.g. Teradata) or an Apache
 Hadoop system, or near real-time streaming data from Apache Kafka,
 Apache Storm, and other real-time data platforms. Second, users
 create a data quality model to define the data quality rules and
 metadata. Third, the model or rule is executed automatically (by the
 model engine); for streaming data, sample data quality validation
 results are available within a few seconds. Finally, users analyze
 the data quality results through the built-in visualization tool and
 take action.
 
 Griffin includes:
 
 Data Quality Model Engine
 
 Griffin is a model-driven solution: users choose among various data
 quality dimensions to execute data quality validation based on a
 selected target data-set or source data-set (as the golden reference
 data). A corresponding back-end library supports each of the
 following measurements:
 
 Accuracy - Does the data reflect the real-world objects or a
 verifiable source?
 Completeness - Is all necessary data present?
 Validity - Are all data values within the data domains specified by
 the business?
 Timeliness - Is the data available at the time needed?
 Anomaly detection - Pre-built algorithm functions for the
 identification of items, events or observations which do not conform
 to an expected pattern or to other items in a dataset
 Data Profiling - Statistical analysis and assessment of data values
 within a dataset for consistency, uniqueness and logic
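 [Editor's note: not part of the proposal. As a rough sketch of what an
 accuracy measurement of this kind computes — the record structure,
 field names, and function are hypothetical, not Griffin's actual API:]

```python
# Hypothetical sketch of an "accuracy" measurement: the fraction of
# target records that exactly match the corresponding record in the
# golden source data-set. Record shape and field names are made up.

def accuracy(source, target, key):
    """source/target: lists of dicts; key: field identifying a record."""
    golden = {row[key]: row for row in source}
    matched = sum(
        1 for row in target
        if key in row and golden.get(row[key]) == row
    )
    return matched / len(target) if target else 1.0

source = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 7.5}]
target = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 9.9}]
print(accuracy(source, target, "id"))  # one of two records matches -> 0.5
```

 In production such a comparison would run distributed (e.g. as a Spark
 job over both data-sets), but the metric itself is this simple ratio.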
 
 Data Collection Layer
 
 We support two kinds of data sources, batch data and real time data.
 
 For batch mode, we can collect data from an Apache Hadoop based
 platform through various data connectors.
 
 For real-time mode, we can connect with a messaging system like Kafka
 for near real-time analysis.
 
 Data Process and Storage Layer
 
 For batch analysis, our data quality model computes data quality
 metrics in our Spark cluster based on data sources in Apache Hadoop.
 
 For near real-time analysis, we consume data from the messaging
 system, then our data quality model computes the real-time data
 quality metrics in our Spark cluster. For data storage, we use a
 time-series database in our back end to fulfill front-end requests.
 
 Griffin Service
 
 We have RESTful web services to accomplish all the functionalities of
 Griffin, such as register data asset, create data quality 
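 [Editor's note: not part of the proposal. The layering described here
 — compute a metric per batch, persist it as a time series for the
 front end — can be sketched in miniature; the metric, storage shape,
 and names below are illustrative assumptions, not Griffin's real API:]

```python
import time

# Illustrative sketch of the process/storage flow: compute a simple
# completeness metric for each incoming batch and append the result
# to an in-memory time-series store keyed by metric name. In Griffin
# the computation would run on Spark and the store would be a
# time-series database; names and shapes here are hypothetical.

timeseries = {}  # metric name -> list of (timestamp, value)

def completeness(batch, required_fields):
    """Fraction of records with all required fields non-null."""
    if not batch:
        return 1.0
    ok = sum(
        1 for row in batch
        if all(row.get(f) is not None for f in required_fields)
    )
    return ok / len(batch)

def record_metric(name, value, ts=None):
    """Append one (timestamp, value) point to the metric's series."""
    timeseries.setdefault(name, []).append((ts or time.time(), value))

batch = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
]
record_metric("users.completeness", completeness(batch, ["id", "email"]))
print(timeseries["users.completeness"][-1][1])  # 2 of 3 complete -> ~0.667
```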

Re: [VOTE] Bring Griffin to Apache Incubator

2016-12-01 Thread Kasper Sørensen
+1 (binding)


Who goes to the FOSDEM

2016-12-01 Thread Raphael Bircher

Hi all

Sorry for the cross-posting, but I wonder who plans to attend FOSDEM 
in Brussels? It may be a good opportunity to meet up, mainly after the 
programme in the evening.


Regards Raphael


-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



Re: [VOTE] Bring Griffin to Apache Incubator

2016-12-01 Thread Julian Hyde
+1 (binding)


Re: [VOTE] Bring Griffin to Apache Incubator

2016-12-01 Thread Liang Chen
Hi

+1(non-binding)

Regards
Liang


Re: [VOTE] Bring Griffin to Apache Incubator

2016-12-01 Thread Luciano Resende
+1 (binding)


Re: [VOTE] Bring Griffin to Apache Incubator

2016-12-01 Thread Gangumalla, Uma
+1 (binding)

Regards,
Uma


Re: [VOTE] Bring Griffin to Apache Incubator

2016-12-01 Thread Jacques Nadeau
+1 (binding)

On Thu, Dec 1, 2016 at 8:47 AM, Andrew Purtell 
wrote:

> +1 (binding)
>
> > On Dec 1, 2016, at 8:35 AM, Felix Cheung  wrote:
> >
> > +1
> >
> > On Wed, Nov 30, 2016 at 10:40 PM Henry Saputra 
> > wrote:
> >
> >> Hi All,
> >>
> >> As the champion for Griffin, I would like to start VOTE to bring  the
> >> project as Apache incubator podling.
> >>
> >> Here is the direct quote from the abstract:
> >>
> >> "
> >> Griffin is a Data Quality Service platform built on Apache Hadoop and
> >> Apache Spark. It provides a framework process for defining data
> >> quality model, executing data quality measurement, automating data
> >> profiling and validation, as well as a unified data quality
> >> visualization across multiple data systems. It tries to address the
> >> data quality challenges in big data and streaming context.
> >> "
> >>
> >> Please cast your vote:
> >>
> >> [ ] +1, bring Griffin into Incubator
> >> [ ] +0, I don't care either way,
> >> [ ] -1, do not bring Griffin into Incubator, because...
> >>
> >> This vote will be open at least for 72 hours and only votes from the
> >> Incubator PMC are binding.
> >>
> >> The VOTE will end 12/5 9am PST to pass through weekend.
> >>
> >>
> >> Here is the link to the proposal:
> >>
> >> https://wiki.apache.org/incubator/GriffinProposal
> >>
> >> I have copied the proposal below for easy access
> >>
> >>
> >> Thanks,
> >>
> >> - Henry
> >>
> >>
> >>
> >> Griffin Proposal
> >>
> >> Abstract
> >>
> >> Griffin is a Data Quality Service platform built on Apache Hadoop and
> >> Apache Spark. It provides a framework for defining data quality
> >> models, executing data quality measurements, automating data
> >> profiling and validation, and unifying data quality
> >> visualization across multiple data systems. It addresses the
> >> data quality challenges of big data and streaming contexts.
> >>
> >> Proposal
> >>
> >> Griffin is an open source Data Quality solution for distributed data
> >> systems at any scale, in both streaming and batch contexts. When
> >> people use open source products (e.g. Apache Hadoop, Apache Spark,
> >> Apache Kafka, Apache Storm), they need a data quality service
> >> to build confidence in the quality of the data processed by those
> >> platforms. Griffin creates a unified process to define and construct
> >> data quality measurement pipelines across multiple data systems,
> >> providing:
> >>
> >> Automatic quality validation of the data
> >> Data profiling and anomaly detection
> >> Data quality lineage from upstream to downstream data systems.
> >> Data quality health monitoring visualization
> >> Shared infrastructure resource management
> >>
> >> Overview of Griffin
> >>
> >> Griffin has been deployed in production at eBay, serving major data
> >> systems. It takes a platform approach, providing generic features to
> >> solve common data quality validation pain points. First, users
> >> register the data assets they want to check. A data asset can be
> >> batch data in an RDBMS (e.g. Teradata) or an Apache Hadoop
> >> system, or near real-time streaming data from Apache Kafka, Apache
> >> Storm, and other real-time data platforms. Second, users create a
> >> data quality model defining the data quality rules and metadata.
> >> Third, the model engine executes the model or rule automatically,
> >> returning sample data quality validation results within a
> >> few seconds for streaming data. Finally, users analyze the data
> >> quality results in the built-in visualization tool and take action.
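The four steps of that workflow (register asset, define model, execute, report) can be sketched as a tiny program. This is an illustrative outline only, not Griffin's real API; every name and structure below is hypothetical:

```python
# Hypothetical sketch of the register -> model -> execute -> report flow
# described above. Names and structures are illustrative only.
from dataclasses import dataclass

@dataclass
class DataAsset:       # step 1: register the asset to be checked
    name: str
    kind: str          # "batch" (e.g. a Hive table) or "streaming" (e.g. a Kafka topic)

@dataclass
class QualityModel:    # step 2: define the rule and its metadata
    dimension: str     # e.g. "accuracy", "completeness"
    source: DataAsset  # golden reference data
    target: DataAsset  # data set under validation

def execute(model: QualityModel) -> dict:
    # Step 3: a real engine would submit a Spark job against the assets;
    # here we return a stub metric to show the shape of a result.
    return {"model": model.dimension, "target": model.target.name, "score": 1.0}

golden = DataAsset("orders_src", "batch")
replica = DataAsset("orders_dst", "batch")
result = execute(QualityModel("accuracy", golden, replica))
print(result)   # step 4: the result record feeds the visualization layer
```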
> >>
> >> Griffin includes:
> >>
> >> Data Quality Model Engine
> >>
> >> Griffin is a model-driven solution: users choose among various data
> >> quality dimensions to run validation against a selected
> >> target data set and source data set (the golden reference data). A
> >> corresponding back-end library supports each of the
> >> following measurements:
> >>
> >> Accuracy - Does the data reflect the real-world objects or a
> >> verifiable source?
> >> Completeness - Is all necessary data present?
> >> Validity - Are all data values within the data domains specified by the
> >> business?
> >> Timeliness - Is the data available at the time needed?
> >> Anomaly detection - Pre-built algorithms for identifying
> >> items, events, or observations that do not conform
> >> to an expected pattern or to other items in a dataset
> >> Data Profiling - Statistical analysis and assessment of data
> >> values within a dataset for consistency, uniqueness, and logic
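Each of these dimensions is, in effect, a function from a data set (and, for accuracy, a golden reference) to a score. A minimal sketch of two of them in plain Python — not Griffin's actual implementation; the record layout and function names are assumptions for illustration:

```python
# Illustrative sketch of two quality dimensions from the list above.
# NOT Griffin's API; record layout and names are assumed.

def accuracy(target, golden, key):
    """Fraction of golden-source records reproduced exactly in the target."""
    golden_index = {row[key]: row for row in golden}
    matched = sum(1 for row in target if golden_index.get(row[key]) == row)
    return matched / len(golden) if golden else 1.0

def completeness(rows, required_fields):
    """Fraction of (row, field) pairs where a required field is present."""
    total = len(rows) * len(required_fields)
    present = sum(1 for row in rows for f in required_fields
                  if row.get(f) is not None)
    return present / total if total else 1.0

golden = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
target = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]
print(accuracy(target, golden, "id"))          # 0.5: one of two rows matches
print(completeness(target, ["id", "amount"]))  # 1.0: no missing fields
```

In production these computations would run as distributed jobs (Spark for batch, a streaming consumer for Kafka) rather than over in-memory lists.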
> >>
> >> Data Collection Layer
> >>
> >> We support two kinds of data sources: batch data and real-time data.
> >>
> >> In batch mode, data is collected from Apache Hadoop-based
> >> platforms through various data connectors.
> >>
> >> In real-time mode, Griffin connects to messaging systems like Kafka
> >> for near real-time analysis.

Re: [VOTE] Bring Griffin to Apache Incubator

2016-12-01 Thread Andrew Purtell
+1 (binding)


Re: [VOTE] Bring Griffin to Apache Incubator

2016-12-01 Thread Felix Cheung
+1


Re: [VOTE] Bring Griffin to Apache Incubator

2016-12-01 Thread Henry Saputra
Of course my Vote

+1 (binding)


Re: [VOTE] Bring Griffin to Apache Incubator

2016-12-01 Thread Stian Soiland-Reyes
+1 (binding)


Re: [VOTE] Bring Griffin to Apache Incubator

2016-12-01 Thread wp chun
+1. Data quality is critical to the success of data-driven business. This is a
good initiative to draw people's attention to ensuring data quality.



Re: [VOTE] Bring Griffin to Apache Incubator

2016-12-01 Thread Hao Chen
+1 (non-binding). Griffin looks like a very valuable project that fills a
challenging gap: data quality solutions in the big data domain.

- Hao
