Hi all,

For the past year, I've been working on a rule-based data
quality solution, and it works well in our environment.

I'd like to discuss with you the possibility of moving Apache Griffin to a
rule-based solution.

I want to replace the measure module with purely SQL-based measurements.

The workflow is like this:

Users register their datasets in some register center, e.g. as (resourcename,
link, connector).
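To make the registration step concrete, here is a minimal sketch assuming an in-memory dict as the register center; the field names mirror the (resourcename, link, connector) triple above, and the links are made-up examples:

```python
def register_dataset(registry, resource_name, link, connector):
    """Register a dataset so Griffin can later resolve it by name."""
    registry[resource_name] = {"link": link, "connector": connector}

# Hypothetical entries for the TA/TB example used later in this mail.
registry = {}
register_dataset(registry, "TA", "hdfs:///warehouse/ta", "hive")
register_dataset(registry, "TB", "hdfs:///warehouse/tb", "hive")
```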

Griffin provides SQL-based measurements and connects to different data sources,
as Apache Presto or Apache Spark does.

The new model will cover all existing use cases, since we will provide many
SQL-based rules covering all dimensions such as accuracy, completeness,
consistency, uniqueness, and timeliness. Moreover, users can extend the system
with domain-specific rules by providing more recording/checking SQL-based rules.

Users define their data quality requirements, with parameters, based on
Griffin's SQL-based measurements.
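A sketch of how a user might bind parameters to a system rule; the rule name, the parameter list, and the helper function are illustrative assumptions, not existing Griffin APIs:

```python
# Hypothetical system rule: the consistency cross-count rule used as an
# example later in this mail, parameterized by three values.
CROSS_COUNT_RULE = {
    "name": "consistency_cross_count",
    "params": ["sourceTable", "targetTable", "threshold"],
}

def define_rule_instance(rule, **params):
    """Bind user-supplied parameters to a system rule, checking completeness."""
    missing = [p for p in rule["params"] if p not in params]
    if missing:
        raise ValueError("missing parameters: %s" % missing)
    return {"rule": rule["name"], "params": params}

instance = define_rule_instance(
    CROSS_COUNT_RULE, sourceTable="TA", targetTable="TB", threshold=0.95
)
```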

Users can fire an event to ask Griffin to trigger a data quality check;
typically the event is fired when an ETL task finishes.

Griffin dispatches the data quality recording SQL to query engines like Apache
Spark or Apache Presto.

Based on the recording result of the check, Griffin fires an alert if the
result does not match the expectation defined in the rule.

Griffin can store these recording results in some TSDB.
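Putting the steps above together, a minimal sketch of the event-driven loop might look like this; run_sql, tsdb_write, and send_alert are hypothetical hooks standing in for the query engine, the TSDB, and the alerting channel:

```python
def on_etl_finished(event, run_sql, tsdb_write, send_alert):
    """Record counts for both tables, check the ratio, persist it, alert on failure."""
    src = run_sql("SELECT count(1) FROM %s WHERE dt='%s'"
                  % (event["sourceTable"], event["partition"]))
    tgt = run_sql("SELECT count(1) FROM %s WHERE dt='%s'"
                  % (event["targetTable"], event["partition"]))
    ratio = tgt / src if src else 0.0
    tsdb_write(event["partition"], ratio)   # store the recording result in the TSDB
    if ratio < event["threshold"]:          # not as expected -> fire an alert
        send_alert("ratio %.2f below threshold %.2f" % (ratio, event["threshold"]))
    return ratio
```

In this shape the checking logic stays generic while the engine, storage, and alerting are pluggable.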

Of course, we need to make sure the next-generation Griffin is backward
compatible.

For example,

Say I want to check consistency between two datasets, TA and TB: if
count(TB)/count(TA) < 95%, I need Griffin to send an alert to
[email protected].

The workflow is as follows:

From the system rules, I select the consistency cross-count rule and read its
specification. The rule requires three parameters,
{sourceTable, targetTable, threshold}, so I provide them as {sourceTable=TA,
targetTable=TB, threshold=0.95}.

Every time my ETL task is done, some system will send a message to Griffin to
check my consistency cross-count data quality. The message looks like
{ruleinstance=100, partition='20211222'}.

Upon receiving the message, by leveraging the metadata of rule instance 100,
Griffin will send two recording SQLs
SELECT count(1) FROM TA WHERE dt='20211222'
SELECT count(1) FROM TB WHERE dt='20211222'
to the compute engine.
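For illustration, expanding the rule-instance metadata plus the trigger message into those two recording SQLs could be sketched like this; the template string and helper are assumptions, and the count(1) form matches the count metrics below:

```python
# Hypothetical recording-SQL template attached to the cross-count rule.
RECORDING_SQL = "SELECT count(1) FROM {table} WHERE dt='{partition}'"

def recording_sqls(instance_meta, message):
    """Render one recording SQL per table referenced by the rule instance."""
    tables = (instance_meta["sourceTable"], instance_meta["targetTable"])
    return [RECORDING_SQL.format(table=t, partition=message["partition"])
            for t in tables]

sqls = recording_sqls(
    {"sourceTable": "TA", "targetTable": "TB", "threshold": 0.95},
    {"ruleinstance": 100, "partition": "20211222"},
)
```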

When the compute engine returns these two recording results, say
metrics_count_ta = 100 and metrics_count_tb = 94,

Griffin will trigger the checking phase: the checking result = 94/100 =
0.94, and this result will be persisted to the TSDB.

Since 0.94 < 0.95 (the user-defined threshold), Griffin will send an email to
[email protected].




Thanks,
William
