Hi all,

For the past year, I've been working on a rule-based data quality solution, and it has worked well in our environment.
I want to discuss with you the possibility of moving Apache Griffin to a rule-based solution, replacing the measure module with purely SQL-based measurements. The workflow would be:

1. Users register their datasets in a registry, e.g. as (resourceName, link, connector).
2. Griffin provides SQL-based measurements and connects to different data sources, as Apache Presto or Apache Spark does.
3. Users define their data quality requirements by supplying parameters to Griffin's SQL-based measurements.
4. Users fire an event to ask Griffin to trigger a data quality check, typically when an ETL task finishes.
5. Griffin dispatches the rule's recording SQL to a query engine such as Apache Spark or Apache Presto.
6. Based on the recording result, Griffin fires an alert if the result does not meet the expectation defined in the rule.
7. Griffin stores the recording results in a TSDB.

The new model will cover all existing use cases, since we will provide SQL-based rules for all dimensions: accuracy, completeness, consistency, uniqueness, and timeliness. Users can even extend it with domain-specific rules by providing additional recording/checking SQL. Of course, we need to make sure the next-generation Griffin is backward compatible.

For example, say I want to check the consistency between two datasets TA and TB: if COUNT(TB)/COUNT(TA) < 95%, Griffin should send an alarm to [email protected]. The workflow is like below:

From the system rules, I select the consistency cross-count rule and read its specification. The rule requires three parameters, {sourceTable, targetTable, threshold}, so I provide {sourceTable=TA, targetTable=TB, threshold=0.95}. Every time my ETL task finishes, some system sends a message to Griffin to check my consistency cross-count data quality.
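To make the rule-template/rule-instance idea concrete, here is a minimal sketch of how such a registration could look. The class and field names (RuleTemplate, RuleInstance, recording_sql, etc.) are my own illustrations, not an existing Griffin API:

```python
from dataclasses import dataclass

# Hypothetical sketch of a SQL-based rule template and one instance of it.
# All names here are illustrative, not part of any existing Griffin API.

@dataclass
class RuleTemplate:
    name: str
    required_params: list
    # Recording SQL with named placeholders, rendered per instance/partition.
    recording_sql: str

@dataclass
class RuleInstance:
    instance_id: int
    template: RuleTemplate
    params: dict
    alert_email: str

    def recording_queries(self, partition: str) -> list:
        """Render one recording SQL per table for the given partition."""
        return [
            self.template.recording_sql.format(
                table=self.params[key], partition=partition
            )
            for key in ("sourceTable", "targetTable")
        ]

# The system-provided consistency cross-count rule from the example above.
cross_count = RuleTemplate(
    name="consistency.cross-count",
    required_params=["sourceTable", "targetTable", "threshold"],
    recording_sql="SELECT COUNT(1) FROM {table} WHERE dt='{partition}'",
)

# The user's instance: compare TA against TB with a 0.95 threshold.
instance_100 = RuleInstance(
    instance_id=100,
    template=cross_count,
    params={"sourceTable": "TA", "targetTable": "TB", "threshold": 0.95},
    alert_email="[email protected]",
)

print(instance_100.recording_queries("20211222"))
```

Dispatching a check for a partition then reduces to looking up the instance by id and rendering its recording queries for that partition.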
The message is like {ruleInstance=100, partition='20211222'}. Upon receiving the message, Griffin uses the metadata of rule instance 100 to send two recording SQLs to the compute engine:

SELECT COUNT(1) FROM TA WHERE dt='20211222'
SELECT COUNT(1) FROM TB WHERE dt='20211222'

When the compute engine returns the two recording results, say metrics_count_ta = 100 and metrics_count_tb = 94, Griffin triggers the checking phase: the checking result is 94/100 = 0.94, and this result is persisted to the TSDB. Since 0.94 < 0.95 (the user-defined threshold), Griffin sends an email to [email protected].

Thanks,
William
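P.S. The checking phase in the example boils down to a small amount of logic; here is a sketch (the function name is hypothetical, but the arithmetic matches the example above):

```python
# Hypothetical sketch of the checking phase for a consistency cross-count
# rule: compare the two recorded counts against the user-defined threshold
# and decide whether to alert. Names are illustrative, not a Griffin API.

def check_cross_count(source_count: int, target_count: int, threshold: float):
    """Return (ratio, should_alert) for a consistency cross-count check."""
    ratio = target_count / source_count
    return ratio, ratio < threshold

# Values from the example: TA has 100 rows, TB has 94, threshold is 0.95.
ratio, should_alert = check_cross_count(source_count=100,
                                        target_count=94,
                                        threshold=0.95)
# ratio (0.94) is persisted to the TSDB; should_alert (True) drives the email.
print(ratio, should_alert)
```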
