This is an automated email from the ASF dual-hosted git repository.

guoyp pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/griffin.git


The following commit(s) were added to refs/heads/master by this push:
     new e293406f Griffin 2.0.0 arch (#654)
e293406f is described below

commit e293406f5756a9d375a1e123f32dbbdd72934130
Author: William Guo <[email protected]>
AuthorDate: Wed Jul 3 17:15:13 2024 +0800

    Griffin 2.0.0 arch (#654)
    
    * new proposal for data quality tool
    
    * repolish
    
    * enterprise job scheduler might retry or stop the downstream scheduler 
based on standardized result
    
    * illustrate data quality etl phase, append after business phase
    
    * elaborate integrate with business workflow
    
    * init DQDiagrams
    
    * add metric storage service in DQDiagrams
    
    * triggered on demand
    
    * add more metrics
    
    * update arch diagram
    
    * init two table diff result set
    
    * data platform upgrades data quality checking pipeline
    
    * typo
    
    * update it as data quality constrains
---
 griffin-doc/DQDiagrams.md          | 101 +++++++++++++++++++++
 griffin-doc/DataQualityTool.md     | 177 +++++++++++++++++++++++++++++++++++++
 griffin-doc/TwoTablesDiffResult.md |  16 ++++
 griffin-doc/arch2.png              | Bin 0 -> 71521 bytes
 4 files changed, 294 insertions(+)

diff --git a/griffin-doc/DQDiagrams.md b/griffin-doc/DQDiagrams.md
new file mode 100644
index 00000000..c964f25d
--- /dev/null
+++ b/griffin-doc/DQDiagrams.md
@@ -0,0 +1,101 @@
+# DQ Diagrams
+## Arch
+![img.png](arch2.png)
+
+## Entities
+
+### DQMetric
+> Represents a generic data quality metric used to assess various aspects of data quality (quantitative).
+
+- **DQCompletenessMetric**
+  > Measures the completeness of data, ensuring that all required data is present.
+
+    - **DQCOUNTMetric**
+      > A specific completeness metric that counts the number of non-missing values in a dataset.
+    - **DQNULLPERCENTAGEMetric**
+      > A specific completeness metric that measures the percentage of null values in a dataset.
+
+- **DQAccuracyMetric**
+  > Measures the accuracy of data, ensuring that data values are correct and conform to a known standard.
+
+    - **DQNULLMetric**
+      > An accuracy metric that counts the number of NULL values in a dataset.
+
+- **DQPROFILEMetric**
+  > Measures the profile of data, such as max, min, median, avg, and stddev.
+
+    - **DQMAXMetric**
+      > A profile metric that computes the maximum of values in a dataset.
+
+    - **DQMINMetric**
+      > A profile metric that computes the minimum of values in a dataset.
+
+    - **DQMEDIANMetric**
+      > A profile metric that computes the median of values in a dataset.
+
+    - **DQAVGMetric**
+      > A profile metric that computes the average of values in a dataset.
+
+    - **DQSTDDEVMetric**
+      > A profile metric that computes the standard deviation of values in a dataset.
+
+    - **DQTOPKMetric**
+      > A profile metric that lists the top k most frequent values in a dataset.
+
+
+- **DQUniquenessMetric**
+  > Measures the uniqueness of data, ensuring that there are no duplicate records.
+
+    - **DQDISTINCTCOUNTMetric**
+      > A specific uniqueness metric that identifies and counts unique records in a dataset.
+
+- **DQFreshnessMetric**
+  > Measures the freshness of data, ensuring that the data is up-to-date.
+
+    - **DQTTUMetric (Time to Usable)**
+      > A freshness metric that measures the time taken for data to become usable after it is created or updated.
+
+- **DQDiffMetric**
+  > Compares data across different datasets or points in time to identify discrepancies.
+
+    - **DQTableDiffMetric**
+      > A specific diff metric that compares entire tables to identify differences.
+
+    - **DQFileDiffMetric**
+      > A specific diff metric that compares files to identify differences.
+
+- **MetricStorageService**
+  > A service that stores and serves data quality metrics.
+
+
+- **DQJob**
+  > An abstraction over data-quality-related jobs.
+    - **MetricCollectingJob**
+      > A job that collects data quality metrics from various sources and stores them for analysis.
+
+    - **DQCheckJob**
+      > A job that performs data quality checks based on predefined rules and metrics.
+
+    - **DQAlertJob**
+      > A job that generates alerts when data quality issues are detected.
+
+    - **DQDag**
+      > A directed acyclic graph that defines the dependencies and execution order of various data quality jobs.
+
+- **Scheduler**
+  > A system that schedules and manages the execution of data quality jobs.
+  > This is the default scheduler; it launches data quality jobs periodically.
+
+    - **DolphinSchedulerAdapter**
+      > Connects our planned data quality jobs with Apache DolphinScheduler,
+      > allowing data quality jobs to be triggered upon the completion of the upstream jobs they depend on.
+    - **AirflowSchedulerAdapter**
+      > Connects our planned data quality jobs with Apache Airflow,
+      > so that data quality jobs can be triggered upon the completion of the upstream jobs they depend on.
+
+- **Worker**
+  > Open question: do we need another worker layer, given that most of the work is done on the big data side?
+
diff --git a/griffin-doc/DataQualityTool.md b/griffin-doc/DataQualityTool.md
new file mode 100644
index 00000000..1347000c
--- /dev/null
+++ b/griffin-doc/DataQualityTool.md
@@ -0,0 +1,177 @@
+# Data Quality Tool
+
+## Introduction
+
+In the evolving landscape of data architecture, ensuring data quality remains a critical success factor for all companies.
+Data architectures have progressed significantly over recent years, transitioning from relational databases and data warehouses to data lakes, hybrid data lake and warehouse combinations, and modern lakehouses.
+
+Despite these advancements, data quality issues persist and have become increasingly important, especially in the era of AI and data integration. Improving data quality is essential for all organizations, and maintaining it across various environments requires a combination of people, processes, and technology.
+
+To address these challenges, we will upgrade the data quality tool so that it can be easily adopted by any data organization. This tool abstracts common data quality problems and integrates seamlessly with diverse data architectures.
+
+## Data Quality Dimensions
+
+1. **Accuracy** – Data should be error-free according to business needs.
+2. **Consistency** – Data should not conflict with other values across data sets.
+3. **Completeness** – Data should not be missing.
+4. **Timeliness** – Data should be up-to-date within a limited time frame.
+5. **Uniqueness** – Data should have no duplicates.
+6. **Validity** – Data should conform to a specified format.
+
+## Our new Architecture
+
+Our new architecture consists of two primary layers: the Data Quality Layer and the Integration Layer.
+
+### Data Quality Constraints Layer
+
+This constraints layer abstracts the core concepts of the data quality lifecycle, focusing on:
+
+- **Defining Specific Data Quality Constraints**:
+  - **Metrics**: Establishing specific data quality metrics.
+  - **Anomaly Detection**: Implementing methods for detecting anomalies.
+  - **Actions**: Defining actions to be taken based on the data quality assessments.
+
+- **Measuring Data Quality**:
+  - Utilizing various connectors such as SQL, HTTP, and CMD to measure data quality across different systems.
+
+- **Unifying Data Quality Results**:
+  - Creating a standardized and structured view of data quality results across different dimensions to ensure a consistent understanding.
+
+- **Flexible Data Quality Jobs**:
+  - Designing data quality jobs within a generic, topological Directed Acyclic Graph (DAG) framework to facilitate easy plug-and-play functionality.
+
+### Integration Layer
+
+This layer provides a robust framework to enable users to integrate Griffin data quality pipelines seamlessly with their business processes. It includes:
+
+- **Scheduler Integration**:
+  - Ensuring seamless integration with typical schedulers for efficient pipeline execution.
+
+- **Apache DolphinScheduler Integration**:
+  - Facilitating effortless integration within the Java ecosystem to leverage Apache DolphinScheduler.
+
+- **Apache Airflow Integration**:
+  - Enabling smooth integration within the AI ecosystem using Apache Airflow.
+
+This architecture aims to provide a comprehensive and flexible approach to managing data quality and integrating it into the existing business workflows of data teams.
+
+The enterprise job scheduling system can then launch optional data quality check pipelines after the usual data jobs finish, and, based on the data quality results, schedule actions such as retrying or stopping downstream scheduling, like a circuit breaker.
+
+### Data Quality Layer
+
+#### Data Quality Constraints Definition
+
+This concept has been thoroughly discussed in the original Apache Griffin design documents. Essentially, we aim to quantify the data quality of a dataset based on the aforementioned dimensions. For example, to measure the count of records in a user table, our data quality constraint definition could be:
+
+**Simple Version:**
+
+- **Metric**
+  - Name: count_of_users
+  - Target: user_table
+  - Dimension: count
+- **Anomaly Condition:** $metric <= 0
+- **Post Action:** send alert
+
+**Advanced Version:**
+
+- **Metric**
+  - Name: count_of_users
+  - Target: user_table
+  - Filters: city = 'shanghai' and event_date = '20240601'
+  - Dimension: count
+- **Anomaly Condition:** $metric <= 0
+- **Post Action:** send alert
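
As a minimal sketch of how such a constraint could be evaluated, the snippet below wires a recorded metric value to the anomaly condition `$metric <= 0` and a post action. The function and parameter names are assumptions for illustration, not Griffin's API.

```python
def evaluate_constraint(metric_value: float, anomaly_condition, post_action):
    """Apply the anomaly condition to the recorded metric; fire the post action on violation."""
    if anomaly_condition(metric_value):
        return post_action(metric_value)
    return "passed"

# count_of_users on user_table, anomaly condition: $metric <= 0, post action: send alert
alerts = []
result = evaluate_constraint(
    metric_value=0,                      # e.g. the recorded COUNT(*) came back empty
    anomaly_condition=lambda m: m <= 0,  # $metric <= 0
    post_action=lambda m: alerts.append(f"count_of_users={m}") or "alert sent",
)
print(result)  # alert sent
```

A healthy metric value (e.g. 42) would return "passed" without triggering the alert.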
+
+#### Data Quality Pipelines (DAG)
+
+We support several typical data quality pipelines:
+
+**One Dataset Profiling Pipeline:**
+
+```plaintext
+recording_target_table_metric_job -> anomaly_condition_job -> post_action_job
+```
+
+**Dataset Diff Pipeline:**
+
+```plaintext
+recording_target_table1_metric_job  ->
+                                       \
+                                        -> anomaly_condition_job  -> post_action_job
+                                       /
+recording_target_table2_metric_job  ->
+```
+
+**Compute Platform Migration Pipeline:**
+
+```plaintext
+run_job_on_platform_v1 -> recording_target_table_metric_job_on_v1  ->
+                                                                       \
+                                                                        -> anomaly_condition_job  -> post_action_job
+                                                                       /
+run_job_on_platform_v2 -> recording_target_table_metric_job_on_v2  ->
+```
+
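
These pipelines are plain DAGs of jobs, so any topological runner can drive them. A sketch of the dataset-diff pipeline using Python's standard-library `graphlib` (the job names follow the diagram above; the runner itself is an assumption, not Griffin's scheduler):

```python
from graphlib import TopologicalSorter

# Dataset Diff Pipeline: each job maps to the set of jobs it depends on.
dag = {
    "recording_target_table1_metric_job": set(),
    "recording_target_table2_metric_job": set(),
    "anomaly_condition_job": {"recording_target_table1_metric_job",
                              "recording_target_table2_metric_job"},
    "post_action_job": {"anomaly_condition_job"},
}

# static_order() yields an execution order that respects the dependencies:
# both metric jobs first (in either order), then the condition, then the post action.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Swapping in the table names of the migration pipeline gives the platform-comparison variant with no change to the runner.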
+#### Data Quality Report
+
+- **Meet Expectations**
+  + Data Quality Constraint 1: Passed
+  + Data Quality Constraint 2: Passed
+- **Does Not Meet Expectations**
+  + Data Quality Constraint 3: Failed
+    - Violation details
+    - Possible root cause
+  + Data Quality Constraint 4: Failed
+    - Violation details
+    - Possible root cause
+
+#### Connectors
+
+The executor measures the data quality of the target dataset by recording the metrics. It supports many predefined protocols, and customers can extend the executor protocol if they want to add their own business logic.
+
+**Predefined Protocols:**
+
+- MySQL: `jdbc:mysql://hostname:port/database_name?user=username&password=password`
+- Presto: `jdbc:presto://hostname:port/catalog/schema`
+- Trino: `jdbc:trino://hostname:port/catalog/schema`
+- HTTP: `http://hostname:port/api/v1/query?query=<prometheus_query>`
+- Docker
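
As an illustration of the SQL-connector path, recording a metric amounts to running a query over the target and capturing the result. The sketch below uses Python's standard-library `sqlite3` as a stand-in for the real MySQL/Presto/Trino connectors, with the filters from the advanced constraint example; the table and data are made up for the demo.

```python
import sqlite3

# In-memory stand-in for the target dataset behind a SQL connector.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_table (user_id INTEGER, city TEXT, event_date TEXT)")
conn.executemany("INSERT INTO user_table VALUES (?, ?, ?)",
                 [(1, "shanghai", "20240601"), (2, "beijing", "20240601")])

# Record the count_of_users metric with filters: city = 'shanghai' and event_date = '20240601'.
(metric,) = conn.execute(
    "SELECT COUNT(*) FROM user_table "
    "WHERE city = 'shanghai' AND event_date = '20240601'"
).fetchone()
print(metric)  # 1
```

The HTTP and CMD connectors would record the same kind of scalar metric, only fetched through a different protocol.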
+
+### Integration Layer
+
+Every data team has its own existing scheduler.
+While we provide a default scheduler, for greater adoption we will refactor our Apache Griffin scheduler capabilities to leverage our customers' schedulers.
+This involves redesigning our scheduler to either ingest job instances into our customers' schedulers or bridge our DQ pipelines to their DAGs.
+
+```plaintext
+  biz_etl_phase  ||     data_quality_phase
+                 ||
+business_etl_job -> recording_target_table1_metric_job  - ->
+                 ||                                         \
+                 ||                                           -> anomaly_condition_job  -> post_action_job
+                 ||                                         /
+business_etl_job -> recording_target_table2_metric_job  - ->
+                 ||
+```
+
+- Integration with a generic scheduler
+- Integration with Apache DolphinScheduler
+- Integration with Apache Airflow
diff --git a/griffin-doc/TwoTablesDiffResult.md 
b/griffin-doc/TwoTablesDiffResult.md
new file mode 100644
index 00000000..6357a56d
--- /dev/null
+++ b/griffin-doc/TwoTablesDiffResult.md
@@ -0,0 +1,16 @@
+# Two tables diff result set
+We want to unify the result set for comparing two tables. When the two tables' schemas are the same,
+we can construct the result set as below so that our users can quickly find the differences between the two tables.
+
+| diff_type  | col1_src | col1_target | col2_src  | col2_target | col3_src  | col3_target | col4_src   | col4_target |
+|------------|----------|-------------|-----------|-------------|-----------|-------------|------------|-------------|
+| missing    | prefix1  | NULL        | sug_vote1 | NULL        | pv_total1 | NULL        | 2024-01-01 | NULL        |
+| additional | NULL     | prefix1     | NULL      | sug_vote2   | NULL      | pv_total2   | NULL       | 2024-01-01  |
+| missing    | prefix3  | NULL        | sug_vote3 | NULL        | pv_total3 | NULL        | 2024-01-03 | NULL        |
+| additional | NULL     | prefix4     | NULL      | sug_vote4   | NULL      | pv_total3   | NULL       | 2024-01-03  |
+| missing    | prefix5  | NULL        | sug_vote5 | NULL        | pv_total5 | NULL        | 2024-01-05 | NULL        |
+| additional | NULL     | prefix5     | NULL      | sug_vote5   | NULL      | pv_total6   | NULL       | 2024-01-05  |
+| missing    | prefix7  | NULL        | sug_vote7 | NULL        | pv_total7 | NULL        | 2024-01-07 | NULL        |
+| additional | NULL     | prefix8     | NULL      | sug_vote8   | NULL      | pv_total8   | NULL       | 2024-01-07  |
+| missing    | prefix9  | NULL        | sug_vote9 | NULL        | pv_total9 | NULL        | 2024-01-09 | NULL        |
+| additional | NULL     | prefix10    | NULL      | sug_vote10  | NULL      | pv_total10  | NULL       | 2024-01-09  |
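
A sketch of how such a result set could be produced for two same-schema tables. This is illustrative only (the real comparison would run on the big data side); rows present only in the source are reported as `missing` from the target, and rows present only in the target as `additional`.

```python
def two_table_diff(src_rows, target_rows):
    """Emit (diff_type, col1_src, col1_target, col2_src, col2_target, ...) rows,
    interleaving each source/target value pair as in the result-set table."""
    src, target = set(src_rows), set(target_rows)
    result = []
    for row in sorted(src - target):      # in src but not in target -> missing
        result.append(("missing",) + tuple(x for v in row for x in (v, None)))
    for row in sorted(target - src):      # in target but not in src -> additional
        result.append(("additional",) + tuple(x for v in row for x in (None, v)))
    return result

src = [("prefix1", "sug_vote1", "pv_total1", "2024-01-01")]
target = [("prefix1", "sug_vote2", "pv_total2", "2024-01-01")]
for row in two_table_diff(src, target):
    print(row)
```

Here `None` plays the role of the `NULL` cells in the table above.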
diff --git a/griffin-doc/arch2.png b/griffin-doc/arch2.png
new file mode 100644
index 00000000..cc871bfc
Binary files /dev/null and b/griffin-doc/arch2.png differ
