airajena opened a new pull request, #14:
URL: https://github.com/apache/fineract-business-intelligence/pull/14
# Summary
This PR introduces the complete dbt transformation layer for the Apache
Fineract Business Intelligence project. It transforms raw Fineract loan data
loaded by the Python ELT extractor into analytics-ready tables that power the
**Delinquency and Portfolio at Risk (PAR) Dashboard**.
The transformation layer follows a 4-layer architecture: **raw → staging →
intermediate → mart**, producing metrics segmented by branch and loan product.
---
## Architecture
```
PostgreSQL Warehouse (analytics DB)
│
├── raw.* ← Python ELT extractor writes here (11 tables)
│
├── staging.* ← dbt views: rename columns, handle PII, filter noise
│
├── intermediate.* ← ephemeral: date spine × active loans × delinquency
events
│
└── analytics.* ← dbt tables: dimensions, facts, mart (queried by
Superset)
├── dim_client
├── dim_office
├── dim_product
├── dim_date
├── dim_delinquency_bucket
├── dim_currency
├── fact_loan_snapshot ← incremental, one row per loan per day
├── fact_delinquency_event ← one row per delinquency event
└── mart_delinquency_par ← PAR 30/60/90 ratios, NPA, bucket
distribution
```
---
## Files Added
### Configuration
| File | Description |
|------|-------------|
| `dbt/dbt_project.yml` | Project config. Defines schema routing
(`staging.*`, `analytics.*`), materialization defaults (views for staging,
incremental for facts, tables for marts), and runtime vars
(`historical_start_date`, `snapshot_incremental_lookback_days`). |
| `dbt/profiles.yml` | PostgreSQL connection via environment variables
(`PGHOST`, `PGPORT`, `PGUSER`, `PGPASSWORD`, `PGDATABASE`). No credentials
hardcoded. |
| `dbt/requirements.txt` | `dbt-postgres>=1.7,<2.0` Apache 2.0 licensed. |
### Macros
| File | Description |
|------|-------------|
| `dbt/macros/generate_schema_name.sql` | Overrides dbt default that
prefixes target schema onto custom schema names. Without this, `+schema:
staging` would create `analytics_staging` instead of `staging`. |
| `dbt/macros/safe_divide.sql` | NULL/zero-safe division macro. Used in all
PAR ratio calculations to return NULL instead of a runtime division-by-zero
error when portfolio is empty. |
### Source Declaration
| File | Description |
|------|-------------|
| `dbt/models/sources.yml` | Declares all 11 raw tables as dbt sources with
freshness rules (warn after 24h, error after 72h). Enables `{{ source('raw',
'table_name') }}` references and `dbt source freshness` checks. |
### Staging Layer (views)
| File | What it does |
|------|-------------|
| `dbt/models/staging/stg_m_loan.sql` | Renames `id → loan_id`. Passes
through all balance columns including `principal_outstanding_derived`,
`is_npa`. No filtering staging is neutral. |
| `dbt/models/staging/stg_m_client.sql` | **PII handling**: derives
`client_hash = md5(tenant_id \|\| '::' \|\| id)` using built-in `md5()` (no
pgcrypto extension). Derives `age_band` from `date_of_birth`. **Excludes
`date_of_birth`** from output. |
| `dbt/models/staging/stg_m_loan_transaction.sql` | **Filters `WHERE NOT
is_reversed`** reversed transactions are cancelled payments that must not be
double-counted. COALESCEs all `_portion_derived` columns to 0. |
| `dbt/models/staging/stg_m_delinquency.sql` | Joins 4 raw tables into a
single denormalised delinquency event view:
`raw_m_loan_delinquency_tag_history` + `raw_m_delinquency_range` +
`raw_m_delinquency_bucket_mappings` + `raw_m_delinquency_bucket`. |
| `dbt/models/staging/stg_m_office.sql` | Renames `id → office_id`, `name →
office_name`, `parent_id → parent_office_id`. |
| `dbt/models/staging/stg_m_product_loan.sql` | Renames `id → product_id`,
`name → product_name`. Passes through loan product configuration. |
| `dbt/models/staging/stg_m_currency.sql` | Renames `code → currency_code`,
`name → currency_name`. |
| `dbt/models/staging/_staging.yml` | Column-level tests: `not_null`,
`unique`, `accepted_values` for loan status codes, transaction type codes, age
bands. |
### Intermediate Layer (ephemeral — no table created)
| File | What it does |
|------|-------------|
| `dbt/models/intermediate/int_loan_delinquency_status.sql` | **Core PAR
logic.** Creates one row per active loan per calendar day. Algorithm: (1)
select all active loans (`loan_status_id = 300`) with `disbursedon_date IS NOT
NULL`; (2) generate date spine from `historical_start_date` to `current_date`;
(3) cross-join loan × date where date is within loan's active period; (4) LEFT
JOIN delinquency events that were active on each date; (5) use `ROW_NUMBER()`
to pick the most recent event per loan-day; (6) derive all PAR flags and
`standard_par_band`. Materialized as `ephemeral` so it is inlined as a CTE
inside `fact_loan_snapshot` at compile time. |
### Dimension Models (tables rebuilt on each run)
| File | What it does |
|------|-------------|
| `dbt/models/marts/dimensions/dim_client.sql` | Pseudonymised client
dimension. No PII. `client_hash` is the surrogate key. |
| `dbt/models/marts/dimensions/dim_office.sql` | Branch hierarchy. Derives
`office_level` from Fineract's dot-separated hierarchy string (e.g. `.1.2.` →
level 2). |
| `dbt/models/marts/dimensions/dim_product.sql` | Loan product configuration
including interest rates, repayment frequency, NPA threshold. |
| `dbt/models/marts/dimensions/dim_date.sql` | Full date spine 2010-01-01 to
2035-12-31 with year, quarter, month, week, day-of-week, weekend flag,
first/last day of month, first day of quarter. |
| `dbt/models/marts/dimensions/dim_delinquency_bucket.sql` | Maps configured
Fineract delinquency buckets to standard PAR bands. Always includes synthetic
`bucket_key = 0` for `Current` (non-delinquent) loans. |
| `dbt/models/marts/dimensions/dim_currency.sql` | ISO 4217 currency
reference. |
### Fact Models (incremental tables)
| File | What it does |
|------|-------------|
| `dbt/models/marts/facts/fact_loan_snapshot.sql` | **Primary fact table.**
One row per active loan per day. Incremental with `delete+insert`, 120-day
lookback. `snapshot_key = md5(tenant_id::loan_id::snapshot_date)`. Contains PAR
flags (`is_par_30`, `is_par_60`, `is_par_90`, `is_watch_list`, `is_npa`) and
`standard_par_band`. |
| `dbt/models/marts/facts/fact_delinquency_event.sql` | One row per
delinquency classification event on a loan. Captures `previous_bucket_key` and
`previous_standard_par_band` via `LAG()` window function for the **delinquency
migration matrix**. Tracks `event_duration_days` and `is_active_event`. |
### Mart (presentation table)
| File | What it does |
|------|-------------|
| `dbt/models/marts/presentations/mart_delinquency_par.sql` | **Dashboard
mart.** Aggregates `fact_loan_snapshot` by `snapshot_date`, `office_id`,
`product_id`. Produces two row types per group: (1) `bucket_key = -1` ("All
Portfolio") rollup row for KPI cards; (2) per-bucket rows for distribution
charts. Contains all PAR metrics, NPA metrics, exclusive and cumulative band
amounts, and ratios. |
| `dbt/models/marts/_marts.yml` | FK relationship tests, uniqueness tests,
PAR ratio bounds test. |
---
## Screenshots
### 1. All schemas and tables created
The complete warehouse schema is initialised raw landing zone (11 tables),
analytics layer (9 tables), and metadata layer (3 tables). Staging is correctly
materialised as views (not tables).
<img width="764" height="562" alt="image"
src="https://github.com/user-attachments/assets/f2cf2d53-3351-4a58-b76c-21514fb9028a"
/>
---
### 2. ELT pipeline ran successfully
The Python extractor completed a full backfill, loading 1,268 rows across
all 11 source tables with `status = success`. The `meta.pipeline_state` table
tracks every run with timestamps and row counts.
<img width="1653" height="181" alt="image"
src="https://github.com/user-attachments/assets/6e6d34b4-f4e6-4f4d-9dad-69f8e6c1eaee"
/>
---
### 3. PAR ratios by office and product
`mart_delinquency_par` correctly aggregates PAR 30/60/90 ratios segmented by
branch and loan product . South Branch shows 100% PAR 30 (all 3 loans
delinquent) and 80.4% NPA.
<img width="1584" height="179" alt="image"
src="https://github.com/user-attachments/assets/7395eb09-5798-421e-afc0-005e048db988"
/>
---
### 4. All 6 delinquency scenarios classified correctly
`fact_loan_snapshot` correctly classifies all 13 active loans into their
exact PAR band on today's date. Every scenario is present: Current, Watch-list,
PAR 30-59, PAR 60-89, PAR 90+ (normal), PAR 90+ (NPA), PAR 90+ (extreme/180+).
Loans 13 (written-off) and 14 (closed) are correctly excluded.
<img width="1639" height="358" alt="image"
src="https://github.com/user-attachments/assets/2b96b057-5eb0-4a3f-ae9d-9a8c26f5df30"
/>
---
### 5. Delinquency bucket distribution
`mart_delinquency_par` correctly produces per-bucket rows showing how the
portfolio is distributed across delinquency bands. Uses the
`dim_delinquency_bucket` dimension joined into the mart. Used to power the
bucket distribution chart in Superset.
<img width="833" height="123" alt="image"
src="https://github.com/user-attachments/assets/7f970a40-c8b0-4ec3-a7c2-bc56c2e84597"
/>
---
### 6. Delinquency migration matrix (PAR 30 → PAR 60 → PAR 90)
`fact_delinquency_event` captures the complete delinquency history of a
loan with `previous_standard_par_band` derived via `LAG()`. Loan 10 shows the
full journey: entered PAR 30 on 2025-11-29 (61 days), escalated to PAR 60 on
2026-01-29 (30 days), escalated to PAR 90+ on 2026-02-28 (still active, 94+
days).
<img width="1270" height="140" alt="image"
src="https://github.com/user-attachments/assets/db30383c-008e-42f1-aab0-08ff8ef82b7e"
/>
---
### 7. Recovered loan : delinquency lifted, now Current
The date-spine logic correctly handles loan recovery. Loan 15 was in PAR 30
from 2026-01-15 to 2026-03-01 (`is_active_event = false`). On `CURRENT_DATE` it
correctly shows `standard_par_band = 'Current'` and `is_par_30 = false` the
loan is no longer delinquent.
<img width="1048" height="358" alt="image"
src="https://github.com/user-attachments/assets/eacc82ab-cfc7-4a87-81d7-a621f3328d04"
/>
---
### 8. Reversed transaction filter in staging
`stg_m_loan_transaction` correctly applies `WHERE NOT is_reversed`. The raw
table has 76 transactions (including 1 reversed payment). Staging exposes only
75 — the reversed transaction is excluded from all downstream calculations.
<img width="400" height="114" alt="image"
src="https://github.com/user-attachments/assets/f4e4b1c2-75aa-418f-b978-2c1727c6048b"
/>
---
### 9. PAR trend over time date spine is working
The date spine in `int_loan_delinquency_status` correctly generates one
snapshot per loan per day. The mart aggregation produces a daily row going back
610 days to the earliest loan disbursement .
<img width="930" height="388" alt="image"
src="https://github.com/user-attachments/assets/0d4abcf9-cfb5-4431-992b-f9bc28bffed9"
/>
---
### 10. PII handling : date_of_birth removed, client pseudonymised
The staging layer correctly removes `date_of_birth` from the analytics layer
and replaces it with `age_band` (bucketed). The `client_id` is replaced with
`client_hash = md5(tenant_id || '::' || id)`. Raw data contains PII; analytics
contains none.
<img width="1077" height="327" alt="image"
src="https://github.com/user-attachments/assets/5ec235da-4d93-4c20-81fe-8942e1bd94f4"
/>
---
### 11. dbt tests : all 57 pass
All column-level data quality tests pass: primary key uniqueness, not-null
constraints, accepted values for enum columns, and FK integrity between fact
and dimension tables.
<img width="1404" height="867" alt="image"
src="https://github.com/user-attachments/assets/308faf3c-3960-42fb-950a-657624345506"
/>
---
### 12. dbt run all 16 models built
The full transformation DAG builds successfully in the correct dependency
order. 7 staging views → 6 dimension tables → 2 fact tables → 1 mart.
<img width="1460" height="772" alt="image"
src="https://github.com/user-attachments/assets/78fcec7e-d7f5-4876-9877-9405ee9fefab"
/>
---
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]