airajena opened a new pull request, #7:
URL: https://github.com/apache/fineract-business-intelligence/pull/7

   # Add Python ELT Extractor with Local Development Infrastructure
   
   ## Summary
   
   This PR introduces the foundational **Extractor Module** for the Apache 
Fineract Business Intelligence project. It provides a Python-based ELT pipeline 
that extracts data from a Fineract PostgreSQL source database and loads it into 
an analytics warehouse, supporting both full backfill and incremental (delta) 
extraction modes.
   
   ## What's New
   
   ### Python ELT Extractor (`extractor/`)
   
   - **`extractor.py`** → Core extraction engine with batch streaming, bulk 
UPSERTs, and transaction management
   - **`cli.py`** → CLI entry point with `backfill` and `incremental` 
subcommands
   - **`config.py`** → Environment-based configuration (`AppConfig`, 
`DatabaseConfig`)
   - **`watermark_manager.py`** → Per-table incremental extraction state 
tracking via `meta.watermarks`
   - **`replica_lag_check.py`** → PostgreSQL replica lag safety gate using 
`pg_last_xact_replay_timestamp()`
   - **`requirements.txt`** → `psycopg2-binary==2.9.9`
   
   **Key Features:**
   - **Backfill Mode**: Full historical reload (clears raw tables + watermarks, 
extracts all rows)
   - **Incremental Mode**: Watermark-based delta extraction with configurable 
lookback window
   - **COB Gate**: Validates recent Close-of-Business batch completion before 
extraction
   - **Batch Processing**: Streams data in configurable batches (default 1000 
rows) with `execute_values` bulk upserts
   - **Pipeline Tracking**: All runs audited in `meta.pipeline_state` with row 
counts and status
   - **SQL Injection Prevention**: All dynamic identifiers use 
`psycopg2.sql.Identifier()`
   
   **11 Source Tables Extracted:** `m_office`, `m_currency`, `m_client`, 
`m_product_loan`, `m_loan`, `m_loan_transaction`, `m_delinquency_range`, 
`m_delinquency_bucket`, `m_delinquency_bucket_mappings`, 
`m_loan_delinquency_tag_history`, `batch_job_execution`
   
   ### Docker Compose Infrastructure (`compose.yaml`)
   
   | Service | Image | Port | Purpose |
   |---------|-------|------|---------|
   | `fineract-db` | `postgres:18.3` | 5433 | Fineract source database |
   | `warehouse` | `postgres:16-alpine` | 5434 | Analytics warehouse |
   | `extractor` | `python:3.11-slim` (custom) | — | Python ETL container |
   
   ### Warehouse Schema (`warehouse/schema/`)
   
   - **`raw_tables.sql`** → 11 raw tables with composite PK `(tenant_id, id)` 
and `source_loaded_at` audit column
   - **`pipeline_state.sql`** → `meta.pipeline_state`, `meta.watermarks`, 
`meta.user_office_mapping`
   - **`staging_tables.sql`** → Placeholder schemas for future dbt 
transformations
   - **`mart_tables.sql`** → Analytics schema and indexes
   
   ### Development Scripts (`scripts/`)
   
   | Script | Purpose |
   |--------|---------|
   | `bootstrap_fineract_source.sh` | Starts Fineract backend, creates 
compatibility views, grants reader access |
   | `run_extractor_backfill.sh` | Runs extractor in backfill mode |
   | `run_extractor_incremental.sh` | Runs extractor in incremental mode |
   | `verify_extraction.sh` | Validates service health, schemas, row counts, 
pipeline state, watermarks |
   | `stop_fineract_backend.sh` | Stops the Fineract Java process |
   | `common.sh` | Shared utilities (Docker checks, env loading, path 
conversion, health polling) |
   
   ## Quick Start
   
   ```bash
   # 1. Configure environment
   cp .env.example .env
   
   # 2. Start infrastructure
   docker compose up -d fineract-db warehouse
   
   # 3. Bootstrap Fineract source
   ./scripts/bootstrap_fineract_source.sh
   
   # 4. Run extraction
   ./scripts/run_extractor_backfill.sh
   
   # 5. Verify
   ./scripts/verify_extraction.sh
   ```
   
   ## Testing
   
   End-to-end test completed successfully:
   - Bootstrap: Fineract backend started, compatibility views created, reader 
access granted
   - Backfill: 174 rows extracted across 11 tables
   - Verification: 15/15 checks passed (service health, schemas, row counts, 
pipeline state, watermarks)
   
   ## Known Limitations (Future PRs)
   
   - dbt integration for staging/intermediate transformations
   - Apache Superset for dashboards
   - Multi-tenant extraction (schema is ready, not yet tested)
   - Production remote Fineract support
   
   ## Files Added
   
   ```
   compose.yaml
   .env.example
   docker/extractor/Dockerfile
   docker/fineract-postgresql/initdb/001_init_fineract_databases.sh
   docker/postgres-warehouse/initdb/001_init_warehouse.sql
   docker/postgres-warehouse/initdb/002_create_warehouse_roles.sh
   extractor/__init__.py
   extractor/cli.py
   extractor/config.py
   extractor/extractor.py
   extractor/replica_lag_check.py
   extractor/requirements.txt
   extractor/watermark_manager.py
   scripts/bootstrap_fineract_source.sh
   scripts/common.sh
   scripts/run_extractor_backfill.sh
   scripts/run_extractor_incremental.sh
   scripts/stop_fineract_backend.sh
   scripts/verify_extraction.sh
   warehouse/schema/mart_tables.sql
   warehouse/schema/pipeline_state.sql
   warehouse/schema/raw_tables.sql
   warehouse/schema/staging_tables.sql
   ```
   
   ## Checklist
   
   - [x] All files include Apache 2.0 license headers (verified via local RAT 
scan)
   - [x] Shell script syntax validated (`bash -n`)
   - [x] Python syntax validated (AST parse)
   - [x] YAML syntax validated
   - [x] End-to-end extraction test passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to