airajena opened a new pull request, #7: URL: https://github.com/apache/fineract-business-intelligence/pull/7
# Add Python ELT Extractor with Local Development Infrastructure ## Summary This PR introduces the foundational **Extractor Module** for the Apache Fineract Business Intelligence project. It provides a Python-based ELT pipeline that extracts data from a Fineract PostgreSQL source database and loads it into an analytics warehouse, supporting both full backfill and incremental (delta) extraction modes. ## What's New ### Python ELT Extractor (`extractor/`) - **`extractor.py`** → Core extraction engine with batch streaming, bulk UPSERTs, and transaction management - **`cli.py`** → CLI entry point with `backfill` and `incremental` subcommands - **`config.py`** → Environment-based configuration (`AppConfig`, `DatabaseConfig`) - **`watermark_manager.py`** → Per-table incremental extraction state tracking via `meta.watermarks` - **`replica_lag_check.py`** → PostgreSQL replica lag safety gate using `pg_last_xact_replay_timestamp()` - **`requirements.txt`** → `psycopg2-binary==2.9.9` **Key Features:** - **Backfill Mode**: Full historical reload (clears raw tables + watermarks, extracts all rows) - **Incremental Mode**: Watermark-based delta extraction with configurable lookback window - **COB Gate**: Validates recent Close-of-Business batch completion before extraction - **Batch Processing**: Streams data in configurable batches (default 1000 rows) with `execute_values` bulk upserts - **Pipeline Tracking**: All runs audited in `meta.pipeline_state` with row counts and status - **SQL Injection Prevention**: All dynamic identifiers use `psycopg2.sql.Identifier()` **11 Source Tables Extracted:** `m_office`, `m_currency`, `m_client`, `m_product_loan`, `m_loan`, `m_loan_transaction`, `m_delinquency_range`, `m_delinquency_bucket`, `m_delinquency_bucket_mappings`, `m_loan_delinquency_tag_history`, `batch_job_execution` ### Docker Compose Infrastructure (`compose.yaml`) | Service | Image | Port | Purpose | |---------|-------|------|---------| | `fineract-db` | `postgres:18.3` | 5433 | Fineract source database | | `warehouse` | `postgres:16-alpine` | 5434 | Analytics warehouse | | `extractor` | `python:3.11-slim` (custom) | — | Python ETL container | ### Warehouse Schema (`warehouse/schema/`) - **`raw_tables.sql`** → 11 raw tables with composite PK `(tenant_id, id)` and `source_loaded_at` audit column - **`pipeline_state.sql`** → `meta.pipeline_state`, `meta.watermarks`, `meta.user_office_mapping` - **`staging_tables.sql`** → Placeholder schemas for future dbt transformations - **`mart_tables.sql`** → Analytics schema and indexes ### Development Scripts (`scripts/`) | Script | Purpose | |--------|---------| | `bootstrap_fineract_source.sh` | Starts Fineract backend, creates compatibility views, grants reader access | | `run_extractor_backfill.sh` | Runs extractor in backfill mode | | `run_extractor_incremental.sh` | Runs extractor in incremental mode | | `verify_extraction.sh` | Validates service health, schemas, row counts, pipeline state, watermarks | | `stop_fineract_backend.sh` | Stops the Fineract Java process | | `common.sh` | Shared utilities (Docker checks, env loading, path conversion, health polling) | ## Quick Start ```bash # 1. Configure environment cp .env.example .env # 2. Start infrastructure docker compose up -d fineract-db warehouse # 3. Bootstrap Fineract source ./scripts/bootstrap_fineract_source.sh # 4. Run extraction ./scripts/run_extractor_backfill.sh # 5. Verify ./scripts/verify_extraction.sh ``` ## Testing End-to-end test completed successfully: - Bootstrap: Fineract backend started, compatibility views created, reader access granted - Backfill: 174 rows extracted across 11 tables - Verification: 15/15 checks passed (service health, schemas, row counts, pipeline state, watermarks) ## Known Limitations (Future PRs) - dbt integration for staging/intermediate transformations - Apache Superset for dashboards - Multi-tenant extraction (schema is ready, not yet tested) - Production remote Fineract support ## Files Added ``` compose.yaml .env.example docker/extractor/Dockerfile docker/fineract-postgresql/initdb/001_init_fineract_databases.sh docker/postgres-warehouse/initdb/001_init_warehouse.sql docker/postgres-warehouse/initdb/002_create_warehouse_roles.sh extractor/__init__.py extractor/cli.py extractor/config.py extractor/extractor.py extractor/replica_lag_check.py extractor/requirements.txt extractor/watermark_manager.py scripts/bootstrap_fineract_source.sh scripts/common.sh scripts/run_extractor_backfill.sh scripts/run_extractor_incremental.sh scripts/stop_fineract_backend.sh scripts/verify_extraction.sh warehouse/schema/mart_tables.sql warehouse/schema/pipeline_state.sql warehouse/schema/raw_tables.sql warehouse/schema/staging_tables.sql ``` ## Checklist - [x] All files include Apache 2.0 license headers (verified via local RAT scan) - [x] Shell script syntax validated (`bash -n`) - [x] Python syntax validated (AST parse) - [x] YAML syntax validated - [x] End-to-end extraction test passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
