andreahlert opened a new pull request, #231: URL: https://github.com/apache/airflow-steward/pull/231
## What

Fixes two critical correctness bugs in the `privacy-llm/redactor` package, surfaced by a code review.

### 1. `--field` help text lists type names the parser rejects

`redact.py` advertised `reporter` and code `R` in the `--field` help. Neither exists. The valid types come from `TYPE_CODES` in `mapping.py`: `name/email/phone/ip/handle/address`, codes `N/E/P/IP/H/A`. A user copying the documented form (`--field reporter:Jane Smith` or `--field R:Jane Smith`) got a `SystemExit` and their PII flowed to the LLM unredacted. Help text now lists the real names and codes.

### 2. `load_mapping` reads with the locale-default encoding

`load_mapping` used `path.read_text()` with no `encoding`, which resolves to `locale.getpreferredencoding(False)` (e.g. `cp1252` on Windows). But `save_mapping_atomic` writes the file as UTF-8. On any non-UTF-8 host this corrupts non-ASCII PII values (accented names, IDN domains) on the save/load round-trip, so `pii-reveal` substitutes the wrong text. `load_mapping` now reads with `encoding="utf-8"`, matching the writer.

## Changes

- `src/redactor/redact.py`: `--field` help text, `reporter/R` → `name/N`
- `src/redactor/mapping.py`: `load_mapping` reads `encoding="utf-8"`
- `tests/test_redact.py`: regression test on the `--help` output
- `tests/test_mapping.py`: regression test for a non-ASCII round-trip

## Validation

- `pytest`: 53 passed
- `ruff check` / `ruff format` / `mypy`: clean
- `prek`: all hooks pass
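
For readers who want to see how the first bug could be kept from recurring, here is a minimal sketch that builds the `--field` help string from the type table itself, so the two cannot drift apart. It is not the code in this PR. The dict shape of `TYPE_CODES`, the `build_parser` and `parse_field` helpers, and the `redact-sketch` program name are all assumptions made for illustration; only the type names and codes come from the description above.

```python
import argparse

# Assumed shape: the PR states the names and codes, not how TYPE_CODES is structured.
TYPE_CODES = {
    "name": "N",
    "email": "E",
    "phone": "P",
    "ip": "IP",
    "handle": "H",
    "address": "A",
}


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser whose --field help is generated from TYPE_CODES."""
    names = "/".join(TYPE_CODES)
    codes = "/".join(TYPE_CODES.values())
    parser = argparse.ArgumentParser(prog="redact-sketch")
    parser.add_argument(
        "--field",
        action="append",
        default=[],
        metavar="TYPE:VALUE",
        help=f"value to redact; TYPE is one of {names} (or a code: {codes})",
    )
    return parser


def parse_field(spec: str) -> tuple[str, str]:
    """Split TYPE:VALUE and normalise TYPE to its code; reject unknown types."""
    kind, sep, value = spec.partition(":")
    if not sep or not value:
        raise ValueError(f"expected TYPE:VALUE, got {spec!r}")
    if kind in TYPE_CODES:
        return TYPE_CODES[kind], value
    if kind in TYPE_CODES.values():
        return kind, value
    raise ValueError(f"unknown field type {kind!r}")
```

Under these assumptions, `parse_field("name:Jane Smith")` yields the code `N` with the value, while `parse_field("R:Jane Smith")` raises instead of silently passing the value through.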

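The encoding mismatch in the second bug can be reproduced without the package. The sketch below writes a small mapping as UTF-8 and reads it back twice, once as UTF-8 and once as `cp1252`, to stand in for a Windows locale default. The JSON layout, the placeholder keys, and the `roundtrip` helper are assumptions for illustration; the real on-disk schema of the mapping file is not shown in this PR.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical mapping contents with non-ASCII PII values.
mapping = {"N_1": "José Müller", "E_1": "zoë@bücher.example"}


def roundtrip(read_encoding: str) -> dict:
    """Write the mapping as UTF-8, then read it back with the given encoding."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "mapping.json"
        # The writer always uses UTF-8, as save_mapping_atomic is described to do.
        path.write_text(json.dumps(mapping, ensure_ascii=False), encoding="utf-8")
        # The reader's encoding is the variable under test.
        return json.loads(path.read_text(encoding=read_encoding))


good = roundtrip("utf-8")    # matches the writer
bad = roundtrip("cp1252")    # simulates a non-UTF-8 locale default

print(good == mapping)  # values survive the round-trip
print(bad == mapping)   # values come back as mojibake
```

The `cp1252` read does not raise here, because every byte in these sample values happens to be defined in that code page. The file still parses as JSON and the values are silently wrong, which is why the mismatch is easy to miss without a non-ASCII regression test.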