GitHub user eugenegujing added a comment to the discussion: Task ideas for the dkNet-AI · Apache Texera Agent Hackathon
# DataGuard: Claude-Code-style Permissioned AI for Data Cleaning

**Themes:** Human-Agent Collaboration · Data / AI for Science

**The gap.** Texera can run cleaning logic, but data-cleaning *decisions* today are made in one of two ways: silently by an auto-clean script (no audit trail, no human control), or manually in a notebook (not reproducible, doesn't scale). Neither works for scientific data, where placeholder values, group-imbalanced missingness, and "outliers that are actually meaningful rare cases" need a domain expert in the loop.

**The idea.** Bring the Claude Code permission model into Texera, but apply it to *data* instead of code. A conversational `DataGuard` agent profiles a dataset, proposes one fix at a time, and asks the user to **Allow / Allow & remember / Deny / Modify** before any mutating action. Every approved action is logged with its evidence and confidence, producing an auditable, replayable decision trail.

**Why it fits the Agent Hackathon**

- **No engine changes.** It lives entirely in `agent-service` plus the chat panel. It needs no Amber pause/resume, no new operators, and no new protocol; going chat-first sidesteps the hardest path for HITL workflows.
- **Targets domain-expert users** (bio, medicine, social science) who consume data but aren't developers. They are served at the data-decision layer rather than the code layer.
- **Reproducible by design.** `--replay decision_log.csv` rebuilds the cleaned dataset without re-invoking the LLM, which makes correctness concrete and measurable.
- **Aligned with the Data / AI for Science theme.** Every cleaning decision is explained, justified, and reversible.

GitHub link: https://github.com/apache/texera/discussions/5059#discussioncomment-16924731
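
To make the Allow / Allow & remember / Deny / Modify gate and the replayable decision log from the proposal more concrete, here is a minimal Python sketch. It is not Texera or `agent-service` code: every name in it (`Verdict`, `Proposal`, `DataGuardSession`, `apply_fix`, `write_log`, `replay`) is hypothetical. It assumes rows are plain dicts of strings, that the only fix type is "replace one value in one column", and that "remember" means auto-approving an identical fix later. The LLM that would generate proposals is left out; a callback stands in for the user's answer in the chat panel.

```python
import csv
import io
from dataclasses import dataclass, asdict, replace
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    ALLOW_REMEMBER = "allow_remember"
    DENY = "deny"
    MODIFY = "modify"


@dataclass(frozen=True)
class Proposal:
    column: str        # column the fix targets
    target: str        # value to match, e.g. a placeholder such as "-999"
    replacement: str   # value to write instead ("" stands for missing)
    evidence: str      # short justification shown to the user
    confidence: float  # agent's self-reported confidence (assumed 0..1)


LOG_FIELDS = ["column", "target", "replacement", "evidence", "confidence", "verdict"]


def apply_fix(rows, column, target, replacement):
    """Return new rows with `target` replaced in `column`; the input is not mutated."""
    return [
        {**row, column: replacement} if row[column] == target else dict(row)
        for row in rows
    ]


class DataGuardSession:
    def __init__(self, rows):
        self.rows = [dict(r) for r in rows]
        self.log = []            # approved decisions only
        self.remembered = set()  # fixes approved with "Allow & remember"

    def review(self, proposal, ask):
        """Gate one proposal. `ask(proposal)` returns (Verdict, replacement override or None)."""
        key = (proposal.column, proposal.target, proposal.replacement)
        if key in self.remembered:
            verdict, override = Verdict.ALLOW, None  # previously remembered, no prompt
        else:
            verdict, override = ask(proposal)
        if verdict is Verdict.DENY:
            return False  # nothing is mutated, nothing is logged
        if verdict is Verdict.MODIFY and override is not None:
            proposal = replace(proposal, replacement=override)
        if verdict is Verdict.ALLOW_REMEMBER:
            self.remembered.add(key)
        self.rows = apply_fix(self.rows, proposal.column, proposal.target, proposal.replacement)
        self.log.append({**asdict(proposal), "verdict": verdict.value})
        return True


def write_log(log):
    """Serialize the decision log to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=LOG_FIELDS)
    writer.writeheader()
    writer.writerows(log)
    return buf.getvalue()


def replay(rows, log_csv):
    """Rebuild the cleaned rows from the raw rows and a decision log, with no agent involved."""
    for entry in csv.DictReader(io.StringIO(log_csv)):
        rows = apply_fix(rows, entry["column"], entry["target"], entry["replacement"])
    return rows
```

A short usage example under the same assumptions: one proposal is allowed, one is denied, and replaying the log against the raw rows reproduces the cleaned rows.

```python
raw = [
    {"id": "1", "glucose": "-999"},
    {"id": "2", "glucose": "5.4"},
    {"id": "3", "glucose": "-999"},
]
session = DataGuardSession(raw)
fix = Proposal("glucose", "-999", "", "placeholder value in 2 of 3 rows", 0.9)
session.review(fix, lambda p: (Verdict.ALLOW, None))
session.review(Proposal("id", "2", "X", "demo", 0.1), lambda p: (Verdict.DENY, None))
cleaned_again = replay(raw, write_log(session.log))
```

Logging only approved actions keeps replay a straight fold over the log; a fuller design would likely also record denials for audit purposes.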
