GitHub user eugenegujing added a comment to the discussion: Task ideas for the 
dkNet-AI · Apache Texera Agent Hackathon

# DataGuard: Claude-Code-style Permissioned AI for Data Cleaning

**Themes:** Human-Agent Collaboration · Data / AI for Science

**The gap.** Texera can run cleaning logic, but data-cleaning *decisions* today 
are made in one of two ways: silently by an auto-clean script (no audit trail, 
no human control), or manually in a notebook (no reproducibility, doesn't scale). 
Neither works for scientific data, where placeholder values, group-imbalanced 
missingness, and "outliers that are actually meaningful rare cases" need a 
domain expert in the loop.

**The idea.** Bring the Claude Code permission model into Texera, but applied 
to *data* instead of code. A conversational `DataGuard` agent profiles a 
dataset, proposes one fix at a time, and asks the user to **Allow / Allow & 
remember / Deny / Modify** before any mutating action. Every approved action is 
logged with evidence and confidence, producing an auditable, replayable 
decision trail.
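
The permission loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the proposed implementation: `Decision`, `Proposal`, `PermissionGate`, the `ask` callback, and the choice to scope "remember" to an `(action, column)` pair are all hypothetical names and design choices, not part of Texera or Claude Code.

```python
from dataclasses import dataclass, field, replace
from enum import Enum
from typing import Callable, Optional, Tuple


class Decision(Enum):
    ALLOW = "allow"
    ALLOW_AND_REMEMBER = "allow_and_remember"
    DENY = "deny"
    MODIFY = "modify"


@dataclass(frozen=True)
class Proposal:
    action: str        # hypothetical action name, e.g. "replace_placeholder"
    column: str
    params: dict
    evidence: str      # what the profiler observed
    confidence: float  # the agent's own estimate, 0.0 to 1.0


# The callback returns the user's decision plus replacement params
# (the params are only used for MODIFY).
AskFn = Callable[[Proposal], Tuple[Decision, Optional[dict]]]


@dataclass
class PermissionGate:
    ask: AskFn
    remembered: set = field(default_factory=set)
    log: list = field(default_factory=list)

    def review(self, proposal: Proposal) -> Optional[Proposal]:
        """Return the proposal to execute, or None if the user denied it."""
        key = (proposal.action, proposal.column)  # assumed "remember" scope
        if key in self.remembered:
            # A remembered rule approves without prompting the user again.
            decision, new_params = Decision.ALLOW, None
        else:
            decision, new_params = self.ask(proposal)

        if decision is Decision.DENY:
            return None
        if decision is Decision.MODIFY:
            proposal = replace(proposal, params=new_params or proposal.params)
        if decision is Decision.ALLOW_AND_REMEMBER:
            self.remembered.add(key)

        # Only approved actions reach the log, each with evidence and confidence.
        self.log.append({
            "action": proposal.action,
            "column": proposal.column,
            "params": proposal.params,
            "evidence": proposal.evidence,
            "confidence": proposal.confidence,
            "decision": decision.value,
        })
        return proposal
```

In this sketch the gate sits between the agent and any mutating step: the agent calls `review()` for each proposal and executes only what comes back. In a chat panel, `ask` would presumably render the four buttons and wait for the user's choice.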

**Why it fits the Agent Hackathon**

- **No engine changes.** DataGuard lives entirely in `agent-service` plus the 
chat panel. It requires no Amber pause/resume, no new operators, and no new 
protocol; going chat-first sidesteps the hardest path for HITL workflows.
- **Targets domain-expert users** (bio, medicine, social science) who consume 
data but aren't developers, and meets them at the data-decision layer rather 
than the code layer.
- **Reproducible by design.** `--replay decision_log.csv` rebuilds the cleaned 
dataset without re-invoking the LLM. Replaying the log should reproduce the 
same cleaned dataset, which makes correctness concrete and measurable.
- **Aligned with the Data / AI for Science theme** — every cleaning decision is 
explained, justified, and reversible.
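
To make the replay bullet above concrete, here is a rough sketch of what replaying a decision log could look like. The log columns (`action`, `column`, `params`), the JSON encoding of `params`, and the two handler functions are assumptions made for illustration; the proposal does not specify the format of `decision_log.csv`.

```python
import csv
import io
import json


def replace_value(rows, column, old, new):
    """Replace cells equal to `old` in `column` with `new`."""
    return [{**r, column: new if r[column] == old else r[column]} for r in rows]


def drop_rows_equal(rows, column, value):
    """Drop rows whose `column` equals `value`."""
    return [r for r in rows if r[column] != value]


# Hypothetical registry mapping logged action names to deterministic handlers.
ACTIONS = {"replace_value": replace_value, "drop_rows_equal": drop_rows_equal}


def replay(rows, log_file):
    """Apply logged decisions in order; no model call is involved."""
    for entry in csv.DictReader(log_file):
        handler = ACTIONS[entry["action"]]
        rows = handler(rows, entry["column"], **json.loads(entry["params"]))
    return rows


# Build a tiny in-memory log (stands in for decision_log.csv).
log = io.StringIO()
writer = csv.DictWriter(log, fieldnames=["action", "column", "params"])
writer.writeheader()
writer.writerow({"action": "replace_value", "column": "age",
                 "params": json.dumps({"old": "-999", "new": ""})})
writer.writerow({"action": "drop_rows_equal", "column": "site",
                 "params": json.dumps({"value": "test"})})
log.seek(0)

raw = [
    {"age": "34", "site": "A"},
    {"age": "-999", "site": "B"},
    {"age": "51", "site": "test"},
]
cleaned = replay(raw, log)
```

Because every handler is a pure function of the rows and the logged parameters, replaying the same log over the same input yields the same output, which is the property the `--replay` flag relies on.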


GitHub link: 
https://github.com/apache/texera/discussions/5059#discussioncomment-16924731
