Re: [PR] add Privacy-LLM gate-check validator [airflow-steward]

via GitHub Sun, 24 May 2026 19:09:28 -0700


Copilot commented on code in PR #215:
URL: https://github.com/apache/airflow-steward/pull/215#discussion_r3295764399



##########
tools/skill-validator/src/skill_validator/__init__.py:
##########
@@ -68,9 +68,47 @@
 REQUIRED_FRONTMATTER_KEYS = {"name", "description", "license"}
 OPTIONAL_FRONTMATTER_KEYS = {"when_to_use", "mode"}
 ALLOWED_LICENSES = {"Apache-2.0"}
-# MISSION mode taxonomy — see docs/modes.md.
-# "Auto-merge" deliberately excluded: it is off per MISSION sequencing.
-ALLOWED_MODES = {"Triage", "Mentoring", "Drafting", "Pairing"}
+
+
+def _read_mode_table() -> dict[str, str]:
+    """Read the canonical MISSION mode table from ``docs/modes.md``."""
+    starts = [Path.cwd().resolve(), Path(__file__).resolve().parent]
+    roots = []
+    for start in starts:
+        roots.extend([start, *start.parents])
+
+    for root in roots:
+        modes_doc = root / DOCS_DIR / "modes.md"
+        if not modes_doc.is_file():
+            continue
+        text = modes_doc.read_text(encoding="utf-8")
+        try:
+            modes_table = text.split("## Modes at a glance", 1)[1].split("## 
Triage", 1)[0]
+        except IndexError:
+            break
+        modes: dict[str, str] = {}
+        for line in modes_table.splitlines():
+            if not line.startswith("| **"):
+                continue
+            cells = [cell.strip() for cell in line.strip("|").split("|")]
+            if len(cells) < 3:
+                continue
+            mode = cells[0].strip("*")
+            status = cells[2].strip()
+            if mode and status:
+                modes[mode] = status
+        if modes:
+            return modes
+
+    msg = "could not read mode taxonomy from docs/modes.md"
+    raise RuntimeError(msg)
+
+
+# MISSION mode taxonomy — docs/modes.md is canonical.
+_MODE_STATUS_BY_NAME = _read_mode_table()
+_MODE_TAXONOMY = set(_MODE_STATUS_BY_NAME)
+_OFF_MODES = {mode for mode, status in _MODE_STATUS_BY_NAME.items() if status 
== "off"}
+ALLOWED_MODES = _MODE_TAXONOMY - _OFF_MODES
 

Review Comment:
   Reading `docs/modes.md` from disk at import time makes `import 
skill_validator` depend on repository layout and presence of `docs/` (and adds 
filesystem I/O during import). This is likely to break when the validator is 
installed as a package without docs, run from unusual working directories, or 
used as a library. Consider lazy-loading the table (e.g., compute on first use 
in `run_validation`), or shipping an embedded/fallback mode table and only 
asserting doc-consistency in tests/CI.
   



##########
tools/skill-validator/src/skill_validator/__init__.py:
##########
@@ -200,6 +240,36 @@
     ),
 ]
 
+# ---------------------------------------------------------------------------
+# Privacy-LLM gate-check constants (write-skill/security-checklist.md § 
Pattern 6)
+# ---------------------------------------------------------------------------
+
+# Modes that can process external / attacker-controlled content and need the
+# Privacy-LLM gate when they read private tracker bodies.  Derived from
+# docs/modes.md taxonomy constants above: Pairing is intentionally excluded
+# because the human remains in the loop; Auto-merge is currently excluded only
+# because it is in _OFF_MODES.  When the first Auto-merge skill ships, remove
+# it from _OFF_MODES so body-reading Auto-merge skills are gated by default.
+_PRIVACY_EXTERNAL_CONTENT_MODES: frozenset[str] = frozenset(ALLOWED_MODES - 
{"Pairing"})
+
+_TRACKER_PLACEHOLDER = "<tracker>"
+_TRACKER_ISSUE_VIEW_RE = re.compile(r"\bgh\s+issue\s+view\b")

Review Comment:
   `_has_tracker_body_read()` treats any `gh issue view` occurrence as a 
private `<tracker>` body read. Because `validate_privacy_patterns()` only 
checks that `<tracker>` appears somewhere in the file (not that it is tied to 
the `gh issue view` command), this can yield false positives for skills that 
mention `<tracker>` in prose but run `gh issue view` against the current 
repo/another repo. To make this check accurate, scope `gh issue view` detection 
to commands that explicitly reference the tracker (e.g., require `--repo 
<tracker>` on the same logical command) or otherwise tie the read to 
`<tracker>`.
   



##########
tools/skill-validator/tests/test_validator.py:
##########
@@ -176,6 +184,26 @@ def test_mode_optional(self, tmp_path: Path) -> None:
         violations = list(validate_frontmatter(path, text))
         assert violations == []
 
+    def test_mode_taxonomy_matches_docs_modes(self) -> None:
+        docs_modes = Path(__file__).parents[3] / "docs" / "modes.md"
+        modes_table = (
+            docs_modes.read_text(encoding="utf-8")
+            .split("## Modes at a glance", 1)[1]
+            .split("## Triage", 1)[0]
+        )

Review Comment:
   The test hard-codes repository structure via `Path(__file__).parents[3]`, 
which is brittle (any directory nesting change can break the test). Since the 
production code already has logic to locate `docs/modes.md` (via 
`_read_mode_table()`), consider reusing that location strategy in the test (or 
using the same repo-root discovery helper used elsewhere in the validator) so 
the test fails only on real taxonomy drift, not path layout changes.



##########
tools/skill-validator/src/skill_validator/__init__.py:
##########
@@ -68,9 +68,47 @@
 REQUIRED_FRONTMATTER_KEYS = {"name", "description", "license"}
 OPTIONAL_FRONTMATTER_KEYS = {"when_to_use", "mode"}
 ALLOWED_LICENSES = {"Apache-2.0"}
-# MISSION mode taxonomy — see docs/modes.md.
-# "Auto-merge" deliberately excluded: it is off per MISSION sequencing.
-ALLOWED_MODES = {"Triage", "Mentoring", "Drafting", "Pairing"}
+
+
+def _read_mode_table() -> dict[str, str]:
+    """Read the canonical MISSION mode table from ``docs/modes.md``."""
+    starts = [Path.cwd().resolve(), Path(__file__).resolve().parent]
+    roots = []
+    for start in starts:
+        roots.extend([start, *start.parents])
+
+    for root in roots:
+        modes_doc = root / DOCS_DIR / "modes.md"
+        if not modes_doc.is_file():
+            continue
+        text = modes_doc.read_text(encoding="utf-8")
+        try:
+            modes_table = text.split("## Modes at a glance", 1)[1].split("## 
Triage", 1)[0]
+        except IndexError:
+            break
+        modes: dict[str, str] = {}
+        for line in modes_table.splitlines():
+            if not line.startswith("| **"):
+                continue
+            cells = [cell.strip() for cell in line.strip("|").split("|")]
+            if len(cells) < 3:
+                continue
+            mode = cells[0].strip("*")
+            status = cells[2].strip()
+            if mode and status:
+                modes[mode] = status
+        if modes:
+            return modes
+
+    msg = "could not read mode taxonomy from docs/modes.md"
+    raise RuntimeError(msg)

Review Comment:
   This runtime error doesn’t include actionable debugging context (which paths 
were searched, whether `modes.md` was found but malformed, and what 
headings/table markers were missing). Including the resolved `modes_doc` 
path(s) checked and/or the parsing expectation (missing '## Modes at a glance' 
/ '## Triage') would make failures much easier to diagnose in CI and when 
running from different directories.



##########
tools/skill-validator/src/skill_validator/__init__.py:
##########
@@ -638,6 +708,118 @@ def validate_principle_compliance(path: Path, text: str) 
-> Iterable[Violation]:
         )
 
 
+# ---------------------------------------------------------------------------
+# Privacy-LLM gate-check (write-skill/security-checklist.md § Pattern 6)
+# ---------------------------------------------------------------------------
+
+
+def _heading_text(raw: str) -> str:
+    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", raw.strip())
+    text = text.strip("#").strip()
+    return text
+
+
+def _fenced_code_blocks(text: str) -> list[str]:
+    return [m.group(0) for m in _FENCED_CODE_RE.finditer(text)]
+
+
+def _fenced_code_blocks_in_privacy_gate_sections(text: str) -> list[str]:
+    """Return fenced code blocks inside Prerequisites / Preflight / Step 0 
sections."""
+    heading_re = re.compile(r"^(#{1,6})\s+(.+)$", re.MULTILINE)
+    headings = list(heading_re.finditer(text))
+    heading_index = 0
+    stack: list[tuple[int, str]] = []
+    blocks: list[str] = []
+
+    for block in _FENCED_CODE_RE.finditer(text):
+        while heading_index < len(headings) and 
headings[heading_index].start() < block.start():
+            heading = headings[heading_index]
+            level = len(heading.group(1))
+            title = _heading_text(heading.group(2))
+            stack = [(old_level, old_title) for old_level, old_title in stack 
if old_level < level]
+            stack.append((level, title))
+            heading_index += 1
+
+        titles = [title for _, title in stack]
+        if any(_ANTI_EXAMPLE_SECTION_RE.search(title) for title in titles):
+            continue
+        if any(_PRIVACY_GATE_SECTION_RE.search(title) for title in titles):
+            blocks.append(block.group(0))
+
+    return blocks
+
+
+def _shell_logical_lines(text: str) -> list[str]:
+    lines: list[str] = []
+    current: list[str] = []
+    for line in text.splitlines():
+        stripped = line.rstrip()
+        if stripped.endswith("\\"):
+            current.append(stripped[:-1].strip())
+            continue
+        if current:
+            current.append(stripped.strip())
+            lines.append(" ".join(part for part in current if part))
+            current = []
+        else:
+            lines.append(line)
+    if current:
+        lines.append(" ".join(part for part in current if part))
+    return lines
+
+
+def _has_tracker_body_read(text: str) -> bool:
+    body = _strip_html_comments(_skill_body(text))
+    if _TRACKER_ISSUE_VIEW_RE.search(body):
+        return True
+    for command in _shell_logical_lines(body):
+        if _TRACKER_ISSUE_API_RE.search(command) and not 
_TRACKER_ISSUE_API_MUTATION_RE.search(command):
+            return True
+    return False

Review Comment:
   `_has_tracker_body_read()` treats any `gh issue view` occurrence as a 
private `<tracker>` body read. Because `validate_privacy_patterns()` only 
checks that `<tracker>` appears somewhere in the file (not that it is tied to 
the `gh issue view` command), this can yield false positives for skills that 
mention `<tracker>` in prose but run `gh issue view` against the current 
repo/another repo. To make this check accurate, scope `gh issue view` detection 
to commands that explicitly reference the tracker (e.g., require `--repo 
<tracker>` on the same logical command) or otherwise tie the read to 
`<tracker>`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] add Privacy-LLM gate-check validator [airflow-steward]

Reply via email to