fitzee commented on PR #39915:
URL: https://github.com/apache/superset/pull/39915#issuecomment-4453159061

   ## Griffen Review — `fix(mcp): relax column name regex, improve 
generate_chart validation`
   
   *Reviewed by Griffen (AI Distinguished Engineer), with Matt Fitzgerald. 
Reach out to Matt directly if anything needs clarifying.*
   
   ---
   
   ## Context
   
   **Sources consulted:** PR description, diff, existing review comments 
(Copilot, CodeAnt)
   **Missing context:** No linked Shortcut story or GitHub issue — reviewing 
based on PR description alone.
   **Intent:** Remove an overly strict regex that rejected valid real-world 
column names (digit-prefixed, hyphenated, locale chars) from the MCP 
`generate_chart` validation layer, while tightening sanitization to compensate 
and improving error messages for LLM clients. *(PR description)*
   
   ---
   
   ## Door Classification
   
   **Two-way door**
   
   | Dimension | Assessment |
   |---|---|
   | Rollback complexity | **Simple** — revert restores the regex, removes the 
new validators, no state to unwind |
   | Blast radius | **Narrow** — scoped entirely to MCP `generate_chart` 
validation; no schema migrations, no data mutations, no public REST API surface 
affected |
   | Deployment reversible | **Yes** — stateless validation logic, no 
persistence, no protocol change |
   | Reversibility mechanisms | Not needed; inherently reversible |
   
   One caveat worth noting: if any external LLM client has been prompted or 
fine-tuned against the old Pydantic error message text, the improved error 
messages could silently break that client's error-handling logic. Still a 
two-way door operationally, but worth keeping in mind if you have clients with 
hardcoded error-response parsing.
   
   ---
   
   ## Security
   
   **The security story holds — and is improved in one spot.**
   
   - `ColumnRef.sanitize_name()` already calls `sanitize_user_input` with 
`check_sql_keywords=True`. Removing the regex doesn't weaken it.
   - `FilterConfig.sanitize_column()` previously lacked 
`check_sql_keywords=True` — this PR adds it. Good catch.
   - `BigNumberChartConfig.temporal_column` had no validator at all — this PR 
adds `sanitize_temporal_column`. Also good.
   - The old regex provided no security value not already covered by 
`sanitize_user_input`. Removing it is correct.
   
   ---
   
   ## Findings
   
   **[MEDIUM] [Two-way] — `test_script_tag_neutralized` success branch doesn't 
assert content**
   The test's success path asserts `col.name` is truthy but doesn't check that 
the sanitized value doesn't still contain `alert(1)`. An input where nh3 strips 
the tags but preserves the inner text would pass the test while storing an 
unsanitized payload.
   *Confidence: Fact (observable in test)*
   *Mitigation:* Add `assert "alert" not in col.name` in the `try` branch.
   
   **[LOW] [Two-way] — Missing XSS/event-handler test for 
`BigNumberChartConfig.temporal_column`**
   `TestColumnRefNameRelaxedPattern` and `TestFilterConfigColumnRelaxedPattern` 
both test XSS and event-handler injection. 
`TestBigNumberTemporalColumnRelaxedPattern` only tests SQL injection — the 
`on...=` pattern is untested for this field.
   *Confidence: Fact*
   *Mitigation:* Add a `test_event_handler_injection_blocked` to 
`TestBigNumberTemporalColumnRelaxedPattern`.
   
   **[NIT] — `_format_single_error` returns `""` for "no suggestion"**
   Returning an empty string to signal "no suggestion" is implicit; `tuple[str, 
str | None]` with `None` would be self-documenting at the call sites. Minor — 
the `if suggestion:` guards handle it correctly as-is.
   
   ---
   
   ## What's Good
   
   The **core design decision is sound**: the regex was defense-in-depth 
theater that blocked legitimate inputs while providing no security value the 
sanitizers didn't already cover. Removing it and simultaneously adding 
`check_sql_keywords=True` to the two fields that were missing it is the right 
tradeoff — stricter where it matters, more permissive where it was arbitrary.
   
   The **`sanitize_user_input` emptiness check refactor** is a genuine 
improvement. Inputs that reduce to empty after HTML stripping or Unicode 
normalization previously could slip through; the new post-strip checks close 
that gap correctly.
   
   The **docstring examples** for all 7 chart types are the right call for an 
MCP tool consumed by LLM clients — concrete examples do more work than schema 
descriptions.
   
   ---
   
   ## Architectural Read
   
   Narrow, well-scoped cleanup. The pattern it follows — rely on explicit 
sanitizers rather than input-shaping regexes, and make the validation layer 
match the actual security invariant rather than a proxy for it — is the right 
model for an MCP validation layer handling real-world data. No architectural 
concerns.
   
   ---
   
   **Verdict: LGTM with nits** — the two test gaps are worth fixing but not 
worth a full review cycle. Core design is sound.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to