choo121600 commented on code in PR #454: URL: https://github.com/apache/airflow-steward/pull/454#discussion_r3361127165
########## skills/pr-management-code-review/slop-detection.md: ########## @@ -0,0 +1,246 @@ +<!-- SPDX-License-Identifier: Apache-2.0 + https://www.apache.org/licenses/LICENSE-2.0 --> + +# Slop detection — structural scan + +This step runs immediately after Step 2 (diff and metadata fetched), +before the full line-by-line review in Step 3. It is cheap — purely +structural analysis, no LLM token budget — and it short-circuits the +review when a PR is clearly not a genuine upstream contribution. + +"Slop" here means a PR whose structure demonstrates it is a **class +project, personal experiment, or low-effort AI-generated submission** +being pushed into the upstream repository. The goal is to catch +crystal-clear cases early, not to flag every imperfect PR. When in +doubt, proceed with the normal review. + +--- + +## Signals + +Signals are split into **hard** (individually strong) and **soft** +(individually weak; accumulate). Most checks use only data already +in the Step 2 payload. H1 infers new-directory status from +`files[].changeType` (no base-ref tree lookup needed). H2 matches +full-URL fork references in the PR body (no issue-resolution API +call needed). H3–H5 and all soft signals are fully derivable from +the Step 2 payload with no extra `gh` calls. + +### Hard signals + +Each hard signal alone has a moderate probability of indicating slop; +two or more together are nearly conclusive. + +| ID | Signal | How to detect | +|---|---|---| +| H1 | **New standalone top-level directory** | All files under a first-level directory path are `changeType: ADDED` (the directory is entirely new in this diff), AND the directory contains a project-root file at its first level (`README.md`, `pyproject.toml`, `package.json`, `go.mod`, `pom.xml`, etc.), AND the directory name and/or any README within it suggest it is an independent project unrelated to the upstream codebase. Detection uses `files[].path` and `files[].changeType` only — no base-ref tree lookup. | +| H2 | **Private-fork issue URL in PR body** | The body contains a full GitHub issue or PR URL pointing to a repository that is not the upstream repo — pattern: `https://github.com/<author>/<repo-name>/(issues\|pull)/\d+` where `<repo-name>` differs from the upstream repo. Match against the raw body string. Do not attempt to resolve bare `#N` references; only flag explicit fork URLs. | +| H3 | **Fork merge-commit flood** | The commit list contains 3+ commit messages matching `^Merge (pull request|branch) #\d+ from` that all share the same fork prefix and were authored within a narrow window (< 60 minutes apart). | +| H4 | **Multi-author team project** | Commits are authored by 3 or more distinct GitHub logins, yet the PR is opened by a single account — typical of a university team pushing their entire fork history. | +| H5 | **Area sprawl** | Changed files span 5 or more distinct top-level directories (or well-known project sub-areas) with no discernible semantic relationship. Count using the first two path components of each changed file. | + +### Soft signals + +| ID | Signal | How to detect | +|---|---|---| +| S1 | **Ticket-style PR title** | Title matches patterns like `[Ticket #N]`, `ts/ticket-\d+`, `sprint-N`, `task-\d+`, or contains a student name followed by a ticket reference. | +| S2 | **Template-only PR body** | Body contains no prose beyond the PR template boilerplate (checked: no description above the first `---`, no non-template `closes:` / `related:` references to the upstream repo). | +| S3 | **No real CI** | `statusCheckRollup` contains only external bots (e.g. Mergeable, WIP, boring-cyborg) and zero entries from the project's own CI workflows. | +| S4 | **Label sprawl** | PR carries 3+ `area:` labels spanning unrelated subsystems, suggesting the author ran an automated labeller or copied labels from multiple separate changes. | +| S5 | **Commit messages reference internal sprint/ticket tooling** | 2+ commit messages contain phrases like `sprint`, `kanban`, `jira`, `ticket #`, `story #`, or course-code patterns like `CSS 566A` (university course identifiers). | + +--- + +## Threshold for early exit + +Run the check after computing which signals fire. Apply the rules below: + +| Condition | Action | +|---|---| +| **2+ hard signals** | Early exit — crystal-clear slop | +| **1 hard signal + 3+ soft signals** | Early exit — crystal-clear slop | +| **1 hard signal, < 3 soft** | Note only — add `[suspicious]` chip to headline, proceed with normal review | +| **0 hard signals, any soft** | Note only — add `[suspicious]` chip if ≥ 2 soft signals, otherwise silent | + +The `[suspicious]` note-only path does **not** interrupt the review +flow. It surfaces in the headline so the maintainer has the information +but is not forced to act on it before seeing the diff. + +Early exit **does** interrupt the flow: Step 3 and beyond are skipped. +The maintainer chooses an action (see below) before the skill moves on. + +--- + +## Maintainer interaction on early exit + +**Propose** a slop report in place of the normal Step 3 prompt: + +```text +⚠ Slop detection fired for PR #<N> — <title> + https://github.com/<upstream>/pull/<N> + +Hard signals: + [H1] New unrecognised top-level directory: `team_project/` + → team_project/README.md mentions "CSS 566A — Software Management, + University of Washington Bothell" + [H3] Fork merge-commit flood: 6 "Merge pull request" commits from + break-through-19/airflow within a 35-minute window + [H4] Multi-author team project: 3 distinct commit authors + (break-through-19, sanwar47, sharan-s2k) on a single-author PR + [H5] Area sprawl: changes span go-sdk/, airflow-core/ui/, + docs/adr/, team_project/ — no semantic relationship + +Soft signals: + [S1] Ticket-style title: "Poorani ts/ticket 36 adr document review" + [S2] Template-only PR body (no description, private-fork issue ref only) + [S3] No real CI (only Mergeable + WIP bots ran) + [S4] Label sprawl: area:UI + area:task-sdk + area:go-sdk + +This PR shows crystal-clear structural signals of a team class project +or personal experiment being submitted to the upstream repository. Full +line-by-line review is not warranted until these signals are resolved. + +Action? + [C]omment — post a contribution-guidelines warning on the PR + [X] — close PR, lock conversation, show report-to-GitHub link + [R]eview — proceed with full review anyway (e.g. to extract + the legitimate commits from the noise) + [S]kip — skip this PR this session + [Q]uit — end the session +``` + +Wait for explicit input before taking any action. The maintainer may +want to pick multiple actions sequentially (e.g. `[C]` then `[X]`). +If they do, execute in order and confirm before each write. + +--- + +## Action: [C] — post contribution-guidelines warning + +Draft and confirm a PR comment using the template below, then post: + +```bash +gh pr comment <N> --repo <repo> --body "$(cat <<'BODY' +[comment body here — see template] +BODY +)" +``` + +### Warning comment template Review Comment: Most of this lgtm. One minor suggestion: the current warning comment template assumes that the detected structural issues are definitely present. Since these checks are heuristic-based, there is always a chance of false positives. It might be worth adding a short note encouraging contributors to reply if they believe the PR has been misclassified. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
