andreahlert commented on code in PR #454: URL: https://github.com/apache/airflow-steward/pull/454#discussion_r3364512798
########## skills/pr-management-code-review/slop-detection.md: ########## @@ -0,0 +1,267 @@ +<!-- SPDX-License-Identifier: Apache-2.0 + https://www.apache.org/licenses/LICENSE-2.0 --> + +# Slop detection — structural scan + +This step runs immediately after Step 2 (diff and metadata fetched), +before the full line-by-line review in Step 3. It is cheap — purely +structural analysis, no LLM token budget — and it short-circuits the +review when a PR is clearly not a genuine upstream contribution. + +"Slop" here means a PR whose structure demonstrates it is a **class +project, personal experiment, or low-effort AI-generated submission** +being pushed into the upstream repository. The goal is to catch +crystal-clear cases early, not to flag every imperfect PR. When in +doubt, proceed with the normal review. + +--- + +## Signals + +Signals are split into **hard** (individually strong) and **soft** +(individually weak; accumulate). Most checks use only data already +in the Step 2 payload. H1 detects new standalone directories from +the cached unified diff (`new file mode` / `--- /dev/null` headers) +and `files[].path` — no `changeType` field, no base-ref tree lookup. +H2 matches full-URL fork references in the PR body (no +issue-resolution API call needed). H3–H5 and S2–S5 are fully +derivable from the Step 2 payload with no extra `gh` calls. S1 uses +the PR title from the Step 1 working-list cache. + +### Hard signals + +Each hard signal alone has a moderate probability of indicating slop; +two or more together are nearly conclusive. + +| ID | Signal | How to detect | +|---|---|---| +| H1 | **New standalone top-level directory** | The cached unified diff contains `+++ b/<dir>/...` entries where every path shares a single first-level directory prefix AND every file in that prefix appears as a new file in the diff (signalled by `new file mode` or `--- /dev/null` headers), AND the directory contains a project-root file at its first level (`README.md`, `pyproject.toml`, `package.json`, `go.mod`, `pom.xml`, etc.), AND the directory name and/or any README within it suggest it is an independent project unrelated to the upstream codebase. Detection uses the cached unified diff and `files[].path` — no `changeType` field, no base-ref tree lookup. | +| H2 | **Private-fork issue URL in PR body** | The body contains a full GitHub issue or PR URL pointing to a repository that is not the upstream repo — pattern: `https://github.com/<author>/<repo-name>/(issues\|pull)/\d+` where `<repo-name>` differs from the upstream repo. Match against the raw body string. Do not attempt to resolve bare `#N` references; only flag explicit fork URLs. | +| H3 | **Fork merge-commit flood** | The commit list contains 3+ commit messages matching `^Merge (pull request|branch) #\d+ from` that all share the same fork prefix and were authored within a narrow window (< 60 minutes apart). | +| H4 | **Multi-author team project** | Commits are authored by 3 or more distinct GitHub logins, yet the PR is opened by a single account — typical of a university team pushing their entire fork history. | +| H5 | **Area sprawl** | Changed files span 5 or more distinct top-level directories (or well-known project sub-areas) with no discernible semantic relationship. Count using the first two path components of each changed file. | + +### Soft signals + +| ID | Signal | How to detect | +|---|---|---| +| S1 | **Ticket-style PR title** | Title matches patterns like `[Ticket #N]`, `ts/ticket-\d+`, `sprint-N`, `task-\d+`, or contains a student name followed by a ticket reference. | +| S2 | **Template-only PR body** | Body contains no prose beyond the PR template boilerplate (checked: no description above the first `---`, no non-template `closes:` / `related:` references to the upstream repo). | +| S3 | **No real CI** | `statusCheckRollup` contains only external bots (e.g. Mergeable, WIP, boring-cyborg) and zero entries from the project's own CI workflows. | +| S4 | **Label sprawl** | PR carries 3+ `area:` labels spanning unrelated subsystems, suggesting the author ran an automated labeller or copied labels from multiple separate changes. | +| S5 | **Commit messages reference internal sprint/ticket tooling** | 2+ commit messages contain phrases like `sprint`, `kanban`, `jira`, `ticket #`, `story #`, or course-code patterns like `CSS 566A` (university course identifiers). | + +--- + +## Threshold for early exit + +Run the check after computing which signals fire. Apply the rules below: + +| Condition | Action | +|---|---| +| **2+ hard signals** | Early exit — crystal-clear slop | +| **1 hard signal + 3+ soft signals** | Early exit — crystal-clear slop | +| **1 hard signal, < 3 soft** | Note only — emit `⚠ [suspicious] — <fired signal IDs>` after the scan, proceed with normal review | +| **0 hard signals, any soft** | Note only — emit `⚠ [suspicious] — <fired signal IDs>` if ≥ 2 soft signals, otherwise silent | + +**H3 and H4 are correlated.** Both arise from the same root cause: a +team developed on a shared fork and merged internal PRs before sending +one upstream. When H3 and H4 fire *together* and no other hard signal +fires, treat them as a single hard signal for the "2+ hard signals" +threshold — a H3+H4-only pair meets note-only, not early exit. Early +exit on H3+H4 requires at least one independent additional hard signal +(H1, H2, or H5). + +The `[suspicious]` note-only path does **not** interrupt the review +flow. It surfaces in the headline so the maintainer has the information +but is not forced to act on it before seeing the diff. + +Early exit **does** interrupt the flow: Step 3 and beyond are skipped. +The maintainer chooses an action (see below) before the skill moves on. + +--- + +## Maintainer interaction on early exit + +**Propose** a slop report in place of the normal Step 3 prompt: + +```text +⚠ Slop detection fired for PR #<N> — <title> + https://github.com/<upstream>/pull/<N> + +Hard signals: + [H1] New unrecognised top-level directory: `team_project/` + → team_project/README.md mentions "CSS 566A — Software Management, + University of Washington Bothell" + [H3] Fork merge-commit flood: 6 "Merge pull request" commits from + break-through-19/airflow within a 35-minute window + [H4] Multi-author team project: 3 distinct commit authors + (break-through-19, sanwar47, sharan-s2k) on a single-author PR + [H5] Area sprawl: changes span go-sdk/, airflow-core/ui/, + docs/adr/, team_project/ — no semantic relationship Review Comment: **The example fires H5 with four areas, but H5's threshold is five.** `go-sdk/`, `airflow-core/ui/`, `docs/adr/`, `team_project/` is 4. Either add a fifth area to the example or relax the threshold. Suggesting a fifth here: ```suggestion [H5] Area sprawl: changes span go-sdk/, airflow-core/ui/, docs/adr/, providers/amazon/, team_project/ (no semantic relationship) ``` ########## skills/pr-management-code-review/slop-detection.md: ########## @@ -0,0 +1,267 @@ +<!-- SPDX-License-Identifier: Apache-2.0 + https://www.apache.org/licenses/LICENSE-2.0 --> + +# Slop detection — structural scan + +This step runs immediately after Step 2 (diff and metadata fetched), +before the full line-by-line review in Step 3. It is cheap — purely +structural analysis, no LLM token budget — and it short-circuits the Review Comment: **"No LLM token budget" reads a bit strong.** H1 ("README suggest it is an independent project") and H5 ("no discernible semantic relationship") both need the model to actually read and judge, so the scan isn't purely structural yet. This lines up with the "good candidates for deterministic checks" comment on the thread. Small softening: ```suggestion before the full line-by-line review in Step 3. It is cheap (mostly structural; H1 and H5 still need a brief read to judge project intent), and it short-circuits the ``` ########## skills/pr-management-code-review/slop-detection.md: ########## @@ -0,0 +1,267 @@ +<!-- SPDX-License-Identifier: Apache-2.0 + https://www.apache.org/licenses/LICENSE-2.0 --> + +# Slop detection — structural scan + +This step runs immediately after Step 2 (diff and metadata fetched), +before the full line-by-line review in Step 3. It is cheap — purely +structural analysis, no LLM token budget — and it short-circuits the +review when a PR is clearly not a genuine upstream contribution. + +"Slop" here means a PR whose structure demonstrates it is a **class +project, personal experiment, or low-effort AI-generated submission** +being pushed into the upstream repository. The goal is to catch +crystal-clear cases early, not to flag every imperfect PR. When in +doubt, proceed with the normal review. + +--- + +## Signals + +Signals are split into **hard** (individually strong) and **soft** +(individually weak; accumulate). Most checks use only data already +in the Step 2 payload. H1 detects new standalone directories from +the cached unified diff (`new file mode` / `--- /dev/null` headers) +and `files[].path` — no `changeType` field, no base-ref tree lookup. +H2 matches full-URL fork references in the PR body (no +issue-resolution API call needed). H3–H5 and S2–S5 are fully +derivable from the Step 2 payload with no extra `gh` calls. S1 uses +the PR title from the Step 1 working-list cache. + +### Hard signals + +Each hard signal alone has a moderate probability of indicating slop; +two or more together are nearly conclusive. + +| ID | Signal | How to detect | +|---|---|---| +| H1 | **New standalone top-level directory** | The cached unified diff contains `+++ b/<dir>/...` entries where every path shares a single first-level directory prefix AND every file in that prefix appears as a new file in the diff (signalled by `new file mode` or `--- /dev/null` headers), AND the directory contains a project-root file at its first level (`README.md`, `pyproject.toml`, `package.json`, `go.mod`, `pom.xml`, etc.), AND the directory name and/or any README within it suggest it is an independent project unrelated to the upstream codebase. Detection uses the cached unified diff and `files[].path` — no `changeType` field, no base-ref tree lookup. | Review Comment: **H1 and H5 read as mutually exclusive as worded.** "every path shares a single first-level directory prefix" means everything sits under one dir, but H5 (sprawl) needs 5+ dirs and the example fires both together. Scoping H1 to the new-file subset clears it up: ```suggestion | H1 | **New standalone top-level directory** | The cached unified diff contains a subset of `+++ b/<dir>/...` entries that all share one first-level directory prefix AND every file under that prefix appears as a new file in the diff (signalled by `new file mode` or `--- /dev/null` headers), AND that directory contains a project-root file at its first level (`README.md`, `pyproject.toml`, `package.json`, `go.mod`, `pom.xml`, etc.), AND the directory name and/or any README within it suggest it is an independent project unrelated to the upstream codebase. Detection uses the cached unified diff and `files[].path` (no `changeType` field, no base-ref tree lookup). | ``` ########## skills/pr-management-code-review/slop-detection.md: ########## @@ -0,0 +1,267 @@ +<!-- SPDX-License-Identifier: Apache-2.0 + https://www.apache.org/licenses/LICENSE-2.0 --> + +# Slop detection — structural scan + +This step runs immediately after Step 2 (diff and metadata fetched), +before the full line-by-line review in Step 3. It is cheap — purely +structural analysis, no LLM token budget — and it short-circuits the +review when a PR is clearly not a genuine upstream contribution. + +"Slop" here means a PR whose structure demonstrates it is a **class +project, personal experiment, or low-effort AI-generated submission** +being pushed into the upstream repository. The goal is to catch +crystal-clear cases early, not to flag every imperfect PR. When in +doubt, proceed with the normal review. + +--- + +## Signals + +Signals are split into **hard** (individually strong) and **soft** +(individually weak; accumulate). Most checks use only data already +in the Step 2 payload. H1 detects new standalone directories from +the cached unified diff (`new file mode` / `--- /dev/null` headers) +and `files[].path` — no `changeType` field, no base-ref tree lookup. +H2 matches full-URL fork references in the PR body (no +issue-resolution API call needed). H3–H5 and S2–S5 are fully +derivable from the Step 2 payload with no extra `gh` calls. S1 uses +the PR title from the Step 1 working-list cache. + +### Hard signals + +Each hard signal alone has a moderate probability of indicating slop; +two or more together are nearly conclusive. + +| ID | Signal | How to detect | +|---|---|---| +| H1 | **New standalone top-level directory** | The cached unified diff contains `+++ b/<dir>/...` entries where every path shares a single first-level directory prefix AND every file in that prefix appears as a new file in the diff (signalled by `new file mode` or `--- /dev/null` headers), AND the directory contains a project-root file at its first level (`README.md`, `pyproject.toml`, `package.json`, `go.mod`, `pom.xml`, etc.), AND the directory name and/or any README within it suggest it is an independent project unrelated to the upstream codebase. Detection uses the cached unified diff and `files[].path` — no `changeType` field, no base-ref tree lookup. | +| H2 | **Private-fork issue URL in PR body** | The body contains a full GitHub issue or PR URL pointing to a repository that is not the upstream repo — pattern: `https://github.com/<author>/<repo-name>/(issues\|pull)/\d+` where `<repo-name>` differs from the upstream repo. Match against the raw body string. Do not attempt to resolve bare `#N` references; only flag explicit fork URLs. | +| H3 | **Fork merge-commit flood** | The commit list contains 3+ commit messages matching `^Merge (pull request|branch) #\d+ from` that all share the same fork prefix and were authored within a narrow window (< 60 minutes apart). | +| H4 | **Multi-author team project** | Commits are authored by 3 or more distinct GitHub logins, yet the PR is opened by a single account — typical of a university team pushing their entire fork history. | +| H5 | **Area sprawl** | Changed files span 5 or more distinct top-level directories (or well-known project sub-areas) with no discernible semantic relationship. Count using the first two path components of each changed file. | + +### Soft signals + +| ID | Signal | How to detect | +|---|---|---| +| S1 | **Ticket-style PR title** | Title matches patterns like `[Ticket #N]`, `ts/ticket-\d+`, `sprint-N`, `task-\d+`, or contains a student name followed by a ticket reference. | +| S2 | **Template-only PR body** | Body contains no prose beyond the PR template boilerplate (checked: no description above the first `---`, no non-template `closes:` / `related:` references to the upstream repo). | +| S3 | **No real CI** | `statusCheckRollup` contains only external bots (e.g. Mergeable, WIP, boring-cyborg) and zero entries from the project's own CI workflows. | +| S4 | **Label sprawl** | PR carries 3+ `area:` labels spanning unrelated subsystems, suggesting the author ran an automated labeller or copied labels from multiple separate changes. | +| S5 | **Commit messages reference internal sprint/ticket tooling** | 2+ commit messages contain phrases like `sprint`, `kanban`, `jira`, `ticket #`, `story #`, or course-code patterns like `CSS 566A` (university course identifiers). | + +--- + +## Threshold for early exit + +Run the check after computing which signals fire. Apply the rules below: + +| Condition | Action | +|---|---| +| **2+ hard signals** | Early exit — crystal-clear slop | +| **1 hard signal + 3+ soft signals** | Early exit — crystal-clear slop | +| **1 hard signal, < 3 soft** | Note only — emit `⚠ [suspicious] — <fired signal IDs>` after the scan, proceed with normal review | +| **0 hard signals, any soft** | Note only — emit `⚠ [suspicious] — <fired signal IDs>` if ≥ 2 soft signals, otherwise silent | + +**H3 and H4 are correlated.** Both arise from the same root cause: a +team developed on a shared fork and merged internal PRs before sending +one upstream. When H3 and H4 fire *together* and no other hard signal +fires, treat them as a single hard signal for the "2+ hard signals" +threshold — a H3+H4-only pair meets note-only, not early exit. Early +exit on H3+H4 requires at least one independent additional hard signal +(H1, H2, or H5). + +The `[suspicious]` note-only path does **not** interrupt the review +flow. It surfaces in the headline so the maintainer has the information +but is not forced to act on it before seeing the diff. + +Early exit **does** interrupt the flow: Step 3 and beyond are skipped. +The maintainer chooses an action (see below) before the skill moves on. + +--- + +## Maintainer interaction on early exit + +**Propose** a slop report in place of the normal Step 3 prompt: + +```text +⚠ Slop detection fired for PR #<N> — <title> + https://github.com/<upstream>/pull/<N> + +Hard signals: + [H1] New unrecognised top-level directory: `team_project/` + → team_project/README.md mentions "CSS 566A — Software Management, + University of Washington Bothell" + [H3] Fork merge-commit flood: 6 "Merge pull request" commits from + break-through-19/airflow within a 35-minute window + [H4] Multi-author team project: 3 distinct commit authors + (break-through-19, sanwar47, sharan-s2k) on a single-author PR + [H5] Area sprawl: changes span go-sdk/, airflow-core/ui/, + docs/adr/, team_project/ — no semantic relationship + +Soft signals: + [S1] Ticket-style title: "Poorani ts/ticket 36 adr document review" + [S2] Template-only PR body (no description, private-fork issue ref only) + [S3] No real CI (only Mergeable + WIP bots ran) + [S4] Label sprawl: area:UI + area:task-sdk + area:go-sdk + +This PR shows crystal-clear structural signals of a team class project +or personal experiment being submitted to the upstream repository. Full +line-by-line review is not warranted until these signals are resolved. + +Action? + [C]omment — post a contribution-guidelines warning on the PR + [X] — close PR, lock conversation, show report-to-GitHub link + [R]eview — proceed with full review anyway (e.g. to extract + the legitimate commits from the noise) + [S]kip — skip this PR this session + [Q]uit — end the session +``` + +Wait for explicit input before taking any action. The maintainer may +want to pick multiple actions sequentially (e.g. `[C]` then `[X]`). +If they do, execute in order and confirm before each write. + +--- + +## Action: [C] — post contribution-guidelines warning + +Draft and confirm a PR comment using the template below, then post: + +```bash +gh pr comment <N> --repo <repo> --body "$(cat <<'BODY' +[comment body here — see template] +BODY +)" +``` + +### Warning comment template + +```markdown +Thank you for your interest in Apache <PROJECT>. Unfortunately this PR +cannot be accepted in its current form. + +**Structural issues detected:** + +[List each fired signal as a plain-English sentence. Example:] + +- The `team_project/` directory appears to be a student class project + unrelated to Apache <PROJECT>. +- The PR bundles several independent changes with no shared purpose. +- The PR description does not explain what problem the changes solve + or reference an upstream issue. + +**What to do instead:** + +1. Remove any files that are not genuine upstream contributions. +2. Split the remaining changes into separate, focused PRs — one PR + per logical change. +3. Each PR should include a clear description of the problem it + solves and a reference to the relevant upstream issue (or a + justification if no issue exists). +4. Please read the [contribution guidelines](<contributing-docs-url>) + before opening a new PR. + +We welcome genuine contributions and are happy to help if you have +questions about the process. + +If you believe this assessment is incorrect and your changes are a +genuine upstream contribution, please reply to this comment explaining +the purpose of your PR and a maintainer will take another look. + +--- +*Drafted by an AI-assisted tool; reviewed by @<viewer> before posting.* +``` + +The `<contributing-docs-url>` is the adopter's contributing guide, read +from `<project-config>/project.md → contributing_docs_url`. If not set, +link to the repo's `CONTRIBUTING.md`. + +Substitute `<PROJECT>` with the project name from +`<project-config>/project.md → project_name`. + +After the comment is posted, return to the action menu to allow a +follow-up `[X]` close if the maintainer wants to. + +--- + +## Action: [X] — close, lock, and prompt to report + +**Propose** the sequence of operations, then **confirm** before executing: + +> *About to: close PR #N, lock the conversation (reason: off-topic), +> and show you the report link. Confirm? `[Y]es` / `[N]o`.* + +On confirm, execute in order: + +```bash +# 1. Close the PR +gh pr close <N> --repo <repo> + +# 2. Lock the conversation +# <repo> is owner/name form (e.g. apache/airflow) +gh api --method PUT "repos/<repo>/issues/<N>/lock" \ + --field lock_reason=off-topic +``` + +Then surface the report link (cannot be automated — GitHub does not +expose a report API): + +```text +To report this PR to GitHub (optional — only for genuine spam): + 1. Open: https://github.com/<upstream>/pull/<N> + 2. Click the "…" menu (top-right of the PR header). + 3. Select "Report content". + 4. Choose the appropriate reason. + Note: "Spam or misleading" is for deceptive content, not for + misdirected class projects. Most slop-detected PRs should + simply be closed without a report. +``` + +Note in the session summary that this PR was closed and locked, with +the timestamp and the maintainer's stated reason. + +--- + +## [R] — review anyway + +Proceed with Step 3 as normal. Add a `[slop-signals present]` note +to the session summary so the maintainer can reference which signals +were detected even if they chose not to act on them. + +Use this path when the PR contains a mix of legitimate and illegitimate +changes and the maintainer wants to isolate the legitimate commits +for a cherry-pick or to direct the author to split the PR correctly. + +--- + +## In the session summary + +For each PR that triggered early exit, record: + +- Fired signals (hard + soft, by ID) +- Action taken: `comment` / `close+lock` / `review-anyway` / `skip` +- For `close+lock`: timestamp and whether the maintainer reported to GitHub + +This gives the maintainer an audit trail without requiring them to +remember which PRs they handled as slop. + +--- + +## False-positive calibration + +The threshold is deliberately conservative. A PR that looks suspicious +but doesn't cross the 2-hard-signal or 1-hard-3-soft threshold proceeds +with the normal review. The `[suspicious]` chip in the headline is the +only signal — no interruption, no menu. Review Comment: **Same chip-vs-line mismatch here.** Step 2.5 emits a separate line, not a headline chip. Matching it: ```suggestion with the normal review. The separate `[suspicious]` line emitted after the scan is the only signal (no interruption, no menu). ``` ########## skills/pr-management-code-review/slop-detection.md: ########## @@ -0,0 +1,267 @@ +<!-- SPDX-License-Identifier: Apache-2.0 + https://www.apache.org/licenses/LICENSE-2.0 --> + +# Slop detection — structural scan + +This step runs immediately after Step 2 (diff and metadata fetched), +before the full line-by-line review in Step 3. It is cheap — purely +structural analysis, no LLM token budget — and it short-circuits the +review when a PR is clearly not a genuine upstream contribution. + +"Slop" here means a PR whose structure demonstrates it is a **class +project, personal experiment, or low-effort AI-generated submission** +being pushed into the upstream repository. The goal is to catch +crystal-clear cases early, not to flag every imperfect PR. When in +doubt, proceed with the normal review. + +--- + +## Signals + +Signals are split into **hard** (individually strong) and **soft** +(individually weak; accumulate). Most checks use only data already +in the Step 2 payload. H1 detects new standalone directories from +the cached unified diff (`new file mode` / `--- /dev/null` headers) +and `files[].path` — no `changeType` field, no base-ref tree lookup. +H2 matches full-URL fork references in the PR body (no +issue-resolution API call needed). H3–H5 and S2–S5 are fully +derivable from the Step 2 payload with no extra `gh` calls. S1 uses +the PR title from the Step 1 working-list cache. + +### Hard signals + +Each hard signal alone has a moderate probability of indicating slop; +two or more together are nearly conclusive. + +| ID | Signal | How to detect | +|---|---|---| +| H1 | **New standalone top-level directory** | The cached unified diff contains `+++ b/<dir>/...` entries where every path shares a single first-level directory prefix AND every file in that prefix appears as a new file in the diff (signalled by `new file mode` or `--- /dev/null` headers), AND the directory contains a project-root file at its first level (`README.md`, `pyproject.toml`, `package.json`, `go.mod`, `pom.xml`, etc.), AND the directory name and/or any README within it suggest it is an independent project unrelated to the upstream codebase. Detection uses the cached unified diff and `files[].path` — no `changeType` field, no base-ref tree lookup. | +| H2 | **Private-fork issue URL in PR body** | The body contains a full GitHub issue or PR URL pointing to a repository that is not the upstream repo — pattern: `https://github.com/<author>/<repo-name>/(issues\|pull)/\d+` where `<repo-name>` differs from the upstream repo. Match against the raw body string. Do not attempt to resolve bare `#N` references; only flag explicit fork URLs. | +| H3 | **Fork merge-commit flood** | The commit list contains 3+ commit messages matching `^Merge (pull request|branch) #\d+ from` that all share the same fork prefix and were authored within a narrow window (< 60 minutes apart). | +| H4 | **Multi-author team project** | Commits are authored by 3 or more distinct GitHub logins, yet the PR is opened by a single account — typical of a university team pushing their entire fork history. | Review Comment: **H4 can silently undercount.** `commits[].authors[].login` comes back empty when a commit email isn't linked to a GitHub account, common for the student contributors this targets. Falling back to name/email: ```suggestion | H4 | **Multi-author team project** | Commits are authored by 3 or more distinct contributors, yet the PR is opened by a single account, typical of a university team pushing their entire fork history. Count distinct `commits[].authors[].login`, falling back to author name/email when `login` is empty (unlinked commit emails are common for student contributors). | ``` ########## skills/pr-management-code-review/slop-detection.md: ########## @@ -0,0 +1,267 @@ +<!-- SPDX-License-Identifier: Apache-2.0 + https://www.apache.org/licenses/LICENSE-2.0 --> + +# Slop detection — structural scan + +This step runs immediately after Step 2 (diff and metadata fetched), +before the full line-by-line review in Step 3. It is cheap — purely +structural analysis, no LLM token budget — and it short-circuits the +review when a PR is clearly not a genuine upstream contribution. + +"Slop" here means a PR whose structure demonstrates it is a **class +project, personal experiment, or low-effort AI-generated submission** +being pushed into the upstream repository. The goal is to catch +crystal-clear cases early, not to flag every imperfect PR. When in +doubt, proceed with the normal review. + +--- + +## Signals + +Signals are split into **hard** (individually strong) and **soft** +(individually weak; accumulate). Most checks use only data already +in the Step 2 payload. H1 detects new standalone directories from +the cached unified diff (`new file mode` / `--- /dev/null` headers) +and `files[].path` — no `changeType` field, no base-ref tree lookup. +H2 matches full-URL fork references in the PR body (no +issue-resolution API call needed). H3–H5 and S2–S5 are fully +derivable from the Step 2 payload with no extra `gh` calls. S1 uses +the PR title from the Step 1 working-list cache. + +### Hard signals + +Each hard signal alone has a moderate probability of indicating slop; +two or more together are nearly conclusive. + +| ID | Signal | How to detect | +|---|---|---| +| H1 | **New standalone top-level directory** | The cached unified diff contains `+++ b/<dir>/...` entries where every path shares a single first-level directory prefix AND every file in that prefix appears as a new file in the diff (signalled by `new file mode` or `--- /dev/null` headers), AND the directory contains a project-root file at its first level (`README.md`, `pyproject.toml`, `package.json`, `go.mod`, `pom.xml`, etc.), AND the directory name and/or any README within it suggest it is an independent project unrelated to the upstream codebase. Detection uses the cached unified diff and `files[].path` — no `changeType` field, no base-ref tree lookup. | +| H2 | **Private-fork issue URL in PR body** | The body contains a full GitHub issue or PR URL pointing to a repository that is not the upstream repo — pattern: `https://github.com/<author>/<repo-name>/(issues\|pull)/\d+` where `<repo-name>` differs from the upstream repo. Match against the raw body string. Do not attempt to resolve bare `#N` references; only flag explicit fork URLs. | +| H3 | **Fork merge-commit flood** | The commit list contains 3+ commit messages matching `^Merge (pull request|branch) #\d+ from` that all share the same fork prefix and were authored within a narrow window (< 60 minutes apart). | +| H4 | **Multi-author team project** | Commits are authored by 3 or more distinct GitHub logins, yet the PR is opened by a single account — typical of a university team pushing their entire fork history. | +| H5 | **Area sprawl** | Changed files span 5 or more distinct top-level directories (or well-known project sub-areas) with no discernible semantic relationship. Count using the first two path components of each changed file. | + +### Soft signals + +| ID | Signal | How to detect | +|---|---|---| +| S1 | **Ticket-style PR title** | Title matches patterns like `[Ticket #N]`, `ts/ticket-\d+`, `sprint-N`, `task-\d+`, or contains a student name followed by a ticket reference. | +| S2 | **Template-only PR body** | Body contains no prose beyond the PR template boilerplate (checked: no description above the first `---`, no non-template `closes:` / `related:` references to the upstream repo). | +| S3 | **No real CI** | `statusCheckRollup` contains only external bots (e.g. Mergeable, WIP, boring-cyborg) and zero entries from the project's own CI workflows. | +| S4 | **Label sprawl** | PR carries 3+ `area:` labels spanning unrelated subsystems, suggesting the author ran an automated labeller or copied labels from multiple separate changes. | +| S5 | **Commit messages reference internal sprint/ticket tooling** | 2+ commit messages contain phrases like `sprint`, `kanban`, `jira`, `ticket #`, `story #`, or course-code patterns like `CSS 566A` (university course identifiers). | + +--- + +## Threshold for early exit + +Run the check after computing which signals fire. Apply the rules below: + +| Condition | Action | +|---|---| +| **2+ hard signals** | Early exit — crystal-clear slop | +| **1 hard signal + 3+ soft signals** | Early exit — crystal-clear slop | +| **1 hard signal, < 3 soft** | Note only — emit `⚠ [suspicious] — <fired signal IDs>` after the scan, proceed with normal review | +| **0 hard signals, any soft** | Note only — emit `⚠ [suspicious] — <fired signal IDs>` if ≥ 2 soft signals, otherwise silent | + +**H3 and H4 are correlated.** Both arise from the same root cause: a +team developed on a shared fork and merged internal PRs before sending +one upstream. When H3 and H4 fire *together* and no other hard signal +fires, treat them as a single hard signal for the "2+ hard signals" +threshold — a H3+H4-only pair meets note-only, not early exit. Early +exit on H3+H4 requires at least one independent additional hard signal +(H1, H2, or H5). + +The `[suspicious]` note-only path does **not** interrupt the review +flow. It surfaces in the headline so the maintainer has the information +but is not forced to act on it before seeing the diff. + +Early exit **does** interrupt the flow: Step 3 and beyond are skipped. +The maintainer chooses an action (see below) before the skill moves on. + +--- + +## Maintainer interaction on early exit + +**Propose** a slop report in place of the normal Step 3 prompt: + +```text +⚠ Slop detection fired for PR #<N> — <title> + https://github.com/<upstream>/pull/<N> + +Hard signals: + [H1] New unrecognised top-level directory: `team_project/` + → team_project/README.md mentions "CSS 566A — Software Management, + University of Washington Bothell" + [H3] Fork merge-commit flood: 6 "Merge pull request" commits from + break-through-19/airflow within a 35-minute window + [H4] Multi-author team project: 3 distinct commit authors + (break-through-19, sanwar47, sharan-s2k) on a single-author PR + [H5] Area sprawl: changes span go-sdk/, airflow-core/ui/, + docs/adr/, team_project/ — no semantic relationship + +Soft signals: + [S1] Ticket-style title: "Poorani ts/ticket 36 adr document review" + [S2] Template-only PR body (no description, private-fork issue ref only) + [S3] No real CI (only Mergeable + WIP bots ran) + [S4] Label sprawl: area:UI + area:task-sdk + area:go-sdk + +This PR shows crystal-clear structural signals of a team class project +or personal experiment being submitted to the upstream repository. Full +line-by-line review is not warranted until these signals are resolved. + +Action? + [C]omment — post a contribution-guidelines warning on the PR + [X] — close PR, lock conversation, show report-to-GitHub link + [R]eview — proceed with full review anyway (e.g. to extract + the legitimate commits from the noise) + [S]kip — skip this PR this session + [Q]uit — end the session +``` + +Wait for explicit input before taking any action. The maintainer may +want to pick multiple actions sequentially (e.g. `[C]` then `[X]`). +If they do, execute in order and confirm before each write. + +--- + +## Action: [C] — post contribution-guidelines warning + +Draft and confirm a PR comment using the template below, then post: + +```bash +gh pr comment <N> --repo <repo> --body "$(cat <<'BODY' +[comment body here — see template] +BODY +)" +``` + +### Warning comment template + +```markdown +Thank you for your interest in Apache <PROJECT>. Unfortunately this PR +cannot be accepted in its current form. + +**Structural issues detected:** + +[List each fired signal as a plain-English sentence. Example:] + +- The `team_project/` directory appears to be a student class project + unrelated to Apache <PROJECT>. +- The PR bundles several independent changes with no shared purpose. +- The PR description does not explain what problem the changes solve + or reference an upstream issue. + +**What to do instead:** + +1. Remove any files that are not genuine upstream contributions. +2. Split the remaining changes into separate, focused PRs — one PR + per logical change. +3. Each PR should include a clear description of the problem it + solves and a reference to the relevant upstream issue (or a + justification if no issue exists). +4. Please read the [contribution guidelines](<contributing-docs-url>) + before opening a new PR. + +We welcome genuine contributions and are happy to help if you have +questions about the process. + +If you believe this assessment is incorrect and your changes are a +genuine upstream contribution, please reply to this comment explaining +the purpose of your PR and a maintainer will take another look. + +--- +*Drafted by an AI-assisted tool; reviewed by @<viewer> before posting.* +``` + +The `<contributing-docs-url>` is the adopter's contributing guide, read +from `<project-config>/project.md → contributing_docs_url`. If not set, +link to the repo's `CONTRIBUTING.md`. + +Substitute `<PROJECT>` with the project name from +`<project-config>/project.md → project_name`. + +After the comment is posted, return to the action menu to allow a +follow-up `[X]` close if the maintainer wants to. + +--- + +## Action: [X] — close, lock, and prompt to report Review Comment: **Note (non-blocking):** close+lock now lives both here (`[X]`) and in `pr-management-triage`. The carve-out reads deliberate and fine; just worth a reminder to keep the two in sync (lock reason, templates) if either side changes later. ########## skills/pr-management-code-review/slop-detection.md: ########## @@ -0,0 +1,267 @@ +<!-- SPDX-License-Identifier: Apache-2.0 + https://www.apache.org/licenses/LICENSE-2.0 --> + +# Slop detection — structural scan + +This step runs immediately after Step 2 (diff and metadata fetched), +before the full line-by-line review in Step 3. It is cheap — purely +structural analysis, no LLM token budget — and it short-circuits the +review when a PR is clearly not a genuine upstream contribution. + +"Slop" here means a PR whose structure demonstrates it is a **class +project, personal experiment, or low-effort AI-generated submission** +being pushed into the upstream repository. The goal is to catch +crystal-clear cases early, not to flag every imperfect PR. When in +doubt, proceed with the normal review. + +--- + +## Signals + +Signals are split into **hard** (individually strong) and **soft** +(individually weak; accumulate). Most checks use only data already +in the Step 2 payload. H1 detects new standalone directories from +the cached unified diff (`new file mode` / `--- /dev/null` headers) +and `files[].path` — no `changeType` field, no base-ref tree lookup. +H2 matches full-URL fork references in the PR body (no +issue-resolution API call needed). H3–H5 and S2–S5 are fully +derivable from the Step 2 payload with no extra `gh` calls. S1 uses +the PR title from the Step 1 working-list cache. + +### Hard signals + +Each hard signal alone has a moderate probability of indicating slop; +two or more together are nearly conclusive. + +| ID | Signal | How to detect | +|---|---|---| +| H1 | **New standalone top-level directory** | The cached unified diff contains `+++ b/<dir>/...` entries where every path shares a single first-level directory prefix AND every file in that prefix appears as a new file in the diff (signalled by `new file mode` or `--- /dev/null` headers), AND the directory contains a project-root file at its first level (`README.md`, `pyproject.toml`, `package.json`, `go.mod`, `pom.xml`, etc.), AND the directory name and/or any README within it suggest it is an independent project unrelated to the upstream codebase. Detection uses the cached unified diff and `files[].path` — no `changeType` field, no base-ref tree lookup. | +| H2 | **Private-fork issue URL in PR body** | The body contains a full GitHub issue or PR URL pointing to a repository that is not the upstream repo — pattern: `https://github.com/<author>/<repo-name>/(issues\|pull)/\d+` where `<repo-name>` differs from the upstream repo. Match against the raw body string. Do not attempt to resolve bare `#N` references; only flag explicit fork URLs. | +| H3 | **Fork merge-commit flood** | The commit list contains 3+ commit messages matching `^Merge (pull request|branch) #\d+ from` that all share the same fork prefix and were authored within a narrow window (< 60 minutes apart). | +| H4 | **Multi-author team project** | Commits are authored by 3 or more distinct GitHub logins, yet the PR is opened by a single account — typical of a university team pushing their entire fork history. | +| H5 | **Area sprawl** | Changed files span 5 or more distinct top-level directories (or well-known project sub-areas) with no discernible semantic relationship. Count using the first two path components of each changed file. | + +### Soft signals + +| ID | Signal | How to detect | +|---|---|---| +| S1 | **Ticket-style PR title** | Title matches patterns like `[Ticket #N]`, `ts/ticket-\d+`, `sprint-N`, `task-\d+`, or contains a student name followed by a ticket reference. | +| S2 | **Template-only PR body** | Body contains no prose beyond the PR template boilerplate (checked: no description above the first `---`, no non-template `closes:` / `related:` references to the upstream repo). | +| S3 | **No real CI** | `statusCheckRollup` contains only external bots (e.g. Mergeable, WIP, boring-cyborg) and zero entries from the project's own CI workflows. | +| S4 | **Label sprawl** | PR carries 3+ `area:` labels spanning unrelated subsystems, suggesting the author ran an automated labeller or copied labels from multiple separate changes. | +| S5 | **Commit messages reference internal sprint/ticket tooling** | 2+ commit messages contain phrases like `sprint`, `kanban`, `jira`, `ticket #`, `story #`, or course-code patterns like `CSS 566A` (university course identifiers). | + +--- + +## Threshold for early exit + +Run the check after computing which signals fire. Apply the rules below: + +| Condition | Action | +|---|---| +| **2+ hard signals** | Early exit — crystal-clear slop | +| **1 hard signal + 3+ soft signals** | Early exit — crystal-clear slop | +| **1 hard signal, < 3 soft** | Note only — emit `⚠ [suspicious] — <fired signal IDs>` after the scan, proceed with normal review | +| **0 hard signals, any soft** | Note only — emit `⚠ [suspicious] — <fired signal IDs>` if ≥ 2 soft signals, otherwise silent | + +**H3 and H4 are correlated.** Both arise from the same root cause: a +team developed on a shared fork and merged internal PRs before sending +one upstream. When H3 and H4 fire *together* and no other hard signal +fires, treat them as a single hard signal for the "2+ hard signals" +threshold — a H3+H4-only pair meets note-only, not early exit. Early +exit on H3+H4 requires at least one independent additional hard signal +(H1, H2, or H5). + +The `[suspicious]` note-only path does **not** interrupt the review +flow. It surfaces in the headline so the maintainer has the information +but is not forced to act on it before seeing the diff. Review Comment: **This contradicts `review-flow.md` Step 2.5**, which says "do not attempt to modify the already-displayed Step 1 headline" and emits a separate line. Aligning the prose: ```suggestion The `[suspicious]` note-only path does **not** interrupt the review flow. It is emitted as a separate line immediately after the scan, leaving the already-displayed Step 1 headline untouched, so the maintainer has the information but is not forced to act on it before seeing the diff. ``` ########## skills/pr-management-code-review/slop-detection.md: ########## @@ -0,0 +1,267 @@ +<!-- SPDX-License-Identifier: Apache-2.0 + https://www.apache.org/licenses/LICENSE-2.0 --> + +# Slop detection — structural scan + +This step runs immediately after Step 2 (diff and metadata fetched), +before the full line-by-line review in Step 3. It is cheap — purely +structural analysis, no LLM token budget — and it short-circuits the +review when a PR is clearly not a genuine upstream contribution. + +"Slop" here means a PR whose structure demonstrates it is a **class +project, personal experiment, or low-effort AI-generated submission** +being pushed into the upstream repository. The goal is to catch +crystal-clear cases early, not to flag every imperfect PR. When in +doubt, proceed with the normal review. + +--- + +## Signals + +Signals are split into **hard** (individually strong) and **soft** +(individually weak; accumulate). Most checks use only data already +in the Step 2 payload. H1 detects new standalone directories from +the cached unified diff (`new file mode` / `--- /dev/null` headers) +and `files[].path` — no `changeType` field, no base-ref tree lookup. +H2 matches full-URL fork references in the PR body (no +issue-resolution API call needed). H3–H5 and S2–S5 are fully +derivable from the Step 2 payload with no extra `gh` calls. S1 uses +the PR title from the Step 1 working-list cache. + +### Hard signals + +Each hard signal alone has a moderate probability of indicating slop; +two or more together are nearly conclusive. + +| ID | Signal | How to detect | +|---|---|---| +| H1 | **New standalone top-level directory** | The cached unified diff contains `+++ b/<dir>/...` entries where every path shares a single first-level directory prefix AND every file in that prefix appears as a new file in the diff (signalled by `new file mode` or `--- /dev/null` headers), AND the directory contains a project-root file at its first level (`README.md`, `pyproject.toml`, `package.json`, `go.mod`, `pom.xml`, etc.), AND the directory name and/or any README within it suggest it is an independent project unrelated to the upstream codebase. Detection uses the cached unified diff and `files[].path` — no `changeType` field, no base-ref tree lookup. | +| H2 | **Private-fork issue URL in PR body** | The body contains a full GitHub issue or PR URL pointing to a repository that is not the upstream repo — pattern: `https://github.com/<author>/<repo-name>/(issues\|pull)/\d+` where `<repo-name>` differs from the upstream repo. Match against the raw body string. Do not attempt to resolve bare `#N` references; only flag explicit fork URLs. | Review Comment: **H2 can catch legitimate cross-repo links.** As worded it only checks `<repo-name>` differs from upstream, so a body referencing `https://github.com/apache/airflow/issues/123` from here would match. Matching `<author>` to the PR author keeps it to actual author forks: ```suggestion | H2 | **Private-fork issue URL in PR body** | The body contains a full GitHub issue or PR URL whose `<author>` matches the PR author but whose `<repo-name>` differs from the upstream repo, pattern: `https://github.com/<author>/<repo-name>/(issues\|pull)/\d+`. Matching `<author>` to the PR author avoids flagging legitimate cross-repo links (e.g. a reference to another Apache repo). Match against the raw body string. Do not attempt to resolve bare `#N` references; only flag explicit fork URLs. | ``` ########## skills/pr-management-code-review/slop-detection.md: ########## @@ -0,0 +1,267 @@ +<!-- SPDX-License-Identifier: Apache-2.0 + https://www.apache.org/licenses/LICENSE-2.0 --> + +# Slop detection — structural scan + +This step runs immediately after Step 2 (diff and metadata fetched), +before the full line-by-line review in Step 3. It is cheap — purely +structural analysis, no LLM token budget — and it short-circuits the +review when a PR is clearly not a genuine upstream contribution. + +"Slop" here means a PR whose structure demonstrates it is a **class +project, personal experiment, or low-effort AI-generated submission** +being pushed into the upstream repository. The goal is to catch +crystal-clear cases early, not to flag every imperfect PR. When in +doubt, proceed with the normal review. + +--- + +## Signals + +Signals are split into **hard** (individually strong) and **soft** +(individually weak; accumulate). Most checks use only data already +in the Step 2 payload. H1 detects new standalone directories from +the cached unified diff (`new file mode` / `--- /dev/null` headers) +and `files[].path` — no `changeType` field, no base-ref tree lookup. +H2 matches full-URL fork references in the PR body (no +issue-resolution API call needed). H3–H5 and S2–S5 are fully +derivable from the Step 2 payload with no extra `gh` calls. S1 uses +the PR title from the Step 1 working-list cache. + +### Hard signals + +Each hard signal alone has a moderate probability of indicating slop; +two or more together are nearly conclusive. + +| ID | Signal | How to detect | +|---|---|---| +| H1 | **New standalone top-level directory** | The cached unified diff contains `+++ b/<dir>/...` entries where every path shares a single first-level directory prefix AND every file in that prefix appears as a new file in the diff (signalled by `new file mode` or `--- /dev/null` headers), AND the directory contains a project-root file at its first level (`README.md`, `pyproject.toml`, `package.json`, `go.mod`, `pom.xml`, etc.), AND the directory name and/or any README within it suggest it is an independent project unrelated to the upstream codebase. Detection uses the cached unified diff and `files[].path` — no `changeType` field, no base-ref tree lookup. | +| H2 | **Private-fork issue URL in PR body** | The body contains a full GitHub issue or PR URL pointing to a repository that is not the upstream repo — pattern: `https://github.com/<author>/<repo-name>/(issues\|pull)/\d+` where `<repo-name>` differs from the upstream repo. Match against the raw body string. Do not attempt to resolve bare `#N` references; only flag explicit fork URLs. | +| H3 | **Fork merge-commit flood** | The commit list contains 3+ commit messages matching `^Merge (pull request|branch) #\d+ from` that all share the same fork prefix and were authored within a narrow window (< 60 minutes apart). | +| H4 | **Multi-author team project** | Commits are authored by 3 or more distinct GitHub logins, yet the PR is opened by a single account — typical of a university team pushing their entire fork history. | +| H5 | **Area sprawl** | Changed files span 5 or more distinct top-level directories (or well-known project sub-areas) with no discernible semantic relationship. Count using the first two path components of each changed file. | + +### Soft signals + +| ID | Signal | How to detect | +|---|---|---| +| S1 | **Ticket-style PR title** | Title matches patterns like `[Ticket #N]`, `ts/ticket-\d+`, `sprint-N`, `task-\d+`, or contains a student name followed by a ticket reference. | +| S2 | **Template-only PR body** | Body contains no prose beyond the PR template boilerplate (checked: no description above the first `---`, no non-template `closes:` / `related:` references to the upstream repo). | +| S3 | **No real CI** | `statusCheckRollup` contains only external bots (e.g. Mergeable, WIP, boring-cyborg) and zero entries from the project's own CI workflows. | +| S4 | **Label sprawl** | PR carries 3+ `area:` labels spanning unrelated subsystems, suggesting the author ran an automated labeller or copied labels from multiple separate changes. | +| S5 | **Commit messages reference internal sprint/ticket tooling** | 2+ commit messages contain phrases like `sprint`, `kanban`, `jira`, `ticket #`, `story #`, or course-code patterns like `CSS 566A` (university course identifiers). | + +--- + +## Threshold for early exit + +Run the check after computing which signals fire. Apply the rules below: + +| Condition | Action | +|---|---| +| **2+ hard signals** | Early exit — crystal-clear slop | +| **1 hard signal + 3+ soft signals** | Early exit — crystal-clear slop | +| **1 hard signal, < 3 soft** | Note only — emit `⚠ [suspicious] — <fired signal IDs>` after the scan, proceed with normal review | +| **0 hard signals, any soft** | Note only — emit `⚠ [suspicious] — <fired signal IDs>` if ≥ 2 soft signals, otherwise silent | + +**H3 and H4 are correlated.** Both arise from the same root cause: a +team developed on a shared fork and merged internal PRs before sending +one upstream. When H3 and H4 fire *together* and no other hard signal +fires, treat them as a single hard signal for the "2+ hard signals" +threshold — a H3+H4-only pair meets note-only, not early exit. Early +exit on H3+H4 requires at least one independent additional hard signal +(H1, H2, or H5). + +The `[suspicious]` note-only path does **not** interrupt the review +flow. It surfaces in the headline so the maintainer has the information +but is not forced to act on it before seeing the diff. + +Early exit **does** interrupt the flow: Step 3 and beyond are skipped. +The maintainer chooses an action (see below) before the skill moves on. + +--- + +## Maintainer interaction on early exit + +**Propose** a slop report in place of the normal Step 3 prompt: + +```text +⚠ Slop detection fired for PR #<N> — <title> + https://github.com/<upstream>/pull/<N> + +Hard signals: + [H1] New unrecognised top-level directory: `team_project/` + → team_project/README.md mentions "CSS 566A — Software Management, + University of Washington Bothell" + [H3] Fork merge-commit flood: 6 "Merge pull request" commits from + break-through-19/airflow within a 35-minute window + [H4] Multi-author team project: 3 distinct commit authors + (break-through-19, sanwar47, sharan-s2k) on a single-author PR + [H5] Area sprawl: changes span go-sdk/, airflow-core/ui/, + docs/adr/, team_project/ — no semantic relationship + +Soft signals: + [S1] Ticket-style title: "Poorani ts/ticket 36 adr document review" + [S2] Template-only PR body (no description, private-fork issue ref only) + [S3] No real CI (only Mergeable + WIP bots ran) + [S4] Label sprawl: area:UI + area:task-sdk + area:go-sdk + +This PR shows crystal-clear structural signals of a team class project +or personal experiment being submitted to the upstream repository. Full +line-by-line review is not warranted until these signals are resolved. + +Action? + [C]omment — post a contribution-guidelines warning on the PR + [X] — close PR, lock conversation, show report-to-GitHub link + [R]eview — proceed with full review anyway (e.g. to extract + the legitimate commits from the noise) + [S]kip — skip this PR this session + [Q]uit — end the session +``` + +Wait for explicit input before taking any action. The maintainer may +want to pick multiple actions sequentially (e.g. `[C]` then `[X]`). +If they do, execute in order and confirm before each write. + +--- + +## Action: [C] — post contribution-guidelines warning + +Draft and confirm a PR comment using the template below, then post: + +```bash +gh pr comment <N> --repo <repo> --body "$(cat <<'BODY' +[comment body here — see template] +BODY +)" +``` + +### Warning comment template + +```markdown +Thank you for your interest in Apache <PROJECT>. Unfortunately this PR +cannot be accepted in its current form. + +**Structural issues detected:** + +[List each fired signal as a plain-English sentence. Example:] + +- The `team_project/` directory appears to be a student class project + unrelated to Apache <PROJECT>. +- The PR bundles several independent changes with no shared purpose. +- The PR description does not explain what problem the changes solve + or reference an upstream issue. + +**What to do instead:** + +1. Remove any files that are not genuine upstream contributions. +2. Split the remaining changes into separate, focused PRs — one PR + per logical change. +3. Each PR should include a clear description of the problem it + solves and a reference to the relevant upstream issue (or a + justification if no issue exists). +4. Please read the [contribution guidelines](<contributing-docs-url>) + before opening a new PR. + +We welcome genuine contributions and are happy to help if you have +questions about the process. + +If you believe this assessment is incorrect and your changes are a +genuine upstream contribution, please reply to this comment explaining +the purpose of your PR and a maintainer will take another look. + +--- +*Drafted by an AI-assisted tool; reviewed by @<viewer> before posting.* Review Comment: **One-off attribution footer.** Golden rule 5 / `posting.md` define the canonical `<ai_attribution_footer>` block; this is a separate string that can drift. Reuse the shared one: ```suggestion Use the canonical `<ai_attribution_footer>` block from posting.md (Golden rule 5) here, rather than a one-off string. ``` ########## skills/pr-management-code-review/slop-detection.md: ########## @@ -0,0 +1,267 @@ +<!-- SPDX-License-Identifier: Apache-2.0 + https://www.apache.org/licenses/LICENSE-2.0 --> + +# Slop detection — structural scan + +This step runs immediately after Step 2 (diff and metadata fetched), +before the full line-by-line review in Step 3. It is cheap — purely +structural analysis, no LLM token budget — and it short-circuits the +review when a PR is clearly not a genuine upstream contribution. + +"Slop" here means a PR whose structure demonstrates it is a **class +project, personal experiment, or low-effort AI-generated submission** +being pushed into the upstream repository. The goal is to catch +crystal-clear cases early, not to flag every imperfect PR. When in +doubt, proceed with the normal review. + +--- + +## Signals + +Signals are split into **hard** (individually strong) and **soft** +(individually weak; accumulate). Most checks use only data already +in the Step 2 payload. H1 detects new standalone directories from +the cached unified diff (`new file mode` / `--- /dev/null` headers) +and `files[].path` — no `changeType` field, no base-ref tree lookup. +H2 matches full-URL fork references in the PR body (no +issue-resolution API call needed). H3–H5 and S2–S5 are fully +derivable from the Step 2 payload with no extra `gh` calls. S1 uses +the PR title from the Step 1 working-list cache. + +### Hard signals + +Each hard signal alone has a moderate probability of indicating slop; +two or more together are nearly conclusive. + +| ID | Signal | How to detect | +|---|---|---| +| H1 | **New standalone top-level directory** | The cached unified diff contains `+++ b/<dir>/...` entries where every path shares a single first-level directory prefix AND every file in that prefix appears as a new file in the diff (signalled by `new file mode` or `--- /dev/null` headers), AND the directory contains a project-root file at its first level (`README.md`, `pyproject.toml`, `package.json`, `go.mod`, `pom.xml`, etc.), AND the directory name and/or any README within it suggest it is an independent project unrelated to the upstream codebase. Detection uses the cached unified diff and `files[].path` — no `changeType` field, no base-ref tree lookup. | +| H2 | **Private-fork issue URL in PR body** | The body contains a full GitHub issue or PR URL pointing to a repository that is not the upstream repo — pattern: `https://github.com/<author>/<repo-name>/(issues\|pull)/\d+` where `<repo-name>` differs from the upstream repo. Match against the raw body string. Do not attempt to resolve bare `#N` references; only flag explicit fork URLs. | +| H3 | **Fork merge-commit flood** | The commit list contains 3+ commit messages matching `^Merge (pull request|branch) #\d+ from` that all share the same fork prefix and were authored within a narrow window (< 60 minutes apart). | +| H4 | **Multi-author team project** | Commits are authored by 3 or more distinct GitHub logins, yet the PR is opened by a single account — typical of a university team pushing their entire fork history. | +| H5 | **Area sprawl** | Changed files span 5 or more distinct top-level directories (or well-known project sub-areas) with no discernible semantic relationship. Count using the first two path components of each changed file. | + +### Soft signals + +| ID | Signal | How to detect | +|---|---|---| +| S1 | **Ticket-style PR title** | Title matches patterns like `[Ticket #N]`, `ts/ticket-\d+`, `sprint-N`, `task-\d+`, or contains a student name followed by a ticket reference. | +| S2 | **Template-only PR body** | Body contains no prose beyond the PR template boilerplate (checked: no description above the first `---`, no non-template `closes:` / `related:` references to the upstream repo). | +| S3 | **No real CI** | `statusCheckRollup` contains only external bots (e.g. Mergeable, WIP, boring-cyborg) and zero entries from the project's own CI workflows. | Review Comment: **S3 may fire on exactly the newcomers we don't want to over-correct.** GitHub holds workflows pending approval for first-time contributors, so a genuine newcomer PR often shows zero CI runs only because nobody approved the run. Distinguishing pending from "only bots ran": ```suggestion | S3 | **No real CI** | `statusCheckRollup` contains only external bots (e.g. Mergeable, WIP, boring-cyborg) and zero entries from the project's own CI workflows. Treat an empty or pending rollup (common when GitHub holds workflows awaiting maintainer approval for first-time contributors) as inconclusive, not as a fired signal. | ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
