justinmclean opened a new pull request, #154: URL: https://github.com/apache/airflow-steward/pull/154
The existing Step 2a fuzzy-match runs three structured searches (GHSA IDs, code pointers, subject keywords) against existing trackers. These work well when a report carries explicit technical identifiers, but miss the most common real-world duplicate pattern: the same vulnerability reported twice by different people with no shared identifiers, or the same reporter filing again weeks later with different framing.

This PR adds two checks that run after the three-key search, triggered only when no STRONG (GHSA) match was already found:

- Semantic comparison pass: fetches titles and the first 300 characters of every open tracker in a single gh issue list call, produces a root-cause summary from the incoming report, and compares against the corpus on four axes: component/subsystem, bug class, attack path, and fix shape. Two-axis overlap = MEDIUM; three or four axes = STRONG (same weight as a GHSA collision; routes to security-issue-deduplicate rather than creating a new tracker).
- Reporter-identity check: searches open and recently-closed trackers for the inbound reporter's email local-part. A hit on a related issue counts as MEDIUM even with only one-axis overlap; this is the primary signal for the same-reporter-different-framing case.

The budget guardrail is updated from 5 to 6 gh calls per candidate to account for the new bulk-list and reporter-identity calls, plus up to 3 follow-up full-body reads on the highest-scoring semantic candidates.

Testing: three synthetic test cases verified manually:

- clear duplicate (fires STRONG)
- false-positive trap with same subsystem but different bug class (correctly suppressed)
- same reporter with different framing (fires STRONG on axes; identity check fires as supporting signal)

skill-validate passes with no violations.
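The axis-overlap rule described above can be illustrated with a minimal Python sketch. This is not the skill's implementation: the names RootCause, axis_overlap, local_part and match_strength are hypothetical, the axis labels are invented examples, and normalized string equality stands in for whatever semantic comparison the skill actually performs.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RootCause:
    """Hypothetical root-cause summary, one short label per axis."""
    component: str
    bug_class: str
    attack_path: str
    fix_shape: str


def axis_overlap(a: RootCause, b: RootCause) -> int:
    """Count axes whose labels agree (assumption: case-insensitive equality)."""
    return sum(
        getattr(a, f.name).strip().lower() == getattr(b, f.name).strip().lower()
        for f in fields(RootCause)
    )


def local_part(email: str) -> str:
    """Return the part of an address before the last '@', lowercased."""
    return email.rsplit("@", 1)[0].lower()


def match_strength(incoming: RootCause, tracker: RootCause,
                   same_reporter: bool = False) -> str:
    """Apply the thresholds from the description above.

    3 or 4 axes -> STRONG; 2 axes -> MEDIUM;
    1 axis plus a reporter-identity hit -> MEDIUM; otherwise NONE.
    """
    overlap = axis_overlap(incoming, tracker)
    if overlap >= 3:
        return "STRONG"
    if overlap == 2 or (same_reporter and overlap == 1):
        return "MEDIUM"
    return "NONE"


# Invented example labels, for illustration only.
incoming = RootCause("scheduler", "path traversal",
                     "crafted file name", "sanitize path")
duplicate = RootCause("Scheduler", "Path Traversal",
                      "crafted file name", "sanitize path")
trap = RootCause("scheduler", "sql injection",
                 "crafted filter parameter", "parameterize query")

print(match_strength(incoming, duplicate))                 # all four axes agree
print(match_strength(incoming, trap))                      # same subsystem only
print(match_strength(incoming, trap, same_reporter=True))  # one axis plus reporter hit
```

Keeping the thresholds in one small function makes the false-positive trap easy to reason about: a shared subsystem alone contributes a single axis, which stays below MEDIUM unless the reporter-identity check also hits.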
