asf-tooling commented on issue #878:
URL: 
https://github.com/apache/tooling-trusted-releases/issues/878#issuecomment-4407534127

   <!-- gofannon-issue-triage-bot v2 -->
   
   **Automated triage** — analyzed at `main@751c2146`
   
   **Type:** `new_feature`  •  **Classification:** `actionable`  •  
**Confidence:** `medium`
   **Application domain(s):** `automated_checks`
   
   ### Summary
   The issue requests that the lightweight license header checker support 
specifying file exclusions per archive rather than globally across all source 
archives. Currently, `policy_source_excludes_lightweight` is a project-level 
list applied to all archives. @sbp noted in the discussion that ATR's checker 
correctly only checks source archives for the ASF-specific header, but vendored 
files (like those in maven-mvnd donated through IP clearance) need per-archive 
exclusion capability. The issue is explicitly marked as 'not a priority 
requirement because RAT mode is available'. The existing path construction in 
`_headers_check_core_logic` already includes the artifact basename, so 
archive-specific glob patterns could theoretically work already, but there's no 
explicit UI or data model support for per-archive exclusion mappings.
   
   ### Where this lives in the code today
   
   #### `atr/tasks/checks/license.py` — `headers` (lines 180-213)
   _needs modification_
   This is where ignore_lines are fetched from project policy globally. To 
support per-archive exclusions, this function would need to look up exclusions 
specific to the current artifact.
   
   ```python
   async def headers(args: checks.FunctionArguments) -> results.Results | None:
       """Check that all source files have valid license headers."""
       recorder = await args.recorder(CHECK_VERSION_HEADERS)
       if not (artifact_abs_path := await recorder.abs_path()):
           return None
   
       is_source = await recorder.primary_path_is_source()
       if is_source:
           project = await recorder.project()
           if project.policy_license_check_mode == sql.LicenseCheckMode.RAT:
               return None
   
       archive_dir = await checks.resolve_archive_dir(args)
       if archive_dir is None:
           await recorder.failure(
               "Extracted archive tree is not available",
               {"rel_path": args.primary_rel_path},
           )
           return None
   
       log.info(f"Checking license headers for {artifact_abs_path} (rel: 
{args.primary_rel_path})")
   
       project = await recorder.project()
   
       ignore_lines: list[str] = []
       excludes_source: str
       if is_source:
           ignore_lines = project.policy_source_excludes_lightweight
           excludes_source = "policy" if ignore_lines else "none"
       else:
           excludes_source = "none"
   
       artifact_basename = os.path.basename(str(artifact_abs_path))
       return await _headers_core(recorder, archive_dir, artifact_basename, 
ignore_lines, excludes_source)
   ```
   
   #### `atr/storage/readers/checks.py` — `GeneralPublic.ignores_matcher` 
(lines 88-103)
   _currently does this_
   The existing post-hoc ignore system already supports per-path patterns via 
primary_rel_path_glob and member_rel_path_glob, offering a partial workaround 
for per-archive filtering of results after checks run.
   
   ```python
       async def ignores_matcher(
           self,
           project_key: safe.ProjectKey,
       ) -> Callable[[sql.CheckResult], bool]:
           ignores = await self.__data.check_result_ignore(
               project_key=str(project_key),
           ).all()
   
           def match(cr: sql.CheckResult) -> bool:
               for ignore in ignores:
                   if self.__check_ignore_match(cr, ignore):
                       # log.info(f"Ignoring check result {cr} due to ignore 
{ignore}")
                       return True
               return False
   
           return match
   ```
   
   #### `atr/shared/ignores.py` — `AddIgnoreForm` (lines 44-55)
   _currently does this_
   The post-hoc ignore form already supports primary_rel_path_glob which 
includes the archive filename, offering per-archive result filtering as a 
workaround.
   
   ```python
   class AddIgnoreForm(form.Form):
       variant: ADD = form.value(ADD)
       release_glob: str = form.label("Release pattern", default="")
       revision_number: safe.OptionalRevisionNumber = form.label("Revision 
number (literal)", default="")
       checker_glob: str = form.label("Checker pattern", default="")
       primary_rel_path_glob: str = form.label("Primary rel path pattern", 
default="")
       member_rel_path_glob: str = form.label("Member rel path pattern", 
default="")
       status: form.Enum[IgnoreStatus] = form.label(
           "Status",
           widget=form.Widget.SELECT,
       )
       message_glob: str = form.label("Message pattern", default="")
   ```
   
   ### Where new code would go
   - `atr/models/sql.py` — after policy_source_excludes_lightweight field in 
project model
     A new field (e.g., policy_source_excludes_lightweight_per_archive as a 
JSON dict mapping archive glob patterns to exclusion lists) would be needed in 
the Project model.
   - `atr/tasks/checks/license.py` — after line where ignore_lines is assigned 
from project.policy_source_excludes_lightweight
     Logic to look up per-archive exclusions based on the artifact basename and 
merge them with global exclusions.
   
   ### Proposed approach
   There are two approaches to solve this:
   
   1. **Document existing capability**: The `_headers_check_core_logic` already 
prepends the artifact basename to paths before matching (constructing 
`/archive-name/member/path`). Users could already write patterns like 
`/maven-mvnd-1.0.4-src*/build/publish-on-homebrew.sh` in 
`source_excludes_lightweight` to target specific archives. The issue may partly 
be that this isn't documented or clear in the UI.
   
   2. **Add explicit per-archive exclusions**: Add a new project policy field 
(e.g., `policy_source_excludes_lightweight_per_archive`) that maps archive name 
patterns to lists of path exclusions. In the `headers` function, after fetching 
global `ignore_lines`, also look up any per-archive exclusions matching the 
current `artifact_basename` and merge them. This would require a new field in 
the SQL model, a form to manage it in the release policy UI, and updated cache 
invalidation keys.
   
   Given the issue is explicitly 'not a priority' and the existing mechanism 
already supports archive-specific patterns (since paths include the basename), 
the most pragmatic first step would be documenting the existing capability. If 
that's insufficient, the explicit per-archive field would be the proper feature 
addition.
   
   ### Open questions
   - Does `util.create_path_matcher` support globbing that includes the archive 
basename (e.g., `/maven-mvnd-*-src*/build/*.sh`)? If so, the current system may 
already support per-archive exclusions with documentation.
   - What is the exact structure of the Project model's 
`policy_source_excludes_lightweight` field? Is it a simple list[str] or does it 
already have any structure that could accommodate per-archive mappings?
   - Should the solution be a pre-check exclusion (preventing the check from 
running on certain files) or is the post-hoc ignore system (CheckResultIgnore 
with primary_rel_path_glob) sufficient as a workaround?
   - The issue notes 'not a priority requirement because RAT mode is available' 
- should this be deferred until there's more demand?
   
   ### Files examined
   - `atr/tasks/checks/license.py`
   - `atr/shared/ignores.py`
   - `atr/storage/readers/checks.py`
   - `atr/storage/writers/checks.py`
   - `atr/tasks/checks/rat.py`
   - `atr/get/checks.py`
   - `atr/post/ignores.py`
   - `atr/get/ignores.py`
   
   ### Related issues
   This issue appears related to: #1214.
   
   _Both concern overly strict license/file checks that should be configurable 
or skipped for certain file types_
   
   ---
   *Draft from a triage agent. A human reviewer should validate before merging 
any change. The agent did not run tests or verify diffs apply.*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to