Robert,

Is this to replace the SPDX matcher we currently have?

If so, can you see how to extend the current matcher with your code?

The current matcher uses regular expressions to extract the patterns.
Since regex matching is expensive, it first checks for the presence of the
SPDX license identifier in the text.  If a match is found, the patterns are
extracted and all of the created matchers are checked.  This occurs the
first time a document is checked for an SPDX license.

The second time a document is checked for an SPDX license the list of
matching licenses is examined and the result returned.

In this way we do the SPDX analysis of the document once and then reference
the results as other license tests are executed.
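
To make the scan-once idea concrete, here is a minimal sketch in Java.  It
is not the actual RAT code: the class name SpdxScanCache, the method names,
and the regular expression are assumptions made up for illustration, and it
assumes one cache instance per document.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch only; these are not RAT's real classes. */
class SpdxScanCache {
    // Plain substring used as the cheap pre-check before any regex runs.
    private static final String TAG = "SPDX-License-Identifier";

    // Assumed pattern: captures a single license id after the tag.
    private static final Pattern ID_PATTERN =
            Pattern.compile("SPDX-License-Identifier:\\s*([A-Za-z0-9.+-]+)");

    // null until the document has been scanned once.
    private Set<String> found;

    /** Returns true if the document declares the given SPDX id. */
    boolean matches(String documentText, String spdxId) {
        if (found == null) {
            // First check for this document: do the full analysis once.
            found = scan(documentText);
        }
        // Later checks only consult the stored result.
        return found.contains(spdxId);
    }

    /** Cheap contains() test first, regex extraction only if the tag is present. */
    static Set<String> scan(String text) {
        Set<String> ids = new LinkedHashSet<>();
        if (!text.contains(TAG)) {
            return ids;
        }
        Matcher m = ID_PATTERN.matcher(text);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }
}
```

Each license test would then call matches() on the same cache instance, so
the regex work is paid for only on the first call.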

The notice detection should be a different module, and we don't have that
design implemented yet.  Basically, RAT does license header checking now;
consider that one module.

Notice checking would be a different module.  It would have a filter that
limits the tests to only the notice files.  It would probably have to run
after the basic RAT scan, so it would need some sort of
accumulate-and-execute strategy: gather the names of the files during the
first pass and then execute the tests afterwards.
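
As a rough sketch of that accumulate-and-execute idea (again in Java, and
again hypothetical: NoticeCheckPhase, its methods, and the rule for what
counts as a notice file are assumptions, not an existing RAT design):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of a two-phase notice check; not an existing RAT API. */
class NoticeCheckPhase {
    // Assumed filter: a file named NOTICE, or NOTICE with an extension.
    private final Predicate<String> filter = name -> {
        String base = name.substring(name.lastIndexOf('/') + 1);
        return base.equals("NOTICE") || base.startsWith("NOTICE.");
    };

    // File names collected during the basic scan.
    private final List<String> pending = new ArrayList<>();

    /** Phase 1: called for every file while the basic scan runs. */
    void accumulate(String fileName) {
        if (filter.test(fileName)) {
            pending.add(fileName);
        }
    }

    /** Phase 2: called once after the scan; returns the files that fail the check. */
    List<String> execute(Predicate<String> noticeCheck) {
        List<String> failures = new ArrayList<>();
        for (String name : pending) {
            if (!noticeCheck.test(name)) {
                failures.add(name);
            }
        }
        return failures;
    }
}
```

The noticeCheck predicate stands in for whatever the real notice test turns
out to be; the point is only that it runs after the header scan completes.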

I think both parts would be excellent advances for RAT.  If this is not
what you have in mind, please restate your idea.



On Sun, Nov 9, 2025 at 8:49 AM Robert Stupp <[email protected]> wrote:

> Hi,
>
> As far as I understand from RAT-331 [1], RAT shall be able to detect
> "all" licenses via both license-headers and full license texts. The
> "umbrella" RAT-460 [2] seems related.
>
> Background: A couple of months ago, I started an effort to assist
> projects to get the LICENSE/NOTICE texts "for all the use cases"
> (module jars, source tarball, binary distribution bundles, containers,
> initially for the Java/Maven ecosystem but extensible for other
> ecosystems) automatically generated. That effort required a
> functionality to detect SPDX licenses emitting a list of match-tuples
> of (SPDX license-IDs + SPDX exception IDs) from license text files.
> The SPDX detection works smoothly and quickly. But I could not find a
> way to _correctly_ detect the license+notice of all the dependencies
> (lots of reasons), which is essential to generate those files. I
> suspect that dependencies would have to provide this information via
> SBOMs, but this only seems to work for the LICENSE, not the NOTICE.
>
> The detection code uses the template definitions [3] in the SPDX
> license + exception detail objects, against normalized text (conforms
> to the SPDX license match guidelines [4]).
> Matching works against all currently defined [5] license details (712
> total, 92 of those define a standard header) and exception details
> (81). Attributions (copyright parts of license texts) are yielded as
> well.
> I implemented this detection on my own, because at that time I wanted
> to have something that works against all SPDX definitions, is fast
> (startup/initialization + detection) and allows the addition of
> user-provided license+exception details.
>
> Would this SPDX license text + license-header text detection code be a
> useful addition to the RAT project?
>
> Robert
>
> [1] https://issues.apache.org/jira/browse/RAT-331
> [2] https://issues.apache.org/jira/browse/RAT-460
> [3]
> https://github.com/spdx/license-list-data/blob/d8b92c55d6e67244e00dc8c18a7dc23a4a463b65/json/details/Apache-2.0.json#L5-L6
> [4]
> https://spdx.github.io/spdx-spec/v3.0.1/annexes/license-matching-guidelines-and-templates/
> [5] https://github.com/spdx/license-list-data/tree/main/json
>


-- 
LinkedIn: http://www.linkedin.com/in/claudewarren
