Re: SPDX license and header text matching

Robert Stupp Mon, 10 Nov 2025 22:53:39 -0800

Oh yeah, regex can be extremely slow. My very first attempt was to
convert the whole SPDX templates into j.u.r.Pattern instances, which
worked ... but some threw StackOverflowError due to the sheer
complexity of the regexes. The approach that worked was to first run a
"String.indexOf()" check on the (long enough) text parts, and only
then start using j.u.r.Pattern. That works for both header texts and even
complex license texts - the most complex one I could find was [1]. But
more important are distinctions like "GPL-2.0-only" vs "GPL-2.0-only
WITH Classpath-exception-2.0".
The code checks for 'SPDX-License-Identifier' as well, but only as a
fallback if no license match was found. One reason is that some
(composite) license texts contain multiple 'SPDX-License-Identifier'
markers, which makes the result ambiguous. Another reason why the
'SPDX-License-Identifier' check is a fallback is the extraction of
(C) attribution, which is just best effort.
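
To illustrate the indexOf-then-regex idea, here is a minimal sketch. It is not
the actual detection code: the names PrefilterSketch and Candidate, and the
"Example-A"/"Example-B" IDs in the usage note, are made up for illustration,
and it assumes each template has already been split into literal fragments
plus one regex for the whole text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class PrefilterSketch {

    /** Hypothetical candidate: literal fragments that must all occur, plus the full regex. */
    static final class Candidate {
        final String id;
        final List<String> literals;
        final Pattern pattern;

        Candidate(String id, List<String> literals, String regex) {
            this.id = id;
            this.literals = literals;
            this.pattern = Pattern.compile(regex);
        }
    }

    /** Returns the IDs of all candidates whose literals and regex match the text. */
    static List<String> match(String normalizedText, List<Candidate> candidates) {
        List<String> hits = new ArrayList<>();
        for (Candidate c : candidates) {
            // The regex is only evaluated when the cheap literal check passes.
            if (containsAll(normalizedText, c.literals)
                    && c.pattern.matcher(normalizedText).find()) {
                hits.add(c.id);
            }
        }
        return hits;
    }

    // Cheap rejection: every literal must be present before the regex is tried.
    private static boolean containsAll(String text, List<String> literals) {
        for (String literal : literals) {
            if (text.indexOf(literal) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

With two candidates "Example-A" (literal "permission is hereby granted") and
"Example-B" (literal "redistribution and use"), a text containing only the
first literal never reaches the regex of "Example-B".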


The set of licenses and license-exceptions to check can certainly be
restricted or even extended. Many of the 700+ licenses that SPDX
defines are likely not widely used.
It might also be worth checking licenses in the order of their usage
(as an optimization).

Long story short: I think both approaches do pretty much the same thing:
1. Have a set of licenses to check against
2. Have a function that takes a string and returns a list of licenses.
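
As a rough sketch of point 2, with the 'SPDX-License-Identifier' check as a
fallback: the function below takes the text plus some text matcher and returns
a list of license expressions. DetectorSketch and its method names are
hypothetical, the text matcher is passed in as a plain Function, and the tag
parsing is simplified (it takes everything up to the end of the line, so a
trailing comment terminator would be included).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DetectorSketch {

    // Matches e.g. "SPDX-License-Identifier: Apache-2.0" up to the end of the line.
    private static final Pattern TAG =
            Pattern.compile("SPDX-License-Identifier:[ \\t]*([^\\r\\n]+)");

    /** Fallback: collect the expressions of all SPDX-License-Identifier tags. */
    static List<String> fromIdentifierTags(String text) {
        List<String> ids = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            ids.add(m.group(1).trim());
        }
        return ids;
    }

    /** Text matching first; the identifier tags are consulted only if nothing matched. */
    static List<String> detect(String text, Function<String, List<String>> textMatcher) {
        List<String> byText = textMatcher.apply(text);
        return byText.isEmpty() ? fromIdentifierTags(text) : byText;
    }
}
```

If the text matcher finds nothing in a source file that carries a
"SPDX-License-Identifier: GPL-2.0-only WITH Classpath-exception-2.0" line,
detect() returns that expression. If the matcher does find something, the tags
are ignored.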

Notices are a very different thing IMO and, I totally agree, a separate effort.

Let me take a look at what it could look like in RAT.

[1] https://github.com/google/j2objc/blob/master/LICENSE has:
Apache-2.0, BSD-3-Clause, GPL-2.0-only WITH Classpath-exception-2.0,
APSL-2.0, ICU, NAIST-2003 and some more


On Mon, Nov 10, 2025 at 2:44 PM Claude Warren <[email protected]> wrote:
>
> Robert,
>
> Is this to replace the SPDX matcher we currently have?
>
> If so, can you see how to extend the current matcher with your code?
>
> The current matcher uses regular expressions to extract the patterns.
> Since Regex is expensive it first checks for the presence of the SPDX
> licence identifier in the text.  If a match is found the patterns are
> extracted and all of the created matchers are checked.  This occurs the
> first time a document is checked for an SPDX license.
>
> The second time a document is checked for an SPDX license the list of
> matching licenses is examined and the result returned.
>
> In this way we do the SPDX analysis of the document once and then reference
> the results as other license tests are executed.
>
> The notice detection should be a different module, and we don't have that
> design implemented yet.   Basically RAT does license header checking now.
> Consider that a module.
>
> Notice checking would be a different module.  It would have a filter that
> limits the tests to only the notice files.  It would probably have to run
> after the basic RAT scan so some sort of accumulate and execute strategy to
> gather the names of the files on the first part and then execute the test
> after.
>
> I think both parts would be excellent advances for RAT.  If this is not
> what you have in mind, please restate your idea.
>
>
>
> On Sun, Nov 9, 2025 at 8:49 AM Robert Stupp <[email protected]> wrote:
>
> > Hi,
> >
> > As far as I understand from RAT-331 [1], RAT shall be able to detect
> > "all" licenses via both license-headers and full license texts. The
> > "umbrella" RAT-460 [2] seems related.
> >
> > Background: A couple of months ago, I started an effort to assist
> > projects to get the LICENSE/NOTICE texts "for all the use cases"
> > (module jars, source tarball, binary distribution bundles, containers,
> > initially for the Java/Maven ecosystem but extensible for other
> > ecosystems) automatically generated. That effort required a
> > functionality to detect SPDX licenses emitting a list of match-tuples
> > of (SPDX license-IDs + SPDX exception IDs) from license text files.
> > The SPDX detection works smoothly and quickly. But I could not find a
> > way to _correctly_ detect the license+notice of all the dependencies
> > (lots of reasons), which is essential to generate those files. I
> > suspect that dependencies would have to provide this information via
> > SBOMs, but this only seems to work for the LICENSE, not the NOTICE.
> >
> > The detection code uses the template definitions [3] in the SPDX
> > license + exception detail objects, against normalized text (conforms
> > to the SPDX license match guidelines [4]).
> > Matching works against all currently defined [5] license details (712
> > total, 92 of those define a standard header) and exception details
> > (81). Attributions (copyright parts of license texts) are yielded as
> > well.
> > I implemented this detection on my own, because at that time I wanted
> > to have something that works against all SPDX definitions, is fast
> > (startup/initialization + detection) and allows the addition of
> > user-provided license+exception details.
> >
> > Would this SPDX license text + license-header text detection code be a
> > useful addition to the RAT project?
> >
> > Robert
> >
> > [1] https://issues.apache.org/jira/browse/RAT-331
> > [2] https://issues.apache.org/jira/browse/RAT-460
> > [3]
> > https://github.com/spdx/license-list-data/blob/d8b92c55d6e67244e00dc8c18a7dc23a4a463b65/json/details/Apache-2.0.json#L5-L6
> > [4]
> > https://spdx.github.io/spdx-spec/v3.0.1/annexes/license-matching-guidelines-and-templates/
> > [5] https://github.com/spdx/license-list-data/tree/main/json
> >
>
>
> --
> LinkedIn: http://www.linkedin.com/in/claudewarren
