As I think about the overarching architecture, I see that RAT does not have
a construct that says "this license requires this additional check" (e.g.
this license requires a check of the notices).

What we may need is a way, within the license definition, to reference
another scanner and ensure that it runs after the current scanner. To do
something like this we would need to:

   - modify the IHeader to include the document name.
   - modify the ILicense to include a list of functions to call if the
     match is made.
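
A minimal sketch of the two changes above, to make the idea concrete. All
names here (HeaderSketch, LicenseSketch, DeferredScanSketch, onMatch,
runDeferred) are hypothetical stand-ins and are not the existing RAT API;
the matcher is a plain substring test purely for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for IHeader: the header text plus the document name. */
final class HeaderSketch {
    final String documentName;
    final String text;

    HeaderSketch(String documentName, String text) {
        this.documentName = documentName;
        this.text = text;
    }
}

/** Hypothetical stand-in for ILicense: a matcher plus follow-up functions. */
final class LicenseSketch {
    final String id;
    private final String marker;
    private final List<Consumer<HeaderSketch>> followUps = new ArrayList<>();

    LicenseSketch(String id, String marker) {
        this.id = id;
        this.marker = marker;
    }

    /** Registers a function to call if this license matches. */
    LicenseSketch onMatch(Consumer<HeaderSketch> followUp) {
        followUps.add(followUp);
        return this;
    }

    /** Illustrative matcher only: a substring test, not real license detection. */
    boolean matches(HeaderSketch header) {
        return header.text.contains(marker);
    }

    List<Consumer<HeaderSketch>> followUps() {
        return Collections.unmodifiableList(followUps);
    }
}

/** Queues follow-ups during the scan and runs them only after it finishes. */
final class DeferredScanSketch {
    private final List<Runnable> deferred = new ArrayList<>();

    /** Checks one document; on a match, queues the license's follow-ups. */
    void scan(LicenseSketch license, HeaderSketch header) {
        if (license.matches(header)) {
            for (Consumer<HeaderSketch> followUp : license.followUps()) {
                deferred.add(() -> followUp.accept(header));
            }
        }
    }

    /** Runs all queued follow-ups and returns how many were executed. */
    int runDeferred() {
        int count = deferred.size();
        for (Runnable task : deferred) {
            task.run();
        }
        deferred.clear();
        return count;
    }
}
```

In this sketch a license definition would register its extra check with
onMatch(...), the normal scan would call scan(...) per document, and
runDeferred() would be called once afterwards. Carrying the document name
in the header is what lets the follow-up know which file triggered it.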

On Tue, Nov 11, 2025 at 6:53 AM Robert Stupp <[email protected]> wrote:

> Oh yeah, regex can be extremely slow. My very first attempt was to
> convert the whole SPDX templates into j.u.r.Pattern instances, which
> worked ... but some threw StackOverflowError due to the sheer
> complexity of the regexes. The approach that worked was to use a
> "String.indexOf()" approach on the (long enough) text parts - and only
> then start using j.u.r.Pattern. Works for both header texts and even
> complex license texts - the most complex one I could find was [1]. But
> more important are distinctions like "GPL-2.0-only" vs "GPL-2.0-only
> WITH Classpath-exception-2.0".
> The code has a check for 'SPDX-License-Identifier' as well, but only as
> a fallback when no license match was found. The reason is that some
> (composite) license texts contain multiple 'SPDX-License-Identifier'
> markers, which makes the result ambiguous. Another reason why the
> 'SPDX-License-Identifier' check is a fallback is the extraction of
> (C) attribution, which is just best effort.
>
> The set of licenses and license-exceptions to check can certainly be
> restricted or even extended. Many of the 700+ licenses that SPDX
> defines are likely not widely used.
> It might also be worthwhile (as an optimization) to check licenses in
> order of how commonly they are used.
>
> Long story short: I think both approaches do pretty much the same thing:
> 1. Have a set of licenses to check against
> 2. Have a function that takes a string and returns a list of licenses.
>
> Notices are a very different thing IMO and, I totally agree, a separate
> effort.
>
> Let me take a look at what it could look like in RAT.
>
> [1] https://github.com/google/j2objc/blob/master/LICENSE has:
> Apache-2.0, BSD-3-Clause, GPL-2.0-only WITH Classpath-exception-2.0,
> APSL-2.0, ICU, NAIST-2003 and some more
>
>
> On Mon, Nov 10, 2025 at 2:44 PM Claude Warren <[email protected]> wrote:
> >
> > Robert,
> >
> > Is this to replace the SPDX matcher we currently have?
> >
> > If so, can you see how to extend the current matcher with your code?
> >
> > The current matcher uses regular expressions to extract the patterns.
> > Since regex is expensive, it first checks for the presence of the SPDX
> > license identifier in the text.  If a match is found, the patterns are
> > extracted and all of the created matchers are checked.  This occurs
> > the first time a document is checked for an SPDX license.
> >
> > The second time a document is checked for an SPDX license the list of
> > matching licenses is examined and the result returned.
> >
> > In this way we do the SPDX analysis of the document once and then
> > reference the results as other license tests are executed.
> >
> > The notice detection should be a different module, and we don't have that
> > design implemented yet.   Basically RAT does license header checking now.
> > Consider that a module.
> >
> > Notice checking would be a different module.  It would have a filter that
> > limits the tests to only the notice files.  It would probably have to run
> > after the basic RAT scan, so it would need some sort of accumulate and
> > execute strategy: gather the names of the files in the first part and
> > then execute the tests afterwards.
> >
> > I think both parts would be excellent advances for RAT.  If this is not
> > what you have in mind, please restate your idea.
> >
> >
> >
> > On Sun, Nov 9, 2025 at 8:49 AM Robert Stupp <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > As far as I understand from RAT-331 [1], RAT shall be able to detect
> > > "all" licenses via both license-headers and full license texts. The
> > > "umbrella" RAT-460 [2] seems related.
> > >
> > > Background: A couple of months ago, I started an effort to assist
> > > projects to get the LICENSE/NOTICE texts "for all the use cases"
> > > (module jars, source tarball, binary distribution bundles, containers,
> > > initially for the Java/Maven ecosystem but extensible for other
> > > ecosystems) automatically generated. That effort required a
> > > functionality to detect SPDX licenses emitting a list of match-tuples
> > > of (SPDX license-IDs + SPDX exception IDs) from license text files.
> > > The SPDX detection works smoothly and quickly. But I could not find a
> > > way to _correctly_ detect the license+notice of all the dependencies
> > > (lots of reasons), which is essential to generate those files. I
> > > suspect that dependencies would have to provide this information via
> > > SBOMs, but this only seems to work for the LICENSE, not the NOTICE.
> > >
> > > The detection code uses the template definitions [3] in the SPDX
> > > license + exception detail objects, against normalized text (conforms
> > > to the SPDX license match guidelines [4]).
> > > Matching works against all currently defined [5] license details (712
> > > total, 92 of those define a standard header) and exception details
> > > (81). Attributions (copyright parts of license texts) are yielded as
> > > well.
> > > I implemented this detection on my own, because at that time I wanted
> > > to have something that works against all SPDX definitions, is fast
> > > (startup/initialization + detection) and allows the addition of
> > > user-provided license+exception details.
> > >
> > > Would this SPDX license text + license-header text detection code be a
> > > useful addition to the RAT project?
> > >
> > > Robert
> > >
> > > [1] https://issues.apache.org/jira/browse/RAT-331
> > > [2] https://issues.apache.org/jira/browse/RAT-460
> > > [3] https://github.com/spdx/license-list-data/blob/d8b92c55d6e67244e00dc8c18a7dc23a4a463b65/json/details/Apache-2.0.json#L5-L6
> > > [4] https://spdx.github.io/spdx-spec/v3.0.1/annexes/license-matching-guidelines-and-templates/
> > > [5] https://github.com/spdx/license-list-data/tree/main/json
> > >
> >
> >
> > --
> > LinkedIn: http://www.linkedin.com/in/claudewarren
>


-- 
LinkedIn: http://www.linkedin.com/in/claudewarren
