As I think about the overarching architecture I see that RAT does not have a construct that says "this license requires this new check" (e.g. this license requires a check in notices).
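To make the gap concrete, here is a rough Java sketch of what a "this license requires this follow-up check" construct could look like. It is not RAT code: `FollowUpSketch`, `SketchLicense` and `scan` are hypothetical names made up for illustration, and the matcher is a plain substring test rather than anything RAT does today. The only point is that a matched license carries a list of functions, and the scan queues them together with the document name so they run after the header pass has finished.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical sketch only: none of these types exist in RAT.
class FollowUpSketch {

    // Stand-in for a license definition that also lists checks to call on a match.
    static final class SketchLicense {
        final String id;
        final Predicate<String> headerMatcher;           // does the header text match?
        final List<Function<String, String>> followUps;  // document name -> finding

        SketchLicense(String id, Predicate<String> headerMatcher,
                      List<Function<String, String>> followUps) {
            this.id = id;
            this.headerMatcher = headerMatcher;
            this.followUps = followUps;
        }
    }

    // Pass 1 matches headers and queues follow-ups; pass 2 runs them afterwards.
    static List<String> scan(Map<String, String> documents, List<SketchLicense> licenses) {
        List<Supplier<String>> deferred = new ArrayList<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (SketchLicense license : licenses) {
                if (license.headerMatcher.test(doc.getValue())) {
                    String name = doc.getKey();
                    for (Function<String, String> check : license.followUps) {
                        deferred.add(() -> check.apply(name));
                    }
                }
            }
        }
        List<String> findings = new ArrayList<>();
        for (Supplier<String> task : deferred) {
            findings.add(task.get());
        }
        return findings;
    }

    public static void main(String[] args) {
        // Assumed example: an Apache-2.0 match asks for a later look at the NOTICE.
        Function<String, String> noticeCheck = name -> "check NOTICE for " + name;
        SketchLicense apache = new SketchLicense(
                "Apache-2.0",
                text -> text.contains("Apache License, Version 2.0"),
                List.of(noticeCheck));

        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("Foo.java", "Licensed under the Apache License, Version 2.0");
        documents.put("Bar.java", "no header here");

        scan(documents, List.of(apache)).forEach(System.out::println);
    }
}
```

In this sketch only Foo.java matches, so a single follow-up is queued and it runs after every document has been through the header pass.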
What we may need is a way within the license definition to reference another scanner to ensure that it runs after the current scanner. To do something like this we would need to:

- modify the IHeader to include the document name.
- modify the ILicense to include a list of functions to call if the match is made.

On Tue, Nov 11, 2025 at 6:53 AM Robert Stupp <[email protected]> wrote:

> Oh yea, regex can be extremely slow. My very first attempt was to
> convert the whole SPDX templates into j.u.Pattern instances, which
> worked ... but some threw StackOverflowException due to the sheer
> complexity of the regexes. The approach that worked was to use a
> "String.indexOf()" approach on the (long enough) text parts - and only
> then start using j.u.r.Pattern. Works for both header texts and even
> complex license texts - the most complex one I could find was [1]. But
> more important are distinctions like "GPL-2.0-only" vs "GPL-2.0-only
> WITH Classpath-exception-2.0".
> The code has a check for 'SPDX-License-Identifier' as well, but as a
> fallback, if no license match was found. The reason for this is
> because some (composite) license texts contain multiple
> 'SPDX-License-Identifier' markers, which is then ambiguous. Another
> reason why the 'SPDX-License-Identifier' check is a fallback is the
> extraction of (C) attribution, which is just best effort.
>
> The set of licenses and license-exceptions to check can certainly be
> restricted or even extended. Many of the 700+ licenses that SPDX
> defines are likely not widely used.
> It might also be worth (as an optimization) to check licenses in the
> order of their usage.
>
> Long story short: I think both approaches do pretty much the same thing:
> 1. Have a set of licenses to check against
> 2. Have a function that takes a string and returns a list of licenses.
>
> Notices are a very different thing IMO and, I totally agree, a separate
> effort.
>
> Let me take a look at how it could look like in RAT.
>
> [1] https://github.com/google/j2objc/blob/master/LICENSE has:
> Apache-2.0, BSD-3-Clause, GPL-2.0-only WITH Classpath-exception-2.0,
> APSL-2.0, ICU, NAIST-2003 and some more
>
>
> On Mon, Nov 10, 2025 at 2:44 PM Claude Warren <[email protected]> wrote:
> >
> > Robert,
> >
> > Is this to replace the SPDX matcher we currently have?
> >
> > If so, can you see how to extend the current matcher with your code?
> >
> > The current matcher uses regular expressions to extract the patterns.
> > Since Regex is expensive it first checks for the presence of the SPDX
> > licence identifier in the text. If a match is found the patterns are
> > extracted and all of the created matchers are checked. This occurs the
> > first time a document is checked for an SPDX license.
> >
> > The second time a document is checked for an SPDX license the list of
> > matching licenses is examined and the result returned.
> >
> > In this way we do the SPDX analysis of the document once and then
> > reference the results as other license tests are executed.
> >
> > The notice detection should be a different module, and we don't have that
> > design implemented yet. Basically RAT does license header checking now.
> > Consider that a module.
> >
> > Notice checking would be different module. It would have a filter that
> > limits the tests to only the notice files. It would probably have to run
> > after the basic RAT scan so some sort of accumulate and execute strategy
> > to gather the names of the files on the first part and then execute the
> > test after.
> >
> > I think both parts would be excellent advances for RAT. If this is not
> > what you have in mind, please restate your idea.
> >
> >
> >
> > On Sun, Nov 9, 2025 at 8:49 AM Robert Stupp <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > As far as I understand from RAT-331 [1], RAT shall be able to detect
> > > "all" licenses via both license-headers and full license texts. The
> > > "umbrella" RAT-460 [2] seems related.
> > >
> > > Background: A couple of months ago, I started an effort to assist
> > > projects to get the LICENSE/NOTICE texts "for all the use cases"
> > > (module jars, source tarball, binary distribution bundles, containers,
> > > initially for the Java/Maven ecosystem but extensible for other
> > > ecosystems) automatically generated. That effort required a
> > > functionality to detect SPDX licenses emitting a list of match-tuples
> > > of (SPDX license-IDs + SPDX exception IDs) from license text files.
> > > The SPDX detection works smoothly and quickly. But I could not find a
> > > way to _correctly_ detect the license+notice of all the dependencies
> > > (lots of reasons), which is essential to generate those files. I
> > > suspect that dependencies would have to provide this information via
> > > SBOMs, but this only seems to work for the LICENSE, not the NOTICE.
> > >
> > > The detection code uses the template definitions [3] in the SPDX
> > > license + exception detail objects, against normalized text (conforms
> > > to the SPDX license match guidelines [4]).
> > > Matching works against all currently defined [5] license details (712
> > > total, 92 of those define a standard header) and exception details
> > > (81). Attributions (copyright parts of license texts) are yielded as
> > > well.
> > > I implemented this detection on my own, because at that time I wanted
> > > to have something that works against all SPDX definitions, is fast
> > > (startup/initialization + detection) and allows the addition of
> > > user-provided license+exception details.
> > >
> > > Would this SPDX license text + license-header text detection code be a
> > > useful addition to the RAT project?
> > >
> > > Robert
> > >
> > > [1] https://issues.apache.org/jira/browse/RAT-331
> > > [2] https://issues.apache.org/jira/browse/RAT-460
> > > [3] https://github.com/spdx/license-list-data/blob/d8b92c55d6e67244e00dc8c18a7dc23a4a463b65/json/details/Apache-2.0.json#L5-L6
> > > [4] https://spdx.github.io/spdx-spec/v3.0.1/annexes/license-matching-guidelines-and-templates/
> > > [5] https://github.com/spdx/license-list-data/tree/main/json
> > >
> >
> > --
> > LinkedIn: http://www.linkedin.com/in/claudewarren
>

--
LinkedIn: http://www.linkedin.com/in/claudewarren
