I have a proposed change.  See
https://github.com/Claudenw/creadur-rat/pull/6/files
Note that this pull request is the difference between multiple targets and
the change to move to RAT-366 (Move to single match call)

Example output in
https://github.com/Claudenw/creadur-rat/tree/Multiple_license_report/apache-rat/src/site/examples

I reworked the MetaData class and removed all the funky naming.  All we
really needed to capture for a document is what licenses matched and which
of those are approved licenses.

The new rat report (in examples) has a "resource" element for each file
that was checked.  The resource still has a name attribute and I added a
type attribute that specifies the type of file that it is (e.g. archive,
standard, binary).  It has two possible child elements: "license" and
"sample".

The license element has several attributes: approval, family, id, and name.
A license can have a notes child element that contains the notes for the
license.  These are not usually displayed but are included for the
generated files license.

The sample element contains text from the license.  It is only included
when the license type is unknown.

The sample and notes text is enclosed in a CDATA block.
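
As a rough illustration of the structure described above, here is a small
Python sketch that reads a hand-written fragment and lists the licenses per
resource.  The element and attribute names follow the description; the root
element name "rat-report", the attribute values, the paths, and the helper
names are assumptions for illustration, not copied from the example output.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment. Element and attribute names follow the description
# in this mail; the root element name and all values are assumptions.
SAMPLE = """<rat-report>
  <resource name="src/a/Text.txt" type="standard">
    <license approval="true" family="AL" id="AL" name="Apache License Version 2.0"/>
  </resource>
  <resource name="src/a/Source.java" type="standard">
    <license approval="false" family="?????" id="?????" name="Unknown license"/>
    <sample><![CDATA[package elements;]]></sample>
  </resource>
  <resource name="src/a/Image.png" type="binary"/>
</rat-report>"""


def summarize(xml_text):
    """Return {resource name: (type, [(family, approved), ...])}."""
    root = ET.fromstring(xml_text)
    result = {}
    for res in root.iter("resource"):
        licenses = [(lic.get("family"), lic.get("approval") == "true")
                    for lic in res.findall("license")]
        result[res.get("name")] = (res.get("type"), licenses)
    return result


def unapproved(xml_text):
    """Names of resources with at least one license that is not approved."""
    return [name for name, (_, lics) in summarize(xml_text).items()
            if any(not ok for _, ok in lics)]
```

Because every matching license is its own child element, a file with three
matches simply yields three entries in the list for that resource.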

I reworked the standard report.  This is probably a breaking change for
anyone who is parsing the text, but then they should be using a custom XSLT
to extract the info they want.

The new report looks like:


*****************************************************
Summary
-------
Generated at: 2024-03-29T15:01:24+01:00

Notes: 2
Binaries: 2
Archives: 1
Standards: 8

Apache Licensed: 5
Generated Documents: 1

JavaDocs are generated, thus a license header is optional.
Generated files do not require license headers.

2 Unknown Licenses

*****************************************************

Files with unapproved licenses:

  src/test/resources/elements/Source.java
  src/test/resources/elements/sub/Empty.txt

*****************************************************

*****************************************************
  Documents with unapproved licenses will start with a '!'
  The next character identifies the document type.

   char         type
    a       Archive file
    b       Binary file
    g       Generated file
    n       Notice file
    s       Standard file
    u       Unknown file.

 s src/test/resources/elements/ILoggerFactory.java
    MIT   The MIT License
 b src/test/resources/elements/Image.png
 n src/test/resources/elements/LICENSE
 n src/test/resources/elements/NOTICE
!s src/test/resources/elements/Source.java
    ????? Unknown license
 s src/test/resources/elements/Text.txt
    AL    Apache License Version 2.0
 s src/test/resources/elements/TextHttps.txt
    AL    Apache License Version 2.0
 s src/test/resources/elements/Xml.xml
    AL    Apache License Version 2.0
 s src/test/resources/elements/buildr.rb
    AL    Apache License Version 2.0
 a src/test/resources/elements/dummy.jar
 g src/test/resources/elements/generated.txt
    GEN   Generated Files
 b src/test/resources/elements/plain.json
 s src/test/resources/elements/tri.txt
    AL    Apache License Version 2.0
    BSD-3 BSD 3 clause
    TMF   The Telemanagement Forum License
!s src/test/resources/elements/sub/Empty.txt
    ????? Unknown license

*****************************************************
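
The detail lines above follow a simple pattern: a '!' or a space, the type
character, the path, and then one indented line per matching license.  A
minimal Python sketch of that layout follows.  The format_entry helper and the
keys of the type table are hypothetical, and this is not the stylesheet that
actually produces the report.

```python
# Type characters copied from the legend in the report; the dictionary keys
# are assumed names for the document types.
TYPE_CHARS = {"archive": "a", "binary": "b", "generated": "g",
              "notice": "n", "standard": "s", "unknown": "u"}


def format_entry(path, doc_type, licenses):
    """Format one document; licenses is a list of (family, name, approved)."""
    # A document is flagged when any of its licenses is not approved.
    flag = "!" if any(not approved for _, _, approved in licenses) else " "
    lines = [f"{flag}{TYPE_CHARS[doc_type]} {path}"]
    for family, name, _ in licenses:
        # The family category is padded to five characters, as in the report.
        lines.append(f"    {family:<5} {name}")
    return "\n".join(lines)
```

For example, format_entry("a/Source.java", "standard",
[("?????", "Unknown license", False)]) gives a "!s" line followed by the
"????? Unknown license" line.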

I think this solves the problem.

Claude

On Thu, Mar 28, 2024 at 10:17 AM Claude Warren <[email protected]> wrote:

> There are a couple of things here that we will need to look at:
>
>    1. Metadata only stores one matching license.
>    2. Can we modify the output XML to list multiple licenses for a file
>    without too much trouble?  I don't think the existing XSLT will
>    have problems with it.
>    3. SPDX [1] has an interesting format where they can report 2 (or
>    more?) licenses in one.  Perhaps we should use their format for license
>    identification.  This would allow us to report the SPDX tags that reference
>    multiple licenses.
>
> Also, every time I look at the LicenseFamily code I wonder why there is a
> limit of 5 on the number of characters in the license family category.  It
> feels like a formatting issue was pushed into the internal code.  Drives me
> crazy.
>
> [1] https://spdx.dev/learn/handling-license-info/
>
> On Thu, Mar 28, 2024 at 10:01 AM P. Ottlinger <[email protected]>
> wrote:
>
>> Hi,
>>
>> Am 28.03.24 um 09:41 schrieb Claude Warren:
>> > I got back to looking at 366 and discovered a problem that I think has been
>> > lurking in the system for some time.  Basically, if a file has the
>> > signatures for more than one license only one will be reported, and the
>> > selection of which one is (I think) random.
>>
>> thanks for analyzing this issue, which explains some random test
>> failures ..... :(
>>
>> <snip>
>>
>> > My suggestion is we report all license matches and let the user decide what
>> > to do.
>>
>> I'm in favour of reporting as many licenses as possible, but assume this
>> will break the current report format, that is optimized for one license
>> only.
>>
>> Not sure if downstream users have problems with that change?!
>>
>> Would we have a maximum license number or could this result in an
>> "endless" list of reported licenses, if a file with "all" thinkable
>> license files is provided to RAT? Initially I thought of adding a new
>> analyzer/reporting state "MULTIPLE" that is reported in the scan and a
>> detailed report that lists up to x (maybe 3 or 5?) maximum licenses per
>> file - WDYT?
>>
>> >
>> > My plan is to create a branch that reports multiple matching licenses and
>> > then merge that into RAT-366 to resolve the problem.  This should give us
>> > all a chance to review the change before it gets added to the already large
>> > RAT-366.
>>
>> +1
>>
>> Thanks for your deep dive into RAT!
>>
>> Cheers,
>> Phil
>>
>
>
> --
> LinkedIn: http://www.linkedin.com/in/claudewarren
>


-- 
LinkedIn: http://www.linkedin.com/in/claudewarren
