Hi Tristan

Thanks for your comments. Your main suggestion sounds good, but I don't follow all of
the logic.

For clarity, I'll call my existing approach to license checking an "external" approach, since it's an external script that isn't part of BuildStream itself, whereas your proposal would be an "internal" approach.

The internal approach sounds like an excellent proposal, and as I understand it you're suggesting that both approaches have valid use cases, and both approaches should be developed. (With the external approach presumably being developed first, since the internal approach is currently blocked.) This makes sense, but if I understand you correctly, you're also saying:

1. The internal approach will implement blacklist processing.
2. ...and therefore the external approach shouldn't.

That's the part that confuses me. I don't understand how the one thing implies the other.

If only one of the approaches is worth using, then we should only develop that approach and not develop the other. But if both approaches are worth having, then we should develop both approaches properly. There's no reason to deliberately limit the functionality of one approach, by removing an obvious and reasonable feature.

Identifying blacklist violations does seem like an obvious feature for any license-checking approach. So far, most of the people I've spoken to about the external script seem to have assumed it will be used this way: to monitor which licenses apply to the code in their BuildStream project, and make sure that doesn't include any licenses that ought to be avoided. That effectively means checking against a blacklist.

The external script produces two main summary outputs: a json output for machine processing, and an html output for human reading. The main use case I see for the human-readable output is for users to skim through it, looking for anything out of place or surprising. This is a perfect place for blacklist processing: violations could be highlighted at the top of the page, where they would be visible in one glance.

Likewise, we're suggesting that the script should be used in CI for projects like freedesktop-sdk. If the script includes blacklist processing, then it can cause CI pipelines to fail when a violation is detected; that's a useful feature. Without blacklist processing, the script would just create output artifacts that people probably won't remember to look at. I don't see what value that adds to CI.
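
A minimal sketch of that CI behaviour, under stated assumptions: the function names `find_violations` and `exit_code`, the element names, and the patterns are all hypothetical and not taken from the existing script. It assumes the script already holds a mapping from element name to the license names it detected, and that blacklist entries are regular expressions.

```python
import re


def find_violations(detected, blacklist_patterns):
    """Return {element: [blacklisted licenses]} for every element that matches."""
    compiled = [re.compile(p) for p in blacklist_patterns]
    violations = {}
    for element, licenses in detected.items():
        # A license is a violation if any blacklist pattern matches it fully.
        hits = [lic for lic in licenses
                if any(rx.fullmatch(lic) for rx in compiled)]
        if hits:
            violations[element] = hits
    return violations


def exit_code(violations):
    # A non-zero exit status is what makes a CI job fail.
    return 1 if violations else 0


# Hypothetical scan results, for illustration only.
detected = {
    "components/foo.bst": ["MIT"],
    "components/bar.bst": ["GPL-3.0-only", "BSD-3-Clause"],
}
violations = find_violations(detected, [r"GPL-3\.0.*", r"AGPL-.*"])
```

The same `violations` mapping could feed both outputs: listed at the top of the html page, and written into the json for other tools.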

Honestly, I think blacklist processing could be the most valuable feature of the external script. If we don't include it, then I'm not sure I understand what the external script is supposed to be for.


Douglas

On 27/08/2020 12:18, Tristan Van Berkom wrote:
Hi,

Forking this thread because I think this needs a wider discussion
outside of the scope of this license checker tool.

Also: Cross posting this to the BuildStream dev list as I think this is
quite relevant there. Here is a link to the freedesktop-sdk thread for
reference:

     
https://lists.freedesktop.org/archives/freedesktop-sdk/2020-August/000054.html

On Tue, 2020-08-25 at 20:22 +0100, Douglas Winship wrote:
Following on from the previous email, I've put together a basic
license-checker in python and tested it in a CI Pipeline. I'd be very
interested to get feedback on the html and json output.

In particular I'd be interested to get opinions about how to
implement the blacklist: we're planning to design the license checker
with a blacklist option, where users can supply a list of blacklisted
licenses (possibly as regular expressions). If any blacklisted
licenses are detected, these would be reported in the html and json
outputs, but I'm not sure what form that ought to take.

First, I think blacklisting of the licenses should be out of scope for
this script, which essentially will scan source code and give us
summary feedback of detected licenses (and as such, provides valuable
input for project maintainers in other stages).


Here is how I would envision a workflow which involves reliable checks
and blacklisting, I will describe this in two sections since I only
recently became aware of the benefits we can gain with SPDX[0].


Traditional approach
~~~~~~~~~~~~~~~~~~~~
Traditionally, Linux distributions need to audit and consciously
understand what rights they have for every given module they distribute
in binary form, and then make a conscious decision under which license
they distribute those binaries (in the cases where the upstream module
is dual licensed and provides some choice to the distribution).

Binary package based distributions, such as those using rpm or deb
packages, often encode this decision into the package metadata; custom
Linux integration tools like buildroot and yocto do the same. E.g. yocto
has the LICENSE[1] variable, which is manually encoded into all of the
recipes in the poky distribution. Users of the poky distribution (who
typically /derive/ poky to create something custom) can then set the
INCOMPATIBLE_LICENSE[2] variable for their distribution, which will
cause build errors if their distribution ever inadvertently tries to
include a module with a license on their decided blacklist.

For a vast portion of open source / free software available in the
wild, this conscious interpretation and decision needs to be made by a
human being.

I would see this implemented in BuildStream in the following way:

   * Declare a new "licenses" public data format in the bst public data
     domain[3]

     This is a place where BuildStream project maintainers can record
     the decided license for the module being built, similar to yocto's
     LICENSE variable[1].

     For compatibility across tooling, and consideration of possible
     further automation (see further below), we should probably assert
     that these license annotations be valid SPDX license
     identifiers[4].

   * We would add a new Element plugin in BuildStream, and call it
     something like `assertlicense`

     In this element's `config`, it would allow the user to declare
     a blacklist.

     This element could output a manifest of licenses in the artifact,
     or produce no output at all, the important part is that this
     element can be added to the pipeline, depend on some elements,
     and halt the build with an error in the case that invalid
     licenses are detected.
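
To make the idea concrete, here is a minimal sketch of the check such an element would perform, written in plain Python. It deliberately does not use the BuildStream plugin API; `LicenseError`, `assert_licenses`, the element names, and the shape of the public data dictionary are assumptions for illustration only.

```python
class LicenseError(Exception):
    """Raised when a dependency declares a blacklisted license."""


def assert_licenses(dependencies, blacklist):
    """Check each dependency's declared licenses against the blacklist.

    `dependencies` maps element names to their public data, assumed here
    to hold a "licenses" list of SPDX identifiers. Returns a manifest of
    element name to licenses, or raises LicenseError to halt the build.
    """
    manifest = {}
    offending = []
    for name, public in sorted(dependencies.items()):
        licenses = public.get("licenses", [])
        manifest[name] = licenses
        for lic in licenses:
            if lic in blacklist:
                offending.append(f"{name}: {lic}")
    if offending:
        raise LicenseError("blacklisted licenses found: " + ", ".join(offending))
    return manifest


# Hypothetical dependency public data, for illustration only.
deps = {
    "libs/a.bst": {"licenses": ["Zlib"]},
    "libs/b.bst": {"licenses": ["GPL-3.0-or-later"]},
}
```

With a blacklist of `{"GPL-3.0-or-later"}` the call raises and the build would stop; with an unrelated blacklist it returns the manifest, which the element could write into its artifact.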


Enhanced approach
~~~~~~~~~~~~~~~~~
From my limited understanding, SPDX now provides a format for upstream
project maintainers to encode machine readable information, including
"license expressions" in an "spdx" file in their module.

This would allow for a (possibly weaker, possibly stronger) trust chain
where the distributor places trust in the upstream module maintainer to
have the spdx file up to date, if that upstream does maintain one (I
suspect that depending on the use cases, a full license audit will
still be preferred).

This allows us some room to maneuver, and provide automation in the
cases where an upstream provides an spdx file. One downside I can see
from a quick blog read[5]:

     "The SPDX specification doesn't specify a file extension or file
      naming convention."

If this is true, then we would *still* need project maintainers to at
least annotate their element declarations with a bit of public data
which tell us what file is the SPDX file.

An implementation which seems suitable to me for this, building on top
of the previous "Traditional approach" would look like this:

   * Block on the ability to have elements depend on the sources of
     their dependencies in BuildStream, or another solution to the
     same problem.

     As discussed in a recent thread[6], there are already a few
     use cases needing similar capability, including the Bazel
     build plugin which wants to stage many dependency sources
     in one sandbox.

   * With the ability to depend on dependency source availability
     at build time, the new `assertlicense` Element plugin could
     have the ability to:

     * Depend on some SPDX parsing tooling, which it could stage
       in the `/` of the sandbox.

     * Stage sources for any of the dependency elements which do
       not already list manually specified licenses in their
       public data.

     * Attempt to scan the code for an spdx file.

     In this way the license assertion could be made based both
     on manually specified licenses (for any modules which do not
     export any SPDX file), and can be automated for modules which
     provide the SPDX file.
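
The fallback order described above could be sketched roughly as follows. This is only an illustration under assumptions: it supposes the upstream SPDX document uses the tag-value format with a `PackageLicenseDeclared` tag, it treats the license expression as an opaque string rather than parsing it, and the function names and the `"licenses"` public data key are hypothetical.

```python
def declared_license_from_spdx(text):
    """Return the value of the first PackageLicenseDeclared tag, or None."""
    for line in text.splitlines():
        tag, sep, value = line.partition(":")
        if sep and tag.strip() == "PackageLicenseDeclared":
            return value.strip()
    return None


def resolve_license(public_data, spdx_text=None):
    """Prefer manually specified licenses; fall back to the SPDX document."""
    manual = public_data.get("licenses")
    if manual:
        return manual, "manual"
    if spdx_text is not None:
        expression = declared_license_from_spdx(spdx_text)
        if expression:
            return [expression], "spdx"
    # Neither source is available: a human still has to decide.
    return [], "unknown"


# Hypothetical SPDX tag-value content, for illustration only.
sample_spdx = """SPDXVersion: SPDX-2.2
PackageName: example
PackageLicenseDeclared: MIT OR Apache-2.0
"""
```

Returning the origin ("manual", "spdx" or "unknown") alongside the licenses would let the element report which assertions rest on an audit and which rest on trust in the upstream file.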


Summary
~~~~~~~
I think that the license checker script has value on its own, as it
provides some automated feedback for those actors who need to audit the
distribution and understand what it is they are distributing, but by
itself it is ultimately not the suitable place to add blacklist
assertions.

Any thoughts on the above approaches for general license metadata
checking?


Cheers,
     -Tristan


PS: Please note that there is *another* problem related to licenses,
and that is the actual *distribution* of license files themselves,
e.g. it can be desirable to publish the COPYING/LICENSE files found in
upstream modules in the artifact payloads somewhere so that they can be
handed over at the distribution phase - the entire text above does not
address this bit, and I think it is yet another separate problem.


[0]: https://spdx.dev/
[1]: 
https://www.yoctoproject.org/docs/latest/ref-manual/ref-manual.html#var-LICENSE
[2]: 
https://www.yoctoproject.org/docs/latest/ref-manual/ref-manual.html#var-INCOMPATIBLE_LICENSE
[3]: 
https://docs.buildstream.build/master/format_public.html#builtin-public-data
[4]: https://spdx.org/licenses/
[5]: https://github.com/david-a-wheeler/spdx-tutorial
[6]: 
https://lists.apache.org/thread.html/r3ff35d36e085d1ca51f753707b24ac5e3111b5b53d74807085076033%40%3Cdev.buildstream.apache.org%3E


