Hi Tristan
Thanks for your comments. Your main suggestion sounds good, but I don't
follow all of the logic.
For clarity, I'll call my existing approach to license checking an
"external" approach, since it's an external script that isn't part of
BuildStream itself, whereas your proposal would be an "internal"
approach.
The internal approach sounds like an excellent proposal, and as I
understand it you're suggesting that both approaches have valid use
cases, and both approaches should be developed. (With the external
approach presumably being developed first, since the internal approach
is currently blocked.) This makes sense, but if I understand you
correctly you're also saying:
1. The internal approach will implement blacklist processing.
2. ...and therefore the external approach shouldn't.
That's the part that confuses me. I don't understand how the one thing
implies the other.
If only one of the approaches is worth using, then we should only
develop that approach and not develop the other. But if both approaches
are worth having, then we should develop both approaches properly.
There's no reason to deliberately limit the functionality of one
approach by removing an obvious and reasonable feature.
Identifying blacklist violations does seem like an obvious feature for
any license-checking approach. So far, most of the people I've spoken to
about the external script seem to have assumed it will be used this
way: to monitor which licenses apply to the code in their BuildStream
project, and make sure that doesn't include any licenses that ought to
be avoided. That effectively means checking against a blacklist.
The external script produces two main summary outputs: a json output for
machine processing, and an html output for human reading. The main use
case I see for the human-readable output is for users to skim through
it, looking for anything out of place or surprising. This is a perfect
place for blacklist processing: violations could be highlighted at the
top of the page, where they would be visible in one glance.
Likewise, we're suggesting that the script should be used in CI for
projects like freedesktop-sdk. If the script includes blacklist
processing, then it can cause CI pipelines to fail when a violation is
detected; that's a useful feature. Without blacklist processing, the
script would just create output artifacts that people probably won't
remember to look at. I don't see what value that adds to CI.
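To make the CI case concrete, here is a minimal sketch of what blacklist
processing in the external script could look like. It is only an
illustration, not the script's actual code: the function names, the shape
of the `detected` mapping (element name to list of detected license
names), and the choice to require a full regular-expression match are all
assumptions on my part.

```python
import re
import sys


def find_violations(detected, blacklist_patterns):
    """Return {element: [licenses]} for licenses matching a blacklist pattern.

    `detected` maps an element name to the license names found in it
    (hypothetical shape). Each pattern must match the whole license name.
    """
    compiled = [re.compile(pattern) for pattern in blacklist_patterns]
    violations = {}
    for element, licenses in detected.items():
        bad = [
            lic for lic in licenses
            if any(rx.fullmatch(lic) for rx in compiled)
        ]
        if bad:
            violations[element] = bad
    return violations


def main(detected, blacklist_patterns):
    """Report violations on stderr and return a CI-friendly exit status."""
    violations = find_violations(detected, blacklist_patterns)
    for element, licenses in sorted(violations.items()):
        print(f"BLACKLISTED: {element}: {', '.join(licenses)}", file=sys.stderr)
    # Non-zero status is what makes the CI job fail.
    return 1 if violations else 0
```

A CI job would then end with something like `sys.exit(main(detected,
patterns))`, and the same `violations` mapping could be written into the
json output and highlighted at the top of the html page.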
Honestly, I think blacklist processing could be the most valuable
feature of the external script. If we don't include it, then I'm not
sure I understand what the external script is supposed to be for.
Douglas
On 27/08/2020 12:18, Tristan Van Berkom wrote:
Hi,
Forking this thread because I think this needs a wider discussion
outside of the scope of this license checker tool.
Also: Cross posting this to the BuildStream dev list as I think this is
quite relevant there. Here is a link to the freedesktop-sdk thread for
reference:
https://lists.freedesktop.org/archives/freedesktop-sdk/2020-August/000054.html
On Tue, 2020-08-25 at 20:22 +0100, Douglas Winship wrote:
Following on from the previous email, I've put together a basic
license-checker in python and tested it in a CI Pipeline. I'd be very
interested to get feedback on the html and json output.
In particular I'd be interested to get opinions about how to
implement the blacklist: we're planning to design the license checker
with a blacklist option, where users can supply a list of blacklisted
licenses (possibly as regular expressions). If any blacklisted
licenses are detected, these would be reported in the html and json
outputs, but I'm not sure what form that ought to take.
First, I think blacklisting of the licenses should be out of scope for
this script, which essentially will scan source code and give us
summary feedback of detected licenses (and as such, provides valuable
input for project maintainers in other stages).
Here is how I would envision a workflow which involves reliable checks
and blacklisting, I will describe this in two sections since I only
recently became aware of the benefits we can gain with SPDX[0].
Traditional approach
~~~~~~~~~~~~~~~~~~~~
Traditionally linux distributions need to audit and consciously
understand what rights they have for every given module they distribute
in binary form, and then make a conscious decision under which license
they distribute those binaries (in the cases where the upstream module
is dual licensed and provides some choice to the distribution).
Binary package based distributions, like those built on rpm or deb
packages, often encode this decision into the package metadata; custom
linux integration tools like buildroot and yocto do the same. E.g. yocto
has the LICENSE[1] variable, which is manually encoded into all of the
recipes in the poky distribution. Users of the poky distribution (who
typically /derive/ poky to create something custom) can then set the
INCOMPATIBLE_LICENSE[2] variable for their distribution, which will
cause build errors if their distribution ever inadvertently tries to
include a module with a license on their decided blacklist.
For a vast portion of open source / free software available in the
wild, this conscious interpretation and decision needs to be made by a
human being.
I would see this implemented in BuildStream in the following way:
* Declare a new "licenses" public data format in the bst public data
  domain[3]
  This is a place where BuildStream project maintainers can record
  the decided license for the module being built, similar to yocto's
  LICENSE variable[1].
  For compatibility across tooling, and consideration of possible
  further automation (see further below), we should probably assert
  that these license annotations be valid SPDX license
  identifiers[4].
* We would add a new Element plugin in BuildStream, and call it
  something like `assertlicense`
  In this element's `config`, it would allow the user to declare
  a blacklist.
  This element could output a manifest of licenses in the artifact,
  or produce no output at all, the important part is that this
  element can be added to the pipeline, depend on some elements,
  and halt the build with an error in the case that invalid
  licenses are detected.
Enhanced approach
~~~~~~~~~~~~~~~~~
From my limited understanding, SPDX now provides a format for upstream
project maintainers to encode machine readable information, including
"license expressions" in an "spdx" file in their module.
This would allow for a (possibly weaker, possibly stronger) trust chain
where the distributor places trust in the upstream module maintainer to
keep the spdx file up to date, if that upstream does maintain one (I
suspect that depending on the use cases, a full license audit will
still be preferred).
This allows us some room to maneuver, and provide automation in the
cases where an upstream provides an spdx file. One downside I can see
from a quick blog read[5]:
"The SPDX specification doesn't specify a file extension or file
naming convention."
If this is true, then we would *still* need project maintainers to at
least annotate their element declarations with a bit of public data
which tells us which file is the SPDX file.
An implementation which seems suitable to me for this, building on top
of the previous "Traditional approach" would look like this:
* Block on the ability to have elements depend on the sources of
  their dependencies in BuildStream, or another solution to the
  same problem.
  As discussed in a recent thread[6], there are already a few
  use cases needing similar capability, including the Bazel
  build plugin which wants to stage many dependency sources
  in one sandbox.
* With the ability to depend on dependency source availability
  at build time, the new `assertlicense` Element plugin could
  have the ability to:
  * Depend on some SPDX parsing tooling, which it could stage
    in the `/` of the sandbox.
  * Stage sources for any of the dependency elements which do
    not already list manually specified licenses in their
    public data.
  * Attempt to scan the code for an spdx file.
  In this way the license assertion could be made based both
  on manually specified licenses (for any modules which do not
  export any SPDX file), and can be automated for modules which
  provide the SPDX file.
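
The fallback order described above (manually specified licenses first,
then an upstream SPDX file) can be sketched in plain Python as follows.
This is not BuildStream plugin code and uses no BuildStream API: the
`public_data` dictionary, its `licenses` and `spdx-file` keys, and both
function names are hypothetical. It also assumes the upstream file uses
the SPDX tag-value format and only reads `PackageLicenseDeclared:` lines;
real tooling would use a proper SPDX parser.

```python
from pathlib import Path

# Tag read from an SPDX tag-value document (assumed format).
DECLARED_TAG = "PackageLicenseDeclared:"


def licenses_from_spdx(path):
    """Collect the declared license expressions from an SPDX tag-value file."""
    found = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line.startswith(DECLARED_TAG):
            found.append(line[len(DECLARED_TAG):].strip())
    return found


def resolve_licenses(public_data, source_dir):
    """Return (licenses, origin) for one element.

    Manually specified licenses win; otherwise fall back to the SPDX file
    named in the element's public data (hypothetical "spdx-file" key),
    looked up relative to the staged source directory.
    """
    manual = public_data.get("licenses")
    if manual:
        return list(manual), "manual"
    spdx_file = public_data.get("spdx-file")
    if spdx_file:
        path = Path(source_dir) / spdx_file
        if path.is_file():
            return licenses_from_spdx(path), "spdx"
    # Neither source of information is available for this element.
    return [], "unknown"
```

An element whose result comes back as "unknown" is the case that would
still need a human decision before any blacklist assertion can be
trusted.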
Summary
~~~~~~~
I think that the license checker script has value on its own, as it
provides some automated feedback for those actors who need to audit the
distribution and understand what it is they are distributing, but by
itself it is ultimately not the most suitable place to add blacklist
assertions.
Any thoughts on the above approaches for general license metadata
checking?
Cheers,
-Tristan
PS: Please note that there is *another* problem related to licenses,
and that is the actual *distribution* of license files themselves,
e.g. it can be desirable to publish the COPYING/LICENSE files found in
upstream modules in the artifact payloads somewhere so that they can be
handed over at the distribution phase. The entire text above does not
address this bit, and I think it is yet another separate problem.
[0]: https://spdx.dev/
[1]: https://www.yoctoproject.org/docs/latest/ref-manual/ref-manual.html#var-LICENSE
[2]: https://www.yoctoproject.org/docs/latest/ref-manual/ref-manual.html#var-INCOMPATIBLE_LICENSE
[3]: https://docs.buildstream.build/master/format_public.html#builtin-public-data
[4]: https://spdx.org/licenses/
[5]: https://github.com/david-a-wheeler/spdx-tutorial
[6]: https://lists.apache.org/thread.html/r3ff35d36e085d1ca51f753707b24ac5e3111b5b53d74807085076033%40%3Cdev.buildstream.apache.org%3E