On Mon, 2022-09-19 at 16:20 +0200, Carlo Piana wrote:
> thank you for a well-detailed and sensible answer. I certainly cannot
> speak on technical issues, although I can understand there are
> activities which could seriously impact the overall process and need
> to be minimized.
> 
> 
> > On Fri, 2022-09-16 at 17:18 +0200, Alberto Pianon wrote:
> > > Il 2022-09-15 14:16 Richard Purdie wrote:
> > > > 
> > > > For the source issues above it basically it comes down to how much
> > > > "pain" we want to push onto all users for the sake of adding in this
> > > > data. Unfortunately it is data which many won't need or use and
> > > > different legal departments do have different requirements.
> > > 
> > > We didn't paint the overall picture sufficiently well, therefore our
> > > requirements may come across as coming from a particularly pedantic
> > > legal department; my fault :)
> > > 
> > > Oniro is not "yet another commercial Yocto project", and we are not a
> > > legal department (even if we are experienced FLOSS lawyers and auditors,
> > > the most prominent of whom is Carlo Piana -- cc'ed -- former general
> > > counsel of FSFE and member of the OSI Board).
> > > 
> > > Our rather ambitious goal is not limited to Oniro: it consists in doing
> > > compliance the open source way, both setting an example and
> > > providing guidance and material for others to benefit from our effort.
> > > Our work will therefore be shared (and possibly improved by others) not
> > > only with Oniro-based projects but also with any Yocto project. Among
> > > other things, the most relevant bit of work that we want to share is
> > > **fully reviewed license information** and other legal metadata about a
> > > whole bunch of open source components commonly used in Yocto projects.
> > 
> > I certainly love the goal. I presume you're going to share your review
> > criteria somehow? There must be some further set of steps,
> > documentation and results beyond what we're discussing here?
> 
> Our mandate (and our own attitude) is precisely to make everything as
> public as possible.
> 
> We have already published about it:
> https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/docs/-/tree/main/audit_workflow
> 
> The entire review process is handled through GitLab issues and will be
> made public.

I need to read into the details but that looks like a great start and
I'm happy to see the process being documented!

Thanks for the link, I'll try and have a read.

> We have only one reservation, concerning sensitive material: cases
> where we find something legally problematic (to comply with
> attorney/client privilege) or security-critical (in which case we
> adopt a responsible disclosure principle and embargo some details).

That makes sense, it is a tricky balancing act at times.

> > I think the challenge will be whether you can publish that review with
> > sufficient "proof" that other legal departments can leverage it. I
> > wouldn't underestimate how different the requirements and process can
> > be between different people/teams/companies.
> 
> Speaking from a legal perspective, this is precisely the point. It is
> true that we want to create a curated database of decisions, which,
> like any human enterprise, is prone to errors and corrections, so we
> cannot have the last word. However, IF we can at least point to a
> unique artifact and give its exact hash, there will be no need to
> trust us: everything would be open to inspection, because everybody
> else could look at the same source we have identified and verify that
> we have extracted all the information.

I do love the idea and I think it is quite possible. I do think this
does lead to one of the key details we need to think about though.

From a legal perspective I'd imagine you'd like to deal with a set of
files that make up the source of some piece of software. I'm not going
to use the word "package" since I think the term is overloaded and
confusing. That set of files can all be identified by checksums. This
pushes us towards wanting checksums of every file.

Stepping over to the build world, we have bitbake's fetcher and it
actually requires something similar - any given "input" must be
uniquely identifiable from the SRC_URI and possibly a set of SRCREVs.

Why? We firstly need to embed this information into the task signature.
If it changes, we know we need to rerun the fetch and re-obtain the
data. We work on inputs to generate this hash, not outputs, and we
require all fetcher modules to be able to identify sources like this.

In the case of a git repo, the hash of a git commit is good enough. For
a tarball, it would be a checksum of the tarball. Where there are local
patch files, we include the hashes of those files.
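
To make that concrete, here is a rough, made-up recipe fragment showing
the two common cases (the component name, URLs and values are purely
illustrative placeholders, not a real recipe):

    # git: the pinned commit identifies the exact source tree
    SRC_URI = "git://example.org/somelib.git;protocol=https;branch=main \
               file://0001-fix-build.patch"
    SRCREV = "0123456789abcdef"      # placeholder commit id

    # tarball: the archive checksum identifies the exact bytes fetched
    SRC_URI = "https://example.org/somelib-1.2.tar.gz"
    SRC_URI[sha256sum] = "0000000000000000"   # placeholder checksum

Local files like the patch above also have their checksums folded into
the task signature, so a change to any of these inputs changes the hash.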

The bottom line is that we already have a hash which represents the
task inputs. Bugs happen, sure. There are also poor fetchers; npm and
go present challenges in particular, but we've tried to work around
those issues.

What you're saying is that you don't trust what bitbake does, so you
want all the next level of information about the individual files.

In theory we could put the SRC_URI and SRCREVs into the SPDX as the
source (which could be summarised into a task hash) rather than the
upstream URL. It all depends on which level you want to break things
down to.
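
Purely as an illustration, and not a statement about what the current
SPDX output contains (the package name, version and URL below are
invented), such an entry might look something like:

    {
      "name": "somelib",
      "versionInfo": "1.2",
      "downloadLocation": "git+https://example.org/somelib.git@<srcrev>"
    }

As far as I recall the SPDX spec already allows a vcs-style
downloadLocation with an @revision suffix, so the SRCREV could travel
there without inventing new fields.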

I do see a case for needing the lower level info since, in review, you
are going to want to know the delta against the last review decisions.
You'd also prefer a different "upstream" URL form for some kinds of
checks, like CVEs. It does feel a lot like we're trying to duplicate
information and cause significant growth of the SPDX files without an
actual definitive need.

You could equally put in a mapping between a fetch task checksum and
the checksums of all the files that fetch task would expand to if run
(it should always do so deterministically).
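
As a very rough sketch of what collecting that mapping could look like
(nothing here exists today; the task name and output file are invented
and it glosses over corner cases like archives within archives),
something along these lines could sit between do_unpack and do_patch:

    python do_source_manifest() {
        import os, json, hashlib

        srcdir = d.getVar('S')
        manifest = {}
        for root, dirs, files in os.walk(srcdir):
            for name in files:
                path = os.path.join(root, name)
                if not os.path.isfile(path):
                    continue  # skip dangling symlinks and other non-regular entries
                with open(path, 'rb') as f:
                    manifest[os.path.relpath(path, srcdir)] = \
                        hashlib.sha256(f.read()).hexdigest()

        outfile = os.path.join(d.getVar('WORKDIR'), 'source-manifest.json')
        with open(outfile, 'w') as out:
            json.dump(manifest, out, indent=2, sort_keys=True)
    }
    addtask do_source_manifest after do_unpack before do_patch

The interesting part would then be keying that manifest off the fetch
task checksum, so reviews can reference a single hash while the
per-file detail lives outside the SPDX itself.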

> To be clearer, we are not discussing here the obligation to provide
> the entire corresponding source code as with *GPLv3, but rather we
> are seeking to establish the *provenance* of the software, of all
> bits (also in order to see which patch has been applied by whom and
> to close which vulnerability, if any).

My worry is that by not considering the obligation, we don't cater for
a portion of the userbase and, by doing so, limit possible adoption.

> Provenance also has a great impact on "reproducibility" of legal work
> on sources. If we are not able to tell what has gone into our package
> from where (and this may prove hard and require a lot of manual - and
> therefore error-prone - work, especially in the case of complex Yocto
> recipes using e.g. crate/cargo or npm(sw) fetchers), we (lawyers and
> compliance specialists) are at a great disadvantage proving we have
> covered all our bases.

I understand this more than you realise, as we have the same problem in
the bitbake fetcher and have spent a lot of time trying to solve it. I
won't claim we're there for some of the modern runtimes, and I'd love
help both in explaining to the upstream projects why we need this and
in technically fixing the fetchers so these modern runtimes work
better.

> This is a very good point, and I can vouch that this is really
> important, but maybe you are reading too much in here: at this stage,
> our goal is not to convince anyone to radically change Yocto tasks to
> meet our requirements, but it is to share such requirements and their
> rationale, collect your feedback and possibly adjust them, and also
> to figure out the least impactful solution to meet them (possibly
> without radical changes but just by adding optional functions in
> existing tasks).

"optional functions" fill me with dread, this is the archiver problem I
mentioned.

One of the things I try really hard to do is to have one good way of
doing things rather than multiple options with different levels of
functionality. If you give people choices, they use them. When
someone's build fails, I don't want to have to ask "which fetcher were
you using? Did you configure X or Y or Z?". If we can all use the same
code and codepaths, it means we see bugs, we see regressions and we
have a common experience without the need for complex test matrices.

Worst case you can add optional functions, but I kind of see that as a
failure. If we can find something with low overhead which we can all
use, that would be much better. Whether that is possible, I don't know,
but it is why we're having this discussion, and it is why I have a
preference for trying to keep common code paths in the core.

> > > - I understand that my solution is a bit hacky; but IMHO any other
> > >    *post-mortem* solution would be far more hacky; the real solution
> > >    would be collecting required information directly in do_fetch and
> > >    do_unpack
> > 
> > Agreed, this needs to be done at unpack/patch time. Don't underestimate
> > the impact of this on general users though as many won't appreciate
> > slowing down their builds generating this information :/.
> 
> Can't this be made optional, so one could just go for the "old" way
> without impacting much? Sorry I'm stepping where I'm naive.

See above :).

> 
> > 
> > There is also a pile of information some legal departments want which
> > you've not mentioned here, such as build scripts and configuration
> > information. Some previous discussions with other parts of the wider
> > open source community rejected the Yocto Project's efforts as insufficient
> > since we didn't mandate and capture all of this too (the archiver could
> > optionally do some of it iirc). Is this just the first step and we're
> > going to continue dumping more data? Or is this sufficient and all any
> > legal department should need?
> > 
> 
> I think that trying to give all legal departments what they want
> would prove impossible. I think the idea here is more to start
> building a collectively managed database of provenance and licensing
> data, with a curated set of decisions for as many packages as
> possible. This way everybody can have some good clue -- and
> increasingly a better one -- as to which license(s) apply to which
> package, removing much of the guesswork that is required today.

It makes sense and is a worthy goal. I just wish we could key this off
bitbake's fetch task checksum rather than having to dump reams of file
checksums!

> We ourselves reuse a lot of information coming from Debian's machine-
> readable copyright files, sometimes finding mistakes and opening
> issues upstream. That helped us cut down the license information
> harvesting and review work by a great deal.

This does explain why the bitbake fetch mechanism would be a struggle
for you though, as you don't want to use our fetch units as your base
component (which is why we end up struggling with some of these
issues).

In the interests of moving towards a conclusion, I think what we'll end
up needing to do is generate more information from the fetch and patch
tasks, perhaps with a JSON file summarising what they do (filenames and
checksums?). That would give your tools data to work from, even if I'm
not convinced we should be dumping more and more data into the final
SPDX files.
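
For the sake of concreteness, the kind of shape I have in mind for such
a summary (entirely up for discussion; the task name, paths and values
below are invented placeholders):

    {
      "task": "do_unpack",
      "task-hash": "<taskhash placeholder>",
      "files": {
        "somelib-1.2/Makefile": "<sha256 placeholder>",
        "somelib-1.2/src/main.c": "<sha256 placeholder>"
      }
    }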

Cheers,

Richard

