On Tue, Jun 25, 2024 at 12:41 PM Mark Hatle
<[email protected]> wrote:
>
> Comments inline below
>
> On 6/24/24 2:10 PM, Joshua Watt wrote:
-- snip --
> > +
> > +SPDX_BUILD_HOST[doc] = "The base variable name to describe the build host on \
> > +    which a build is running. Must be an SPDX_IMPORTS key"
>
> Is there any sort of documentation or external reference for the variable above
> (as well as the SPDX_ below) that explains what the SPDX standard is expecting
> to be put in there?

Not specifically for this, but for the SPDX 3.0 spec in general, the
web docs are pretty comprehensive:
https://spdx.github.io/spdx-spec/v3.0/ . Be aware that the
navigation sidebar is really annoying ATM, though that's supposed to get
fixed soon.

For a starter on how SPDX documents are written, see:
https://github.com/spdx/spdx-spec/blob/development/v3.0.1/docs/annexes/getting-started.md

It's a little tricky to encode the SPDX 3 structured data in bitbake
variables; this is what I could come up with so far but if you have
suggestions on improvements, let me know.

Specifically for this variable, it's referencing a "key" in
SPDX_IMPORTS. SPDX_IMPORTS in turn is encoding entries in the
"imports" property of
https://spdx.github.io/spdx-spec/v3.0/model/Core/Classes/SpdxDocument/
. The indirection is necessary so that you can tell users where the
SPDX ID lives, since it's external to this document. It's pretty much
impossible for us to validate that the SPDX ID you put in is real,
short of downloading the referenced document, parsing it, and seeing
if the SPDX ID is present.

For a walk through of how to cross-link SPDX 3 documents, look at:
https://github.com/spdx/spdx-spec/blob/development/v3.0.1/docs/annexes/cross-reference.md
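
To make the indirection above a bit more concrete, here is a minimal
Python sketch of what one import entry could look like once serialized.
It assumes the ExternalMap class from the SPDX 3.0 model (with its
externalSpdxId and locationHint properties); the SPDX ID, the URL and
the make_import_entry() helper are all hypothetical, and this is not how
the patch encodes the data in bitbake variables.

```python
import json

# Hypothetical values for illustration only; neither comes from the patch.
HOST_SPDXID = "http://example.com/spdx/build-hosts#host-1"
HOST_DOCUMENT_URL = "http://example.com/spdx/build-hosts.spdx.json"


def make_import_entry(external_spdxid, location_hint):
    """Return one ExternalMap entry for an SpdxDocument's import list.

    externalSpdxId names an element defined in some other document;
    locationHint tells a consumer where that document might be found.
    """
    return {
        "type": "ExternalMap",
        "externalSpdxId": external_spdxid,
        "locationHint": location_hint,
    }


entry = make_import_entry(HOST_SPDXID, HOST_DOCUMENT_URL)
print(json.dumps(entry, indent=2))
```

Nothing in the entry proves that the SPDX ID really exists in the
referenced document, which is the validation gap described above.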

>
> I.e. machine name, host type, an arbitrary string that means something to the
> agent constructing the SPDX, etc.  I.e. how would I know what is valid in these
> various things?

There are pretty good tools to validate SPDX documents (offline
even!).
https://github.com/spdx/spdx-3-model/blob/main/serialization/json_ld/validation.md
gives an overview on how to do this. It still won't check that an
external SPDX ID is valid, for the same reasons as above, but
it's pretty good otherwise.

>
> This then leads to a second question, deterministic behavior/reproducibility.  I
> believe the purpose of this is reproducible builds, but we should have a more
> deterministic approach in the Yocto Project where we provide (and/or check for
> host capabilities) to help allow this to be a more generic, many different host
> process.

I've attempted to make this process as deterministic as SPDX 3 allows.
As an example, SPDX IDs are generated by hashing deterministic data
(whereas random SPDX IDs would be decidedly simpler!). However, there
are parts of SPDX 3 that are simply not deterministic for various
reasons (please read to the end of this section). They generally fall
under a few categories:

The first category is probably best classified as "not very useful if
deterministic" and omitted by default; SPDX_BUILD_HOST would be one
such example, since if you are going to set this to the same value
all the time and not reference the _actual_ host, you may as well not
include it at all. For these, I've set no default value (there isn't
one that would make sense anyway), so their omission keeps things
deterministic. However, if users do actually want this, it will
necessarily result in non-deterministic builds unless they always do
their builds on the same host. I think it might be helpful to annotate
in the doc string which variables will introduce such non-determinism,
so I'll do that.

The second category is also "not very useful if deterministic", but is
included in the output by this patch. Examples of this would be build
timestamps and the bitbake parent build tracking (which basically
tracks the invocation of bitbake itself as the "parent" build, so you
can tell which tasks ran in the same invocation). These are useful
pieces of information, and consumers do actually care about these
things, so if push comes to shove we could add a flag to enable them,
but I'm also leery of having too many configuration options for SPDX.

The last category is the required non-deterministic fields in SPDX 3.
The primary offender here is the SPDX creation info "created"
datestamp: 
https://spdx.github.io/spdx-spec/v3.0/model/Core/Classes/CreationInfo/
. This is a mandatory field that is the timestamp of when the SPDX
data itself was created, and every SPDX object you create links to
this so you can track exactly when each object in a merged document
was created. I did attempt to make the argument to SPDX that it was
mandatory non-determinism, but it is very important to the SPDX
community (for reasons I've not fully understood), and they _really_
want it to be the actual document creation date, not SOURCE_DATE_EPOCH
or similar, so I really am not sure what to do about that one. I was
more or less told to ignore these fields when calculating if output is
"deterministic", which is a little annoying, but not an argument I
could win.
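
For what "ignore these fields" could look like in practice, here is a
minimal Python sketch that compares two parsed SPDX JSON documents while
skipping every "created" key. The function names and the two tiny sample
documents are hypothetical, and it assumes lists come out in the same
order in both runs; it is not part of the patch or of any SPDX tool.

```python
def strip_created(node):
    """Return a copy of node with every "created" key removed, recursively."""
    if isinstance(node, dict):
        return {k: strip_created(v) for k, v in node.items() if k != "created"}
    if isinstance(node, list):
        return [strip_created(v) for v in node]
    return node


def same_ignoring_created(doc_a, doc_b):
    """Compare two parsed documents, ignoring "created" timestamps."""
    return strip_created(doc_a) == strip_created(doc_b)


# Two hypothetical runs that differ only in the creation timestamp.
run_1 = {"@graph": [{"type": "CreationInfo",
                     "specVersion": "3.0.0",
                     "created": "2024-06-24T20:10:00Z"}]}
run_2 = {"@graph": [{"type": "CreationInfo",
                     "specVersion": "3.0.0",
                     "created": "2024-06-25T18:41:00Z"}]}

print(run_1 == run_2)
print(same_ignoring_created(run_1, run_2))
```

The raw comparison reports a difference while the filtered one does not,
which matches the "deterministic apart from the timestamp" reading.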

I'm a little stuck between the SPDX side and the Yocto side on the
determinism front. It's easy to say "SPDX must be deterministic", and
"it's awful if non-deterministic" if you're looking at it from just the
point of view of wanting to spit out some data, but equally, it's
easy to say "determinism doesn't really matter" and "the
non-deterministic data is important information" when you are
consuming the data. This particular patch series errs on the side of
making the data the most useful to the end consumers, in part because
I really want Yocto to generate the most comprehensive and useful
output it can; we are a pretty early adopter of SPDX 3, so providing
the most useful data we can early on means consumers
writing downstream tools can use us as a reference, which drastically
improves their compatibility with our output. Yocto has a phenomenal
supply chain story to tell, and I really want to tell it to the
fullest extent that we can in our SPDX data. I can't reconcile that
with "everything must be deterministic" though, so..... ?
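
On the deterministic side, the hashed SPDX IDs mentioned earlier in this
reply can be sketched roughly as follows. This is only an illustration
of the general idea in Python: the namespace, the choice of inputs and
both helper names are hypothetical, and the inputs the patch really
hashes are not described here.

```python
import hashlib
import uuid


def deterministic_spdxid(namespace, *parts):
    """Derive an ID from stable inputs, so identical inputs give identical IDs."""
    h = hashlib.sha256()
    for part in parts:
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") differ.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return f"{namespace}/{h.hexdigest()[:32]}"


def random_spdxid(namespace):
    """The simpler alternative: a fresh random ID on every call."""
    return f"{namespace}/{uuid.uuid4().hex}"


NAMESPACE = "http://example.com/spdx"  # hypothetical namespace

id_a = deterministic_spdxid(NAMESPACE, "busybox", "1.36.1", "package")
id_b = deterministic_spdxid(NAMESPACE, "busybox", "1.36.1", "package")
print(id_a == id_b)
print(random_spdxid(NAMESPACE) == random_spdxid(NAMESPACE))
```

The random variant needs no thought about which inputs are stable, which
is why it would be simpler, but it changes on every run.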

>
> Which leads to a third question, when a build uses sstate-cache, each host could
> be different than the build host that actually combines the builds into an image
> (SBOM).  Is this a concern?

That is very much on purpose so you can track where your sstate came
from, who built it, etc., and it is also why we don't set any of these by
default (see above). This is pretty important from a supply chain
tracking perspective, and part of the comprehensive story we can tell
about the supply chain (IMHO anyway).

Currently, we are only tracking the actual do_create_spdx() task as
the "build" (which isn't clear in the generated SPDX, but I'll make
a change to fix that), so you have to be a little bit careful about some
of the conclusions you draw from that. Once we get the base SPDX 3
support in place, I want to look at having other sstate tasks generate
SPDX fragments when they run, which would allow us to trace those
tasks more precisely. I don't really want to solve that now though as
it is going to be quite a bit more complex to solve in a satisfactory
manner.


>
-- snip --