On Tue, Jun 25, 2024 at 12:41 PM Mark Hatle <[email protected]> wrote:
>
> Comments inline below
>
> On 6/24/24 2:10 PM, Joshua Watt wrote:

-- snip --

> > +
> > +SPDX_BUILD_HOST[doc] = "The base variable name to describe the build host on \
> > +    which a build is running. Must be an SPDX_IMPORTS key"
>
> Is there any sort of documentation or external reference for the variable above
> (as well as the SPDX_ below) that explains what the SPDX standard is expecting
> to be put in there?

Not specifically for this, but for the SPDX 3.0 spec in general, the web docs are pretty comprehensive:
https://spdx.github.io/spdx-spec/v3.0/ .
Be aware that the navigation sidebar is really annoying ATM, though that's supposed to get fixed soon. For a starter on how SPDX documents are written, see:
https://github.com/spdx/spdx-spec/blob/development/v3.0.1/docs/annexes/getting-started.md

It's a little tricky to encode the SPDX 3 structured data in bitbake variables; this is what I could come up with so far, but if you have suggestions on improvements, let me know.

Specifically for this variable, it's referencing a "key" in SPDX_IMPORTS. SPDX_IMPORTS in turn encodes entries in the "imports" property of
https://spdx.github.io/spdx-spec/v3.0/model/Core/Classes/SpdxDocument/ .
The indirection is necessary so that you can tell users where the SPDX ID lives, since it's external to this document. It's pretty much impossible for us to validate that the SPDX ID you put in is real, short of downloading the referenced document, parsing it, and seeing if the SPDX ID is present. For a walkthrough of how to cross-link SPDX 3 documents, look at:
https://github.com/spdx/spdx-spec/blob/development/v3.0.1/docs/annexes/cross-reference.md

>
> I.e. machine name, host type, an arbitrary string that means something to the
> agent construction the SPDX, etc.  I.e. how would I know what is valid in these
> various things?

There are pretty good tools to validate SPDX documents (offline even!).
https://github.com/spdx/spdx-3-model/blob/main/serialization/json_ld/validation.md
gives an overview on how to do this. It still won't check that the external SPDX ID is valid, for the same reasons as above, but it's pretty good otherwise.

>
> This then leads to a second question, deterministic behavior/reproducibility.
> I believe the purpose of this is reproducible builds, but we should have a more
> deterministic approach in the Yocto Project where we provide (and/or check for
> host capabilities) to help allow this to be a more generic, many different host
> process.

I've attempted to make this process as deterministic as SPDX 3 allows. As an example, SPDX IDs are generated by hashing deterministic data (whereas random SPDX IDs would be decidedly simpler!). However, there are parts of SPDX 3 that are simply not deterministic for various reasons (please read to the end of this section). They generally fall under a few categories:

The first category is probably best classified as "not very useful if deterministic" and omitted by default; SPDX_BUILD_HOST would be one such example, since if you are going to set this to the same value all the time and not reference the _actual_ host, you may as well not include it at all. For these, I've set no default value (there isn't one that would make sense anyway), so their omission keeps things deterministic. However, if users do actually want this, it will necessarily result in non-deterministic builds unless they always do their builds on the same host. I think it might be helpful to annotate in the doc string which variables will introduce such non-determinism, so I'll do that.

The second category is also "not very useful if deterministic", but these fields are included in the output by this patch. Examples of this would be build timestamps and the bitbake parent build tracking (which basically tracks the invocation of bitbake itself as the "parent" build, so you can tell which tasks ran in the same invocation). These are useful pieces of information, and consumers do actually care about these things, so if push comes to shove we could add a flag to enable them, but I'm also leery of having too many configuration options for SPDX.

The last category is the required non-deterministic fields in SPDX 3.
The primary offender here is the SPDX creation info "created" datestamp:
https://spdx.github.io/spdx-spec/v3.0/model/Core/Classes/CreationInfo/ .
This is a mandatory field that holds the timestamp of when the SPDX data itself was created, and every SPDX object you create links to it so you can track exactly when each object in a merged document was created. I did attempt to make the argument to SPDX that it was mandatory non-determinism, but it is very important to the SPDX community (for reasons I've not fully understood), and they _really_ want it to be the actual document creation date, not SOURCE_DATE_EPOCH or similar, so I really am not sure what to do about that one. I was more or less told to ignore these fields when calculating if output is "deterministic", which is a little annoying, but not an argument I could win.

I'm a little stuck between the SPDX side and the Yocto side on the determinism front. It's easy to say "SPDX must be deterministic" and "it's awful if non-deterministic" if you're looking at it purely from the point of view of wanting to spit out some data, but equally, it's easy to say "determinism doesn't really matter" and "the non-deterministic data is important information" when you are consuming the data. This particular patch series errs on the side of making the data the most useful to the end consumers, in part because I really want Yocto to generate the most comprehensive and useful output it can; we are a pretty early adopter of SPDX 3, so being able to provide the most useful data we can early means consumers writing downstream tools can use us as a reference, which drastically improves their compatibility with our output. Yocto has a phenomenal supply chain story to tell, and I really want to tell it to the fullest extent that we can in our SPDX data. I can't reconcile that with "everything must be deterministic" though, so..... ?
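
To make "ignore these fields when calculating if output is deterministic" concrete, here is a minimal sketch of such a comparison. It is not code from this patch series: the function names are hypothetical, and it assumes the two SPDX 3 JSON-LD outputs have already been parsed and that the only property to skip is "created".

```python
import json

# Property names dropped before comparing. "created" is the CreationInfo
# timestamp; this set is an assumption and can be extended as needed.
IGNORED_KEYS = {"created"}


def strip_ignored(node):
    """Return a copy of a parsed JSON value with the ignored keys removed."""
    if isinstance(node, dict):
        return {
            key: strip_ignored(value)
            for key, value in node.items()
            if key not in IGNORED_KEYS
        }
    if isinstance(node, list):
        return [strip_ignored(item) for item in node]
    return node


def same_ignoring_created(doc_a, doc_b):
    """Compare two parsed documents, ignoring the creation timestamp."""
    return strip_ignored(doc_a) == strip_ignored(doc_b)


if __name__ == "__main__":
    # Two tiny hand-written stand-ins for generated output; real documents
    # would be read from the files produced by two separate builds.
    first = json.loads(
        '{"@graph": [{"type": "CreationInfo",'
        ' "created": "2024-06-25T12:00:00Z", "specVersion": "3.0.0"}]}'
    )
    second = json.loads(
        '{"@graph": [{"type": "CreationInfo",'
        ' "created": "2024-06-26T08:30:00Z", "specVersion": "3.0.0"}]}'
    )
    print(same_ignoring_created(first, second))
```

Note that this compares lists in order, so it only reports a match if the generator also emits its elements in a stable order.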

>
> Which leads to a third question, when a build uses sstate-cache, each host could
> be different then the build host that actually combines the builds into an image
> (SBOM).  Is this a concern?

That is very much on purpose, so you can track where your sstate came from, who built it, etc., and it is also why we don't set any of these by default (see above). This is pretty important from a supply chain tracking perspective, and part of the comprehensive story we can tell about the supply chain (IMHO anyway).

Currently, we are only tracking the actual do_create_spdx() task as the "build" (which isn't clear in the generated SPDX, but I'll make a change to fix that), so you have to be a little bit careful about some of the conclusions you draw from that. Once we get the base SPDX 3 support in place, I want to look at having other sstate tasks generate SPDX fragments when they run, which would allow us to trace those tasks more precisely. I don't really want to solve that now though, as it is going to be quite a bit more complex to solve in a satisfactory manner.

>
-- snip --
