> But can you say what your scale actually is? How many sstate objects
> are written into the shared cache per day? How often do you see
> corruptions?

Our sstate cache has 25 TB and 12M objects. I do not have the read/write metrics, but we run ~5k image builds per day, and we were seeing the corruption issue about once every two weeks before this patch. Our image builds range from 3k to 8k build steps, but these numbers become funky since these are not builds from scratch.

(These corruptions we're facing are likely not bit flips; the example I gave was just trying to illustrate how rare events can happen that shouldn't just be summarized with "fix your infra". If it makes things better or worse, we only started to see these corruption issues after moving to kirkstone, where sstate objects are now compressed with pzstd...)

For some more context, we at AWS use Yocto to build the OS for a lot of hardware platforms and a number of smaller component images. Take the Cartesian product of the two and we end up needing to build a lot of images...

Specifically related to sstate caches, we have a remote sstate mirror set up in s3. All CI builds update the s3 mirror, so we have a large number of writers. For all our builds, we set up a local build directory per repo (a repo can have multiple), which is where we host the local sstate cache, and we parallelize based on build directory.

> The concern is that your patch does not overwrite the artifact, rather
> it deletes it first, and recreates it later. This creates a time
> window where a cache object exists, and then it doesn't, and then it
> exists again. This will break builds running in parallel in all sorts
> of interesting ways, as sstate is not designed for objects
> disappearing after they've been checked and confirmed to exist by
> bitbake.

This makes a lot of sense. This is a use case we didn't consider when creating the patch, and it is not applicable to us. (We parallelize by multiple build directories that don't share a local sstate cache. Developer builds don't really care about the extra local parallelization, while CI builds share the remote sstate cache.)
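To illustrate the time window described in the quote above, here is a minimal Python sketch, assuming a cache on a local POSIX filesystem, of replacing an object in place instead of deleting and recreating it. `replace_atomically` is a hypothetical helper and not something that exists in sstate.bbclass. It writes the new content to a temporary file in the same directory and renames it over the old one with `os.replace`, so the path never goes missing in between.

```python
import os
import tempfile
from pathlib import Path


def replace_atomically(target: Path, data: bytes) -> None:
    """Replace `target` with `data` without a window where it is missing.

    Hypothetical helper for illustration only; assumes a local POSIX
    filesystem, where renaming within one directory is atomic.
    """
    # The temp file must live in the same directory (same filesystem) as
    # the target, otherwise the rename cannot be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent),
                                    prefix=target.name + ".")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Readers see either the old file or the new one, never "no file".
        os.replace(tmp_name, target)
    except BaseException:
        # Remove the temp file if we failed before the rename happened.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

This only covers a local cache directory. A remote mirror would need whatever equivalent its storage backend offers, and this sketch says nothing about that.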
> Then you can take the report and run a script that deletes the
> offending items. This all can be automated, and doesn't have to be
> executed manually.

So that's the tricky thing for us. We would have to run the script in a time window where no builds are happening, because when I tried deleting corrupted objects from the s3 sstate cache manually, the corrupted sstate objects ended up being reuploaded by ongoing builds... And this is pretty difficult to execute reliably, since we have so many builds.

I do understand more where you're coming from now, and it does sound like the way we implemented this fix is not applicable for all use cases. Can we make this a configurable option instead? Or, if we don't want to remove an existing sstate object from the cache, would it be fine to overwrite the existing sstate cache object after we finish rebuilding it? With our current remote s3 sstate mirror, we already overwrite the remote when the object is different.

(I think it might be possible to implement a marker during the decompression phase, and later overwrite the previous sstate object after the local rebuilds. Will need to look more into it.)

Thanks,
Max

On 10/10/24, 5:19 AM, "Alexander Kanavin" <[email protected]> wrote:

On Tue, 8 Oct 2024 at 20:47, Sobon, Przemyslaw <[email protected]> wrote:

> The "only one reporting" does not mean the problem does not exist, it may
> just not be as big a problem for others as for us. A similar problem exists
> for DRAM memory corruptions. Most people don't care about that, but for
> some this is an important problem, e.g. when you see 1 OS crash per 10 years
> it is not a big deal, but if you own 10k servers you see 3 crashes per day.
> That is the scale factor that is important. Max talked about our scale
> already.
> Summarizing, the manual work is not a solution for us due to scale.

But can you say what your scale actually is? How many sstate objects are written into the shared cache per day? How often do you see corruptions? Basically it helps if you introduce yourselves and your product first, as this is, I think, your first time interacting with the community?

> I disagree, we can overwrite a bad artifact. Yocto indirectly does that as
> it has to rebuild the package. This is "by design" behavior. And to be
> honest, there is no difference between (1) rebuilding the package every
> time and (2) overwriting sstate cache so any other build can reuse it. Is
> there any concern around uploading such a freshly built artifact?

The concern is that your patch does not overwrite the artifact, rather it deletes it first, and recreates it later. This creates a time window where a cache object exists, and then it doesn't, and then it exists again. This will break builds running in parallel in all sorts of interesting ways, as sstate is not designed for objects disappearing after they've been checked and confirmed to exist by bitbake.

For example, when we do need to test cache deletions (for instance in oe-selftest), we make super-sure that this is done on a private small test cache that isn't shared with anything, as otherwise there have been notoriously strange failures in random places.

The other concern I expressed to Max: this auto-recovery sweeps the 'flaky hardware' problem under the rug, instead of being loud and clear about it. If someone had a perfectly working sstate (and many people do, including the yocto upstream), and then it started throwing random failures, they're not going to notice it. If someone had very rare corruptions and then the rate increased, they're not going to notice that either. Except when they start to wonder why builds seem to take longer and longer and longer.

> This is a random thing, we are not in control of e.g. DRAM bit flip errors,
> they simply happen. To simulate the situation you can inject an error
> yourself by e.g. overwriting a random byte of the zstd file before it is
> uploaded.

Yes, I saw it. The key issue is this bit in sstate.bbclass before actually creating the sstate archive:

    if sstate_pkg.exists():
        touch(sstate_pkg)
        return

If the sstate item exists, then it will not be replaced, even if it's been determined to be corrupted earlier.

I don't know yet how to best handle this, but I would want to improve *reporting* of corrupt sstate before we can decide whether yocto can do something about it that doesn't make things worse than they are now.

Then you can take the report and run a script that deletes the offending items. This all can be automated, and doesn't have to be executed manually.

Alex
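
A rough Python sketch of the reporting step discussed above, offered under stated assumptions and not as anything yocto provides today. It assumes sstate archives end in `.tar.zst` and that the `zstd` command-line tool is installed, using its `-t` (test) mode as the integrity check. `zstd_ok`, `find_corrupt` and `write_report` are hypothetical names. The check is pluggable so a different validator can be swapped in.

```python
import subprocess
from pathlib import Path
from typing import Callable, List


def zstd_ok(path: Path) -> bool:
    """Return True if `zstd -t` accepts the file (needs the zstd CLI)."""
    result = subprocess.run(
        ["zstd", "-t", "-q", str(path)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_corrupt(cache_dir: Path,
                 is_valid: Callable[[Path], bool] = zstd_ok) -> List[Path]:
    """Return archives under cache_dir that fail the integrity check.

    Assumption: sstate archives are the files matching *.tar.zst.
    """
    return sorted(p for p in cache_dir.rglob("*.tar.zst") if not is_valid(p))


def write_report(corrupt: List[Path], report: Path) -> None:
    """Write one path per line so a separate step can act on the list."""
    report.write_text("".join(f"{p}\n" for p in corrupt))
```

The sketch stops at producing the report and deliberately does not delete anything. As Max notes earlier in the thread, deleting from a shared mirror while builds are running can lead to the same object being reuploaded, so the removal step needs its own coordination. Also note that `zstd -t` only checks the compressed stream, so corruption introduced before compression would not be detected by it.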
View/Reply Online (#205923): https://lists.openembedded.org/g/openembedded-core/message/205923
