On Tue, 2024-10-15 at 18:25 +0000, Yu, Max via lists.openembedded.org wrote:
>
> > But can you say what your scale actually is? How many sstate objects
> > are written into the shared cache per day? How often do you see
> > corruptions?
>
> Our sstate cache has 25TB and 12M objects. I do not have the
> read/write metrics, but we run ~5k image builds per day, and we were
> seeing the corruption issue about once per two weeks before this
> patch. Our image builds range from 3k - 8k build steps, but these
> numbers become funky since these are not builds from scratch. (These
> corruptions we're facing are likely not bit flips, the example I gave
> was just trying to illustrate how rare events can happen that
> shouldn't just be summarized with "fix your infra". If it makes
> things better or worse, we only started to see these corruption
> issues after moving to kirkstone where they are now compressed with
> pzstd...)
>
> For some more context, we at AWS use Yocto to build the OS for a lot
> of hardware platforms and a number of smaller component images.
> Cartesian product the two and we end up with needing to build a lot
> of images...
>
> Specifically related to sstate caches, we have a remote sstate mirror
> setup in s3. All CI builds update the s3 mirror, so we have a large
> number of writers. For all our builds, we setup a local build
> directory per repo (a repo can have multiple), which is where we host
> the local sstate cache, and we parallelize based on build directory.
>
> > The concern is that your patch does not overwrite the artifact, rather
> > it deletes it first, and recreates it later. This creates a time
> > window where a cache object exists, and then it doesn't, and then it
> > exists again. This will break builds running in parallel in all sorts
> > of interesting ways, as sstate is not designed for objects
> > disappearing after they've been checked and confirmed to exist by
> > bitbake.
>
> This makes a lot of sense. This is a use case we didn't consider when
> creating the patch and not applicable to us. (Since we parallelize by
> multiple build directories that don't share a local sstate cache.
> Developer builds don't really care about the extra local
> parallelization, while CI builds share the remote sstate cache.)
>
> > Then you can take the report and run a script that deletes the
> > offending items. This all can be automated, and doesn't have to be
> > executed manually.
>
> So that's the tricky thing for us. We will have to run the script at
> a time window where no builds are happening. Because when I tried
> deleting corrupted objects from the s3 sstate cache manually, the
> corrupted sstate object ended up being reuploaded by ongoing
> builds... And this is pretty difficult to execute reliably, since we
> have so many builds.

I'm a little bit puzzled here. How would something upload the corrupt
artefact back into the system?

The system has been very carefully designed around how we put
artefacts into the sstate directory. We do it by writing to a temporary
file, then moving it into place once the file is 100% complete. The
move is atomic even over NFS, so something will always win. Once the
files exist, we don't overwrite them. This is essential to prevent
something from reading a half-written file.

I ask about "uploading the corrupt" artefact because this doesn't
happen in what we do out of the box. I know you put everything into s3,
and I worry that the s3 code has races in it which somehow let workers
see incomplete files, or overwrite existing files with slightly
different ones, causing the corruption.

> I do understand more where you're coming from now, and it does sound
> like the way we implemented this fix is not applicable for all use
> cases. Can we make this a configurable option instead?
>
> Or if we don't want to remove an existing sstate object from the
> cache, would it be fine to then overwrite existing sstate cache
> objects after we finish rebuild it?

No, overwriting files is a recipe for disaster. See above: you get
partially complete files which come out as corrupted. We have enough
scale on our autobuilders that we could see that ourselves.

> With our current remote s3 sstate mirror, we already overwrite the
> remote when the object is different. (I think it might be possible to
> implement a marker during the decompression phase, and later
> overwrite the previous sstate object after the local rebuilds. Will
> need to look more into it. )

I think this overwriting may be the source of your problems. You should
not need to do this, and you should not be doing it. Why do you need
to?

It is easy to blame bit flips and put a hack into the system to force
an overwrite, but I think you have a more fundamental issue going on
which you may want to fix properly.

Cheers,

Richard
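
For illustration, the write-to-a-temporary-file-then-move pattern
described above can be sketched roughly as follows. This is a minimal
sketch under stated assumptions, not the actual sstate code: the
function name `publish_artifact` and the `.tmp-` prefix are
hypothetical, and it assumes the temporary file and the destination
live on the same filesystem so that the rename is atomic.

```python
import os
import tempfile


def publish_artifact(data: bytes, dest: str) -> bool:
    """Publish data at dest without ever exposing a partial file.

    Hypothetical helper for illustration only. Returns True if this
    call moved a file into place, False if dest already existed.
    """
    # Existing objects are never overwritten.
    if os.path.exists(dest):
        return False

    # Create the temporary file next to the destination so the final
    # rename stays within one filesystem.
    dest_dir = os.path.dirname(os.path.abspath(dest))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Atomic move: readers see either no file or the complete file.
        # If two writers race past the check above, one rename simply
        # wins; neither exposes a half-written object.
        os.replace(tmp_path, dest)
        return True
    finally:
        # Clean up the temporary file if the move did not happen.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
```

The point of the sketch is that a reader can never observe a partially
written object, which an in-place overwrite cannot guarantee.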
