Re: [OE-core] [PATCH] sstate: remove corrupted artifacts from local mirror

Richard Purdie Tue, 15 Oct 2024 16:01:58 -0700

On Tue, 2024-10-15 at 22:43 +0000, Yu, Max wrote:
> > > > So that's the tricky thing for us. We will have to run the script
> > > > at a time window where no builds are happening. Because when I
> > > > tried deleting corrupted objects from the s3 sstate cache manually,
> > > > the corrupted sstate object ended up being reuploaded by ongoing
> > > > builds... And this is pretty difficult to execute reliably, since
> > > > we have so many builds.
> > > 
> > > I'm a little bit puzzled here. How would something upload the corrupt
> > > artefact back into the system?
> > 
> > I am also puzzled by this. I was going to suggest that perhaps there's
> > a rare bug in zstd compressor that deterministically (and reproducibly)
> > creates a corrupt archive on a certain input, because ongoing builds
> > can't simply 'reupload' something corrupted that has been deleted from
> > sstate. But maybe there's some extra proprietary layer of 'overwriting'
> > and 'synchronization' where all this trouble is coming from.
> 
> Just dug into some of the details more, maybe this is related to how
> we implemented the s3 mirror. For context, our Yocto builds work like
> this for a recipe:
> 1. download the sstate siginfo+object from s3, into our local sstate
> cache. (I think this is unmodified yocto logic for mirrors just using
> SSTATE_MIRRORDIR)
> 2. build
> 3. upload the local siginfo+object back to s3. (This only happens if
> s3's object is different)


Perhaps you could change the 3rd step to look at timestamps and only
upload modified files by local timestamp? If you then delete something
from the cache, that should stop it coming back?
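
A minimal sketch of that timestamp filter, in Python since the uploader
apparently already uses boto3. Everything here is hypothetical: the
function name, the idea of recording a build start time, and above all the
assumption that files fetched from the mirror in step 1 keep an mtime older
than the build start (for example because the download step stamps them
with the remote timestamp). If downloads get a fresh mtime, this filter
would not tell them apart from locally built objects.

```python
import os
from pathlib import Path


def files_modified_since(sstate_dir, since_epoch):
    """Return files under sstate_dir whose mtime is >= since_epoch.

    Hypothetical helper: only these files would be handed to the
    upload step, so an object that was merely downloaded (and later
    deleted from the remote cache) is not pushed back.
    """
    modified = []
    for root, _dirs, files in os.walk(sstate_dir):
        for name in files:
            path = Path(root) / name
            try:
                mtime = path.stat().st_mtime
            except FileNotFoundError:
                # File vanished while we were walking; skip it.
                continue
            if mtime >= since_epoch:
                modified.append(path)
    return sorted(modified)


# Usage idea (names are assumptions, not existing tooling):
#   build_start = time.time()   # recorded before step 1
#   ... run the build ...
#   for path in files_modified_since(local_sstate_dir, build_start):
#       upload(path)            # whatever step 3 does today
```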

> The reupload happens when we delete the s3 object between steps 1 and
> 3, where step 3 will just reupload their local sstate cache objects
> (corrupted in this case). Especially with how yocto parallelizes
> tasks, this time window between 1 and 3 is quite large. And with the
> number of builds that happen, we then have no reliable way of
> cleaning out s3 sstate objects. Note that, even if we changed step 3
> to only upload if the object doesn't exist, we would still have this
> problem.

How about only if the local objects have been modified?

> 
> Reflecting on our s3 mirror setup, maybe we're doing things in a
> non-standard way, especially regarding step 3.
> 
> How do folks normally update sstate mirrors? Do people usually only
> have a single/small number of writers? (For our use case, we want to
> keep the remote sstate mirror very up to date.)

We tend not to recommend updating sstate. In theory once things are
written there, they shouldn't need to change again. Writing isn't a
problem at all as long as things are not overwritten and appear
atomically.
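
As a rough illustration of "not overwritten and appear atomically" on a
POSIX filesystem, here is a sketch in Python. It is not what the sstate
code does; the function name is made up. The file is written under a
temporary name in the same directory and then hard-linked to its final
name, and the link fails if that name already exists, so an existing
object is never replaced.

```python
import os
import tempfile


def publish_no_overwrite(data: bytes, dest: str) -> bool:
    """Create dest with the given content, never replacing an existing file.

    Returns True if dest was created, False if it already existed.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest))
    # Temp file lives in the destination directory so the link below
    # stays within one filesystem.
    fd, tmp = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        try:
            # The final name appears only once the content is complete;
            # os.link raises FileExistsError rather than overwriting.
            os.link(tmp, dest)
        except FileExistsError:
            return False
        return True
    finally:
        # Remove the temporary name; dest (if created) keeps the data.
        os.unlink(tmp)
```

A reader that opens the final name therefore sees either no file or the
complete one, and a second writer with different content for the same name
is refused instead of swapping the file underneath an ongoing transfer.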

> > No, overwriting files is a recipe for disaster. See above, you get
> > partially complete files which come out as corrupted.
> 
> I understand your concerns with overwriting. What I was suggesting
> with "overwriting" was more creating a new file and `mv`ing it, which
> should be atomic.

Creating a file with a mv is fine. Overwriting with a mv is not atomic,
at least on NFS.

Consider a build that is downloading a large sstate object, say webkit
or something else that is 100+MB. If another node does a mv over NFS to
replace that file, the NFS server does not cache all 100MB of the old
file. At some point the transfer will corrupt as the content changes.

For small files, it isn't as obvious, as the files are usually fully in
cache and the transfer completes uninterrupted.

> Same with the s3 operations we use to upload, they will not leave a
> file partially written/overwritten if we trust s3 claims.

If a user is downloading a large object and another replaces it, will
the first user get the full original object?

> > I think this overwriting may be the source of your problems. You should
> > not need to or be doing this. Why do you need to?
> >
> > It is easy to blame bit flips and put a hack into the system to force
> > an overwrite but I think you have a more fundamental issue going on
> > which you may want to fix properly.
> 
> Like I said, it's probably not something like bit flips, but since we
> only see these corruptions after moving to kirkstone, it might be
> related to things like using pzstd or other infra changes. The s3
> uploading/updating logic did not change, and we are even using the
> same boto3 (aws python client) versions. We definitely do also want
> to fix the corruption; it's just been incredibly hard to narrow down
> what is causing it... 

I'd advise caution from experience. If you fix the wrong thing, you can
set yourself up for other much harder issues down the road. I'm also
reluctant to add things into core classes to work around problems which
aren't in anything we actually maintain or have access to.

Cheers,

Richard
