> > > So that's the tricky thing for us. We will have to run the script at
> > > a time window where no builds are happening. Because when I tried
> > > deleting corrupted objects from the s3 sstate cache manually, the
> > > corrupted sstate object ended up being reuploaded by ongoing
> > > builds... And this is pretty difficult to execute reliably, since we
> > > have so many builds.
> >
> > I'm a little bit puzzled here. How would something upload the corrupt
> > artefact back into the system?
>
> I am also puzzled by this. I was going to suggest that perhaps there's
> a rare bug in zstd compressor that deterministically (and
> reproducibly) creates a corrupt archive on a certain input, because
> ongoing builds can't simply 'reupload' something corrupted that has
> been deleted from sstate. But maybe there's some extra proprietary
> layer of 'overwriting' and 'synchronization' where all this trouble is
> coming from.

I just dug into the details some more; this may be related to how we 
implemented the s3 mirror. For context, our Yocto builds work like this for a 
recipe:
1. Download the sstate siginfo+object from s3 into our local sstate cache. (I 
think this is unmodified Yocto logic for mirrors, just using SSTATE_MIRRORDIR.)
2. Build.
3. Upload the local siginfo+object back to s3. (This only happens if s3's 
object is different.)

The reupload happens when we delete the s3 object between steps 1 and 3: in 
step 3 the build simply reuploads its local sstate cache object (corrupted in 
this case). Especially with how Yocto parallelizes tasks, the time window 
between steps 1 and 3 is quite large. With the number of builds we run, we 
have no reliable way of cleaning out s3 sstate objects. Note that even if we 
changed step 3 to upload only when the object doesn't exist, we would still 
have this problem.
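
To make the race concrete, here is a minimal toy model of steps 1 to 3. It is 
only a sketch: a plain dict stands in for the s3 bucket, and the function 
names and the object key are made up for illustration, not taken from our 
real upload code.

```python
# Toy model of the race. A dict stands in for the s3 bucket (an assumption
# for illustration; this is not the real upload logic).

def upload_if_different(bucket, key, local_bytes):
    """Step 3 as described: upload when the remote copy differs or is missing."""
    if bucket.get(key) != local_bytes:
        bucket[key] = local_bytes
        return True
    return False


def upload_if_missing(bucket, key, local_bytes):
    """Alternative policy: upload only when the key does not exist remotely."""
    if key not in bucket:
        bucket[key] = local_bytes
        return True
    return False


def simulate(upload_policy):
    """Return True if the corrupt object is back in the bucket after step 3."""
    key = "sstate:example.tar.zst"     # hypothetical object name
    corrupt = b"corrupt-archive"
    bucket = {key: corrupt}            # s3 already holds the corrupt object
    local_cache = {key: bucket[key]}   # step 1: the build downloads it
    del bucket[key]                    # cleanup deletes it while the build runs
    # step 2: the build runs and reuses the local (corrupt) object
    upload_policy(bucket, key, local_cache[key])  # step 3
    return bucket.get(key) == corrupt


print(simulate(upload_if_different))
print(simulate(upload_if_missing))
```

In this model both policies put the corrupt object back, because the delete 
lands after the build has already taken its local copy.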

Reflecting on our s3 mirror setup, maybe we're doing things in a non-standard 
way, especially regarding step 3.

How do folks normally update sstate mirrors? Do people usually only have a 
single/small number of writers? (For our use case, we want to keep the remote 
sstate mirror very up to date.)


> No, overwriting files is a recipe for disaster. See above, you get
> partially complete files which come out as corrupted.

I understand your concerns with overwriting. What I meant by "overwriting" 
was creating a new file and `mv`ing it into place, which should be atomic. 
The same goes for the s3 operations we use to upload: they will not leave a 
file partially written or overwritten, if we trust s3's claims.
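
For the local-file case, the pattern I have in mind looks roughly like the 
sketch below. `atomic_write` is a hypothetical helper, not something that 
exists in our tooling or in Yocto, and the rename is only atomic when the 
temporary file is on the same filesystem as the destination.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write data to a temp file next to path, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the destination directory so the final rename
    # stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # push the bytes to disk before renaming
        os.replace(tmp_path, path)   # readers see the old or the new file
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A reader opening `path` at any point gets either the complete old contents or 
the complete new contents, never a half-written file.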

> I think this overwriting may be the source of your problems. You should
> not need to or be doing this. Why do you need to?
> 
> It is easy to blame bit flips and put a hack into the system to force
> an overwrite but I think you have a more fundamental issue going on
> which you may want to fix properly.

Like I said, it's probably not something like bit flips, but since we have 
only seen these corruptions after moving to kirkstone, it might be related to 
things like using pzstd or other infra changes. The s3 uploading/updating 
logic did not change, and we are even using the same boto3 (AWS Python client) 
versions. We definitely do also want to fix the corruption; it's just been 
incredibly hard to narrow down what is causing it.

Thanks for your help and patience,
Max
