Re: [OE-core] [PATCH] sstate: remove corrupted artifacts from local mirror

Yu, Max via lists.openembedded.org Tue, 15 Oct 2024 11:25:31 -0700

> But can you say what your scale actually is? How many sstate objects
> are written into the shared cache per day? How often do you see
> corruptions?

Our sstate cache holds 25TB across 12M objects. I do not have the read/write
metrics, but we run ~5k image builds per day, and we were seeing the corruption
issue about once every two weeks before this patch. Our image builds range from
3k to 8k build steps, but those numbers get fuzzy since these are not builds
from scratch. (The corruptions we're facing are likely not bit flips; the
example I gave was just meant to illustrate that rare events do happen and
shouldn't simply be dismissed with "fix your infra". For what it's worth, we
only started to see these corruption issues after moving to kirkstone, where
sstate objects are now compressed with pzstd...)
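
As a rough back-of-the-envelope reading of those figures, assuming ~5k builds
per day and one corruption every two weeks are steady averages (the variable
names below are only illustrative):

```python
# Figures quoted in the message above, treated as steady averages.
builds_per_day = 5_000
days_between_corruptions = 14

# Number of image builds that run, on average, between two observed corruptions.
builds_per_corruption = builds_per_day * days_between_corruptions
corruptions_per_build = 1 / builds_per_corruption

print(f"about 1 corruption per {builds_per_corruption:,} image builds")
print(f"per-build corruption rate: {corruptions_per_build:.2e}")
```

This counts image builds, not individual sstate reads or writes, so the
per-object rate would be lower still.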

For some more context, we at AWS use Yocto to build the OS for a lot of
hardware platforms and a number of smaller component images. Take the Cartesian
product of the two and we end up needing to build a lot of images...

Specifically related to sstate caches, we have a remote sstate mirror set up in
s3. All CI builds update the s3 mirror, so we have a large number of writers.
For all our builds, we set up a local build directory per repo (a repo can have
multiple), which is where we host the local sstate cache, and we parallelize
based on build directory.

> The concern is that your patch does not overwrite the artifact, rather
> it deletes it first, and recreates it later. This creates a time
> window where a cache object exists, and then it doesn't, and then it
> exists again. This will break builds running in parallel in all sorts
> of interesting ways, as sstate is not designed for objects
> disappearing after they've been checked and confirmed to exist by
> bitbake.

This makes a lot of sense. This is a use case we didn't consider when creating
the patch, and it is not applicable to us, since we parallelize across multiple
build directories that don't share a local sstate cache. (Developer builds don't
really care about the extra local parallelization, while CI builds share the
remote sstate cache.)

> Then you can take the report and run a script that deletes the
> offending items. This all can be automated, and doesn't have to be
> executed manually.

So that's the tricky thing for us. We would have to run the script during a
time window when no builds are happening, because when I tried deleting
corrupted objects from the s3 sstate cache manually, the corrupted sstate
objects ended up being re-uploaded by ongoing builds... And this is pretty
difficult to execute reliably, since we have so many builds.

I do understand more where you're coming from now, and it does sound like the 
way we implemented this fix is not applicable for all use cases. Can we make 
this a configurable option instead? 

Or, if we don't want to remove an existing sstate object from the cache, would
it be fine to instead overwrite existing sstate cache objects after we finish
rebuilding them? With our current remote s3 sstate mirror, we already overwrite
the remote when the object is different. (I think it might be possible to
implement a marker during the decompression phase, and later overwrite the
previous sstate object after the local rebuild. I will need to look into it
more.)
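
For illustration, overwriting a local cache object without a window in which
it is missing could look roughly like the sketch below. It is not taken from
sstate.bbclass or from the patch; replace_atomically is a hypothetical helper,
and the sketch assumes a local POSIX filesystem where the temporary file and
the target sit in the same directory. The s3 mirror is out of scope here.

```python
import os
import tempfile
from pathlib import Path


def replace_atomically(target: Path, data: bytes) -> None:
    """Overwrite target so readers see either the old or the new file, never a gap."""
    # The temporary file is created next to the target so that os.replace()
    # is a rename within one filesystem rather than a copy across filesystems.
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), prefix=target.name + ".")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the new contents to disk before renaming
        os.replace(tmp_name, target)  # swaps the new file into place in one step
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

With this shape the object is never deleted first, which is the difference
from delete-then-recreate that the quoted concern is about. Whether bitbake
can tolerate an object changing contents underneath a parallel build is a
separate question that this sketch does not answer.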

Thanks,
Max

On 10/10/24, 5:19 AM, "Alexander Kanavin" <[email protected]> wrote:


On Tue, 8 Oct 2024 at 20:47, Sobon, Przemyslaw <[email protected]> wrote:
> The "only one reporting" does not mean the problem does not exist; it may
> just not be as big a problem for others as it is for us. A similar problem
> exists for DRAM memory corruption. Most people don't care about it, but for
> some it is an important problem: when you see 1 OS crash per 10 years it is
> not a big deal, but if you own 10k servers you see 3 crashes per day. That is
> the scale factor that is important. Max talked about our scale already.
> To summarize, manual work is not a solution for us due to scale.
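
The fleet arithmetic in the quoted paragraph can be checked directly. This
assumes crashes are independent and evenly spread over time, and uses a
365-day year:

```python
# Figures from the quoted paragraph.
servers = 10_000
years_between_crashes_per_server = 10
days_per_year = 365

# Expected crashes per day across the whole fleet.
crashes_per_day = servers / (years_between_crashes_per_server * days_per_year)
print(f"{crashes_per_day:.2f} crashes per day across the fleet")
```

The result is a little under three per day, in line with the "3 crashes per
day" quoted above.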

But can you say what your scale actually is? How many sstate objects
are written into the shared cache per day? How often do you see
corruptions?

Basically it helps if you introduce yourselves and your product first,
as this is, I think, your first time interacting with the community?

> I disagree, we can overwrite a bad artifact. Yocto indirectly does that, as
> it has to rebuild the package. This is "by design" behavior. And to be
> honest, there is no difference between (1) rebuilding the package every time
> and (2) overwriting the sstate cache so any other build can reuse it. Is
> there any concern around uploading such a freshly built artifact?

The concern is that your patch does not overwrite the artifact, rather
it deletes it first, and recreates it later. This creates a time
window where a cache object exists, and then it doesn't, and then it
exists again. This will break builds running in parallel in all sorts
of interesting ways, as sstate is not designed for objects
disappearing after they've been checked and confirmed to exist by
bitbake.

For example, when we do need to test cache deletions (for instance in
oe-selftest), we make super-sure that this is done on a private small
test cache that isn't shared with anything, as otherwise there have
been notoriously strange failures in random places.

The other concern I expressed to Max: this auto-recovery sweeps the
'flaky hardware' problem under the rug, instead of being loud and
clear about it. If someone had a perfectly working sstate (and many
people do, including the yocto upstream), and then it started throwing
random fails, they're not going to notice it. If someone had very rare
corruptions and then the rate increased, they're not going to notice
that either. Except when they start to wonder why builds seem to take
longer and longer and longer.

> This is a random thing; we are not in control of, e.g., DRAM bit flip errors,
> they simply happen. To simulate the situation you can inject an error yourself
> by, e.g., overwriting a random byte of the zstd file before it is uploaded.
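
A minimal sketch of that kind of fault injection, for use on a throwaway copy
of an archive only. corrupt_one_byte is a hypothetical helper, not part of
bitbake or oe-core:

```python
import random
from pathlib import Path


def corrupt_one_byte(path: Path, rng: random.Random) -> int:
    """Invert one randomly chosen byte of the file in place; return its offset."""
    data = bytearray(path.read_bytes())
    if not data:
        raise ValueError(f"{path} is empty, nothing to corrupt")
    offset = rng.randrange(len(data))
    data[offset] ^= 0xFF  # XOR with 0xFF guarantees the byte value changes
    path.write_bytes(bytes(data))
    return offset
```

Passing a seeded random.Random makes a failing case reproducible. Whether the
damage is caught at decompression time depends on where the byte lands and on
whether the archive was written with a checksum, so a single run is not proof
either way.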

Yes, I saw it. The key issue is this bit in sstate.bbclass before
actually creating the sstate archive:

    if sstate_pkg.exists():
        touch(sstate_pkg)
        return

If the sstate item exists, then it will not be replaced, even if it's been
determined to be corrupted earlier.

I don't know yet how to best handle this, but I would want to improve
*reporting* of corrupt sstate before we can decide whether yocto can
do something about it that doesn't make things worse than they are
now.

Then you can take the report and run a script that deletes the
offending items. This all can be automated, and doesn't have to be
executed manually.
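
A cleanup script of that kind could be as small as the sketch below. The
report format (one object path per line, relative to the cache directory,
with "#" comment lines) and the name delete_reported are assumptions made for
illustration; no such report exists in bitbake today. It also assumes nothing
else is using the cache while it runs, for the reasons given earlier in this
message.

```python
from pathlib import Path


def delete_reported(report: Path, cache_dir: Path) -> list[Path]:
    """Delete every cache object listed in the report; return what was removed."""
    cache_root = cache_dir.resolve()
    removed = []
    for line in report.read_text().splitlines():
        name = line.strip()
        if not name or name.startswith("#"):
            continue  # skip blank lines and comments
        candidate = (cache_root / name).resolve()
        # Refuse any entry that resolves to a path outside the cache directory.
        if cache_root not in candidate.parents:
            continue
        if candidate.is_file():
            candidate.unlink()
            removed.append(candidate)
    return removed
```

Entries that no longer exist are skipped silently, so the script can be re-run
on the same report.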

Alex

View/Reply Online (#205923):
https://lists.openembedded.org/g/openembedded-core/message/205923