On Tue, 8 Oct 2024 at 20:47, Sobon, Przemyslaw <[email protected]> wrote:
> The "only one reporting" does not mean the problem does not exist, it may be
> just not as big problem for others as for us. Similar problem exist for DRAM
> memory corruptions? Most of the people don't care about that but for some
> this is important problem, e.g. when you see 1 OS crash per 10 years it is
> not a big deal but if you own 10k servers you see 3 crashes per day. That is
> the scale factor that is important. Max talked about our scale already.
> Summarizing, the manual work is not a solution for us due to scale.

But can you say what your scale actually is? How many sstate objects
are written into the shared cache per day? How often do you see
corruptions?

Basically, it helps if you introduce yourselves and your product first,
as I think this is your first time interacting with the community?

> I disagree, we can overwrite bad artifact. Yocto indirectly does that as it
> has to rebuild the package. This is "by design" behavior. And to be honest,
> there is no difference between (1) rebuilding the package every time and (2)
> overwriting sstate cache so any other build can reuse it. Is there any
> concern around uploading such freshly built artifact?

The concern is that your patch does not overwrite the artifact, rather
it deletes it first, and recreates it later. This creates a time
window where a cache object exists, and then it doesn't, and then it
exists again. This will break builds running in parallel in all sorts
of interesting ways, as sstate is not designed for objects
disappearing after they've been checked and confirmed to exist by
bitbake.
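
To illustrate the difference between "delete, then recreate" and a true
overwrite, here is a minimal sketch, assuming a local POSIX filesystem
(a shared network cache may behave differently). The helper name
publish_atomically is hypothetical and is not part of sstate.bbclass or
of the patch under discussion; it only shows the general
write-to-temp-then-rename technique, where readers see either the old
file or the new one and never a missing file.

```python
import os
import tempfile
from pathlib import Path


def publish_atomically(data: bytes, target: Path) -> None:
    """Hypothetical helper: replace target without a window where it is missing."""
    # Create the temporary file next to the target so the final rename
    # stays within one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=target.name + ".")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the new file in over any existing target in a
        # single rename step; the target path never disappears.
        os.replace(tmp_name, target)
    except BaseException:
        # On failure, remove the leftover temporary file and re-raise.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

This only addresses the "object disappears" window; it says nothing about
whether silently replacing a corrupt object is desirable.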

For example, when we do need to test cache deletions (for instance in
oe-selftest), we make super-sure that this is done on a private small
test cache that isn't shared with anything, as otherwise there have
been notoriously strange failures in random places.

The other concern I expressed to Max: this auto-recovery sweeps the
'flaky hardware' problem under the rug, instead of being loud and
clear about it. If someone had a perfectly working sstate (and many
people do, including Yocto upstream), and then it started throwing
random failures, they're not going to notice it. If someone had very rare
corruptions and then the rate increased, they're not going to notice
that either. Except when they start to wonder why builds seem to take
longer and longer and longer.

> This is random thing, we are not in control of e.g. DRAM bit flip error, they
> simply happen. To simulate the situation you can inject an error yourself by
> e.g., overwriting the random byte of the zstd file before it is uploaded.

Yes, I saw it. The key issue is this bit in sstate.bbclass before
actually creating the sstate archive:

        if sstate_pkg.exists():
            touch(sstate_pkg)
            return

If the sstate item exists, it will not be replaced, even if it was
determined to be corrupted earlier.

I don't know yet how to best handle this, but I would want to improve
*reporting* of corrupt sstate before we can decide whether yocto can
do something about it that doesn't make things worse than they are
now.

Then you can take the report and run a script that deletes the
offending items. This all can be automated, and doesn't have to be
executed manually.
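
A rough sketch of what such a script could look like, under stated
assumptions: the report format (one cache-relative path per line, with
"#" comment lines) and the function name delete_reported are
hypothetical, since no such report exists in bitbake today. Given the
concern above about objects disappearing under running builds, it would
presumably be run while nothing else is using the cache.

```python
from pathlib import Path


def delete_reported(report: Path, cache_root: Path) -> list[Path]:
    """Delete cache objects listed in a (hypothetical) corruption report."""
    root = cache_root.resolve()
    removed = []
    for line in report.read_text().splitlines():
        entry = line.strip()
        # Skip blank lines and comments in the assumed report format.
        if not entry or entry.startswith("#"):
            continue
        candidate = (root / entry).resolve()
        # Refuse anything that resolves outside the cache directory.
        if not candidate.is_relative_to(root):
            continue
        if candidate.is_file():
            candidate.unlink()
            removed.append(candidate)
    return removed
```

The returned list can be logged, so that a rising corruption rate stays
visible rather than being silently cleaned up.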

Alex