Hi Alex,

Corruptions are an unavoidable part of life: disks and networks can inject failures for unpredictable and unknown reasons, such as those described in https://www.sciencedirect.com/science/article/abs/pii/S0026271421003723. Even multi-layer protection is not perfect, as it depends on where the error is injected.

We proposed this patch because it is aligned with how we currently handle incorrect checksums: https://github.com/yoctoproject/poky/commit/672c07de4a96eb67eaafba0873eced44ec9ae1a6.

For context, we have builds running at a large scale, almost 24/7. This scale contributes to the following challenges:

1. sstate corruption happens for us in roughly 1 in 10,000 builds. These cases are extremely hard to reproduce and debug.
2. With the current behavior of re-uploading the corrupted sstate object, we end up in an endless loop of rebuilding that we cannot break.

We considered this Yocto behavior a bug, specifically because of #2.

I hear your point, although following your reasoning we should fail the entire build rather than rebuild the packages. We can explore other options, such as parameterizing the behavior, but it would be very useful to be able to break the bad rebuild loop somehow.

Thanks,
Max

On 10/7/24, 4:58 AM, "Alexander Kanavin" <[email protected]> wrote:

I don't think I can agree with this. This means stability regressions in one's network infrastructure will go unnoticed and unfixed, and goes contrary to some people wishing to know immediately when 'corrupted sstate' occurs, and fix it there and then.

Similar patches have been rejected in the past. Please fix your infra.

Alex

On Sat, 5 Oct 2024 at 02:44, Yu, Max via lists.openembedded.org <[email protected]> wrote:
>
> We observe sstate cache corruptions sometimes which cause rebuilds. That is not
> a fatal error as the package has to be rebuilt and updated artifact needs to be
> pushed to remote sstate cache mirror. Currently, Yocto does not handle
> corruptions properly, where the corrupted artifact is not deleted or renamed.
> Later, after the package is built the same corrupted artifact is pushed
> to remote mirror and the same procedure is circled again and again.
>
> This change verifies the outcome of the unpacking action and renames the
> artifact if a fatal error occurred ("tar" tool returns error 2). In such case we
> rename the artifact, which causes a proper one to be created and uploaded,
> overwriting the existing one - the corrupted one - in the remote mirror. That
> way we break the loop of uploading corrupted file again and again.
>
> Suggested-by: Przemyslaw Sobon <[email protected]>
> Signed-off-by: Max Yu <[email protected]>
> ---
>  meta/classes-global/sstate.bbclass | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/meta/classes-global/sstate.bbclass b/meta/classes-global/sstate.bbclass
> index 11bb892a42..5a7ce35341 100644
> --- a/meta/classes-global/sstate.bbclass
> +++ b/meta/classes-global/sstate.bbclass
> @@ -937,7 +937,12 @@ sstate_unpack_package () {
>  		ZSTD="pzstd -p ${ZSTD_THREADS}"
>  	fi
>
> -	tar -I "$ZSTD" -xvpf ${SSTATE_PKG}
> +	if ! tar -I "$ZSTD" -xvpf ${SSTATE_PKG}; then
> +		echo "Fatal error extracting sstate cache artifacts, file might be corrupted or truncated, renaming"
> +		mv ${SSTATE_PKG} ${SSTATE_PKG}.unpack_error
> +		exit 2
> +	fi
> +
>  	# update .siginfo atime on local/NFS mirror if it is a symbolic link
>  	[ ! -h ${SSTATE_PKG}.siginfo ] || [ ! -e ${SSTATE_PKG}.siginfo ] || touch -a ${SSTATE_PKG}.siginfo 2>/dev/null || true
>  	# update each symbolic link instead of any referenced file
> --
> 2.40.1
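
To make the "parameterizing the behavior" option from the reply above a bit more concrete, here is a minimal standalone sketch of what a switchable policy could look like. It is only an illustration under assumptions: the function name unpack_with_policy and the policy values "rename" and "fail" are hypothetical and do not exist in sstate.bbclass, and the sketch calls plain tar without the -I "$ZSTD" compressor option used in the patch.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of sstate.bbclass: unpack an archive and
# apply a caller-selected policy when extraction fails.
#   $1 = path to the archive
#   $2 = policy: "rename" (move the bad file aside) or "fail" (leave it in place)
unpack_with_policy() {
    pkg=$1
    policy=${2:-fail}

    # Extraction succeeded: nothing else to do.
    if tar -xpf "$pkg"; then
        return 0
    fi

    echo "Error extracting $pkg, file might be corrupted or truncated" >&2
    if [ "$policy" = "rename" ]; then
        # Move the bad file aside so it cannot be picked up and uploaded again.
        mv "$pkg" "$pkg.unpack_error"
    fi
    # Report failure to the caller in both cases.
    return 2
}
```

With "fail" the corrupted file stays where it is and the error is visible immediately, which matches the wish to notice infrastructure problems; with "rename" the file is moved aside as in the proposed patch. Both cases still return a non-zero status, so the caller decides whether to rebuild or abort.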
View/Reply Online (#205277): https://lists.openembedded.org/g/openembedded-core/message/205277
