Hi Alex,

Corruption is an unavoidable part of life: disks and networks can inject 
failures for unpredictable and unknown reasons, such as those described in 
https://www.sciencedirect.com/science/article/abs/pii/S0026271421003723. Even 
multi-layer protection is not perfect, as it depends on where the error is 
injected.

We proposed this patch as it is aligned with how we currently handle incorrect 
checksums: 
https://github.com/yoctoproject/poky/commit/672c07de4a96eb67eaafba0873eced44ec9ae1a6.

For context, we run builds at a large scale, almost 24/7. This scale 
contributes to the following challenges:
1. For us, sstate corruption happens in roughly 1 in 10,000 builds. These cases 
are extremely hard to reproduce and debug.
2. With the current behavior of re-uploading the corrupted sstate object, we end 
up in an endless loop of rebuilding that we cannot break.
We consider this Yocto behavior a bug, specifically because of #2.

I hear your point, although following your reasoning we should fail the 
entire build rather than rebuild the packages. We can explore other options, 
such as parameterizing the behavior, but it would be very useful to be able to 
break the bad rebuild loop somehow.
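
To make the parameterization option concrete, here is a rough sketch of what 
it could look like. It is not the submitted patch: the variable 
SSTATE_UNPACK_ERROR_ACTION and the helper unpack_or_quarantine are hypothetical 
names used only for illustration, and it uses plain gzip instead of zstd to 
stay self-contained. Extraction failure is always reported with exit status 2, 
and the knob only decides whether the bad object is moved aside first.

```shell
#!/bin/sh
# Hypothetical knob (not an existing Yocto variable):
#   rename - move the unreadable archive aside, then report failure
#   fail   - leave the archive in place and only report failure
SSTATE_UNPACK_ERROR_ACTION="${SSTATE_UNPACK_ERROR_ACTION:-rename}"

# unpack_or_quarantine ARCHIVE DESTDIR
# Returns 0 if the archive was extracted, 2 if extraction failed.
unpack_or_quarantine() {
    pkg="$1"
    dest="$2"

    if tar -xzf "$pkg" -C "$dest"; then
        return 0
    fi

    echo "Error extracting $pkg, file might be corrupted or truncated" >&2
    if [ "$SSTATE_UNPACK_ERROR_ACTION" = "rename" ]; then
        # Move the bad object aside so it cannot be pushed to the mirror again.
        mv "$pkg" "$pkg.unpack_error"
    fi
    return 2
}
```

With the default "rename" setting this matches the intent of our patch; with 
"fail" the corrupted object stays where it is, so whoever wants to investigate 
the corruption immediately still can.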

Thanks,
Max

On 10/7/24, 4:58 AM, "Alexander Kanavin" <[email protected]> wrote:


I don't think I can agree with this. This means stability regressions
in one's network infrastructure will go unnoticed and unfixed, and
goes contrary to some people wishing to know immediately when
'corrupted sstate' occurs, and fix it there and then. Similar patches
have been rejected in the past.


Please fix your infra.


Alex


On Sat, 5 Oct 2024 at 02:44, Yu, Max via lists.openembedded.org
<[email protected]> wrote:
>
> We sometimes observe sstate cache corruption, which causes rebuilds. That is
> not a fatal error, as the package has to be rebuilt and the updated artifact
> needs to be pushed to the remote sstate cache mirror. Currently, Yocto does
> not handle corruption properly: the corrupted artifact is not deleted or
> renamed. Later, after the package is built, the same corrupted artifact is
> pushed to the remote mirror, and the same procedure repeats again and again.
>
> This change verifies the outcome of the unpacking action and renames the
> artifact if a fatal error occurred (the "tar" tool returns error 2). In that
> case we rename the artifact, which causes a proper one to be created and
> uploaded, overwriting the existing, corrupted one in the remote mirror. That
> way we break the loop of uploading the corrupted file again and again.
>
> Suggested-by: Przemyslaw Sobon <[email protected]>
> Signed-off-by: Max Yu <[email protected]>
> ---
> meta/classes-global/sstate.bbclass | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/meta/classes-global/sstate.bbclass b/meta/classes-global/sstate.bbclass
> index 11bb892a42..5a7ce35341 100644
> --- a/meta/classes-global/sstate.bbclass
> +++ b/meta/classes-global/sstate.bbclass
> @@ -937,7 +937,12 @@ sstate_unpack_package () {
> ZSTD="pzstd -p ${ZSTD_THREADS}"
> fi
>
> - tar -I "$ZSTD" -xvpf ${SSTATE_PKG}
> + if ! tar -I "$ZSTD" -xvpf ${SSTATE_PKG}; then
> + echo "Fatal error extracting sstate cache artifacts, file might be corrupted or truncated, renaming"
> + mv ${SSTATE_PKG} ${SSTATE_PKG}.unpack_error
> + exit 2
> + fi
> +
> # update .siginfo atime on local/NFS mirror if it is a symbolic link
> [ ! -h ${SSTATE_PKG}.siginfo ] || [ ! -e ${SSTATE_PKG}.siginfo ] || touch -a ${SSTATE_PKG}.siginfo 2>/dev/null || true
> # update each symbolic link instead of any referenced file
> --
> 2.40.1
>
>
> 
>



