On Fri, 2024-08-02 at 14:50 +0200, Marta Rybczynska wrote:
> On Thu, Aug 1, 2024 at 4:25 PM Richard Purdie
> <[email protected]> wrote:
> > On Fri, 2024-07-26 at 15:02 +0200, Marta Rybczynska wrote:
> > > On Thu, Jul 25, 2024 at 5:27 PM Richard Purdie
> > > <[email protected]> wrote:
> > > > On Thu, 2024-07-25 at 16:48 +0200, [email protected]
> > > > wrote:
> > > > > On 25.07.2024 16:29, Richard Purdie wrote:
> > > > > > Hi Marta,
> > > > > >
> > > > > > With the v3 series applied we did just see this on the
> > > > > > autobuilder unfortunately, so I'm not sure that problem is
> > > > > > addressed:
> > > > > >
> > > > > > https://autobuilder.yoctoproject.org/typhoon/#/builders/87/builds/7004/steps/14/logs/stdio
> > > > >
> > > > > Hello Richard,
> > > > > Thanks, this is unfortunate. Is it possible to have a copy of
> > > > > the corrupted database somewhere?
> > > >
> > > > I think it is transient as we never clean it up and not all
> > > > tasks fail. That seems to imply it is a race of some kind.
> > >
> > > I have a few ideas of what it might be, but I do not have a
> > > reproducer right now. With the vex changes, the duration of the
> > > cve_check operation changed slightly. On the other hand, the
> > > database download is slower these days (I have had standalone
> > > runs that lasted for 5+ hours). Also, I noticed that there were
> > > cancellations of some of the builds, so cancellation of the
> > > download may be in play too.
> > >
> > > A question: does the autobuilder configuration share DL_DIR among
> > > multiple builds?
> >
> > DL_DIR is shared between all the workers over NFS.
> >
> > > My possibility list right now:
> > > - the "download" job timeout is too short
> > > - download failure/timeout
> > > - job cancellation during the download
> >
> > While a download is in progress, the exclusive lock should be held.
> > If the database were damaged, I'd then expect all subsequent
> > cve_check tasks to fail the same way.
> >
> > In the failures, 2 or 3 tasks fail, the rest all continue to work.
> > So it doesn't really fit.
>
> I would suspect there's a fetch job running in addition somewhere and
> it manages to do the download. From that point, subsequent checks
> will work. But where does that corruption come from - no idea.
We only ever update the database through the recipe though, right? That
recipe does have the correct lockfile specified for do_fetch? That
should mean it always has an exclusive lock when updating.

> I've noticed that tests *could* cause a database update and that the
> temporary download path will be the same for all instances
> (CVE_DB_TEMP_FILE). This could cause corruption if the lock doesn't
> work as we expect it to.

I did some tests and the lock does work between different autobuilder
nodes. We've noticed that the issue only seems to happen on Ubuntu
22.04, which makes me wonder if there is a bug somewhere there, such as
in the host sqlite3.

> Now, between kirkstone and master there should be no corruption, as
> this is not the same database - the files have different names, as
> changed in 048ff0ad927f4d37cc5547ebeba9e0c221687ea6.

Steve has observed the same issue on kirkstone, only on Ubuntu 22.04.

> We could do tweaks to make sure tests do not download the database
> (CVE_DB_UPDATE_INTERVAL = "-1"). We could even do a run or two with
> that set for the whole build for all configurations, to make sure the
> corruption does not happen at runtime.
>
> We also have a standalone script to download the database (no change
> in the format from the master branch), so we can use it and then
> point builds to the copy, while disabling updates.
>
> The source is here:
> https://gitlab.com/syslinbit/public/yocto-vex-check/-/blob/main/cve-update-nvd2-native.py?ref_type=heads
>
> We can also change the location of the database and always keep it in
> TMPDIR or such. This could mean a long wait for the download.
>
> Which solution would you prefer to test?

I'm not convinced the issue is a parallel fetch; I think sqlite is
breaking somehow and it is host specific. I think we should add a
do_unpack to the recipe and work from a local copy in TMPDIR, not one
over NFS. It can copy from DL_DIR so we shouldn't lose much speed.
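Roughly the idea, as a Python sketch: copy the database out of the
shared (NFS-backed) download directory and have sqlite only ever open
the build-local file, so its file locking never runs over NFS. The
paths and names here are illustrative, not the recipe's actual
variables:

```python
import os
import shutil
import sqlite3
import tempfile

def open_local_copy(shared_db, local_dir):
    """Copy the database out of the shared download directory and open
    the build-local copy, so sqlite never operates over NFS."""
    os.makedirs(local_dir, exist_ok=True)
    local_db = os.path.join(local_dir, os.path.basename(shared_db))
    shutil.copyfile(shared_db, local_db)
    return sqlite3.connect(local_db)

# Demo with a throwaway database standing in for the DL_DIR copy:
with tempfile.TemporaryDirectory() as dl_dir, \
     tempfile.TemporaryDirectory() as tmpdir:
    shared = os.path.join(dl_dir, "nvd.db")
    conn = sqlite3.connect(shared)
    conn.execute("CREATE TABLE meta (k TEXT, v TEXT)")
    conn.execute("INSERT INTO meta VALUES ('schema', '2')")
    conn.commit()
    conn.close()

    local = open_local_copy(shared, tmpdir)
    print(local.execute("SELECT v FROM meta WHERE k='schema'").fetchone()[0])
    local.close()
```

The copy itself is a plain file copy, so it should still be protected
by the recipe's existing fetch lock while the shared file may be
changing.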
I'm willing to give that patch a go if that helps and lets you focus on
the other patches?

Cheers,

Richard
-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#202908): https://lists.openembedded.org/g/openembedded-core/message/202908
Mute This Topic: https://lists.openembedded.org/mt/107525289/21656
Group Owner: [email protected]
Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub [[email protected]]
-=-=-=-=-=-=-=-=-=-=-=-
