On Fri, 2023-09-15 at 09:40 +0100, Richard Purdie via lists.openembedded.org wrote:
> The question at this point is what do people want me to do. We clearly
> have a really nasty bug in here. The patch is "right" and we do need to
> fix this. If I merge it, I suspect I'm going to end up having to chase
> this down before we can release and I am going to struggle to find the
> time to do it and I suspect my sanity will suffer. This does look to be
> a significant issue though.

I've spent some time digging into what is going on and I really don't like what I'm finding.

The existing .flush() call in the server logging path is basically injecting a "sync" equivalent into the main command loop within bitbake. That "sync" is, effectively by accident, maintaining our cache coherence, and it is also significantly damaging certain kinds of build performance.

If we remove it, bitbake fails to see file changes. In theory, nothing should be changing files while builds are running. However, tinfoil users have already assumed bitbake manages its cache correctly, and with memory resident bitbake there are signs of cache invalidation misses. Exactly what/where/why, I still haven't quite worked out. Since we run oe-selftest with memory resident bitbake, it is particularly badly affected, but back to back normal builds are also failing.

The challenge is that the sync flushes the inotify watches. Without it, writes to files may not yet have triggered the inotify report which we rely on to invalidate the caches. Even adding os.sync() calls into bitbake isn't solving the problem: for performance reasons we only want to do this when needed, and it isn't clear when it is actually needed.

What is really needed is a step back and a re-design of the cache management within cooker. There are too many weird code paths that aren't actually that useful any more, and we've tried to retrofit cache handling onto something which never had it originally, which isn't working out so well.

Given the release, I'm worried about undertaking something like this at such a time. Equally, now that I've seen the sheer number of problems and the fact that it is just luck things happen to work, I'm very, very worried.

I've been asked who knows this area of code, with a view to working out who we could lean on to help fix it. Sadly, I think I'm probably the only one who does.

Cheers,

Richard
