On Fri, 2023-09-15 at 09:40 +0100, Richard Purdie via lists.openembedded.org wrote:
> The question at this point is what do people want me to do. We clearly
> have a really nasty bug in here. The patch is "right" and we do need to
> fix this. If I merge it, I suspect I'm going to end up having to chase
> this down before we can release and I am going to struggle to find the
> time to do it and I suspect my sanity will suffer. This does look to be
> a significant issue though.

I've spent some time digging into what is going on and I really don't
like what I'm finding.

The existing .flush() call in the server logging path is basically
injecting a "sync" equivalent in the main command loop within bitbake.

That "sync" is, effectively by accident, maintaining our cache coherence,
and it is also significantly damaging certain kinds of build
performance.

If we remove it, bitbake fails to see file changes. In theory nothing
should be changing files while builds are running. However, tinfoil users
have already assumed bitbake manages its cache correctly, and with
memory resident bitbake there are signs of cache invalidation misses.
Exactly what/where/why I still haven't quite worked out. Since we run
oe-selftest with memory resident bitbake, it is particularly badly
affected, but back to back normal builds are also failing.

The challenge is that the sync flushes the inotify watches and, without
that, writes to files may not have triggered the inotify report which
we rely on to invalidate the caches. Even adding os.sync() calls into
bitbake isn't solving the problem: for performance reasons we only want
to sync when it is needed, and it isn't clear when it is actually needed.
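
To make the failure mode concrete, here is a toy model of a cache that is
only invalidated by change notifications. This is an illustrative sketch,
not bitbake's code: the class NotifyInvalidatedCache, its methods and the
"local.conf" file name are all hypothetical, and the pending-event queue
merely stands in for inotify reports that have not yet been processed.

```python
import os
import tempfile
from collections import deque


class NotifyInvalidatedCache:
    """Toy file cache that is only invalidated by change notifications.

    Hypothetical illustration; this is not how bitbake's cooker is written.
    """

    def __init__(self):
        self._cache = {}         # path -> cached file contents
        self._pending = deque()  # change events queued but not yet processed

    def notify_changed(self, path):
        # Stand-in for a change event being queued for this path.
        self._pending.append(path)

    def drain_events(self):
        # Stand-in for the point where queued events are finally processed
        # and the matching cache entries are dropped.
        while self._pending:
            self._cache.pop(self._pending.popleft(), None)

    def read(self, path):
        # Serve from the cache if present, otherwise read from disk.
        if path not in self._cache:
            with open(path) as f:
                self._cache[path] = f.read()
        return self._cache[path]


def demo():
    cache = NotifyInvalidatedCache()
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "local.conf")
        with open(path, "w") as f:
            f.write("A = 1\n")
        first = cache.read(path)   # populates the cache

        with open(path, "w") as f:
            f.write("A = 2\n")
        cache.notify_changed(path)

        stale = cache.read(path)   # event queued but not yet processed
        cache.drain_events()
        fresh = cache.read(path)   # cache entry dropped, file re-read
    return first, stale, fresh
```

In this model, demo() sees the old contents on the second read and the new
contents only after drain_events() has run. Anything that forces the
queue to be drained before each read hides the problem, at the cost of
doing that work on every pass.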

What is really needed is a step back and a re-design of the cache
management within cooker. There are too many weird code paths that
aren't actually that useful any more, and we've tried to retrofit cache
handling onto something which never had it originally, which isn't
working out so well. Given the release, I'm worried about undertaking
something like this at such a time. Equally, now I've seen the sheer
number of problems and the fact that it is just luck things happen to
work, I'm very, very worried.

I've been asked who knows this area of code with a view to working out
who we could lean on to help fix it. Sadly, I think I'm probably the
only one who does.

Cheers,

Richard


