Re: [Proposal] Speculative Actions: Predictive Cache Priming for BuildStream

Sander Striker Mon, 27 Oct 2025 19:13:52 -0700

Hi Abderrahim,

On Sun, Oct 26, 2025 at 6:32 PM Abderrahim Kitouni <[email protected]>
wrote:


> Hi Sander,
>
> On Mon, Oct 20, 2025 at 18:09, Sander Striker <[email protected]> wrote:
> >
> > Hi,
> >
> > I’ve been working on a proposal that I’m really excited to share.  It could
> > significantly improve BuildStream’s performance on large dependency graphs,
> > especially for C/C++ codebases, without changing correctness or requiring
> > new semantics from build tools.  This builds on top of the current state of
> > the art, using both Remote Execution for parallelism and using recc to
> > cache and remote execute compilation at the translation unit level.
> >
> > I've put the full proposal in issue #2083
> > <https://github.com/apache/buildstream/issues/2083>, as that renders the
> > included diagrams nicely.  I'll not rehash it here as it is quite long.  A
> > quick summary:
> > When we build an element, sub-actions are spawned, e.g. compile through
> > recc.  We can record these subactions, and in a next build speculatively
> > execute those same sub-actions, with updated files, at the start of the
> > build.  We can do this aggressively for all actions regardless of what
> > element they correspond to in the dependency graph.  When elements are
> > built, the sub-actions within will receive cache hits, dramatically
> > speeding up the overall end-to-end build.
> >
> > I am quite happy with the limited amount of additional bookkeeping that is
> > required to implement the proposal.  I do think it is a step towards
> > Bazel's promise of "{ Fast, Correct } — Choose two", while not requiring
> > all projects/elements to adopt a different build system.
> >
> > I’d love to hear thoughts on
> >
> >    - the proposal in general
>
> I've talked with a few colleagues about this proposal and overall
> people are confused. I think it would be useful to state your premise
> before diving into the technical details.
>

This is the risk of long proposals.  I was hoping that the problem statement
captured it: "BuildStream orchestrates repository-level builds across many
elements, allowing each element to use its native build system, but not
achieving the fine-grained parallelism available within individual
compilation units. C/C++ projects are particularly affected; translation
units could compile in parallel but instead wait sequentially for their
element's turn in the build graph."


> Let me try to state my understanding, and please let me know if it's
> correct. We assume that:
> * The project is already using RECC (or similar) in BuildStream
> * While this provides a welcome improvement, we want even faster builds
> * The user is using remote execution (and not just building locally
> using the remote execution API), and has access to a large "build grid"
> that is potentially sitting idle
> * One thing we could do is try to predict actions that RECC will
> execute and schedule them early, so that RECC will find them in the
> (remote) action cache when needed
>

This is a very reasonable interpretation :-).  I would not go as far as
assuming a "large build grid that is sitting idle", but yes, you would want
to use a remote execution system to benefit; maximizing parallelism on a
single host does not take full advantage of BuildStream's native Remote
Execution support, which should allow it to go as parallel as possible with
the elements that can build at each point in time.
The assumption of using recc is spot on: recc generates a compile action per
translation unit (e.g. a .cpp file), and these actions are crucial to the proposal.
It is these actions that we can modify and speculatively execute regardless
of what element the translation unit belongs to.  This allows the elements
that build at a later point to benefit from recc cache hits.
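
To make the mechanism concrete, here is a minimal sketch of cache priming.
It assumes an action is keyed by its command line plus the digests of its
input files.  All names (ActionCache, speculate, the file names) are
hypothetical and are not recc or BuildStream APIs; the "compile" is a
stand-in string.

```python
import hashlib


def action_key(command, inputs):
    """Key an action by its command line and the digests of its input files."""
    h = hashlib.sha256()
    h.update("\0".join(command).encode())
    for path in sorted(inputs):
        file_digest = hashlib.sha256(inputs[path]).hexdigest()
        h.update(b"\0" + path.encode() + b"\0" + file_digest.encode())
    return h.hexdigest()


class ActionCache:
    """Toy action cache: executes an action only when its key is unknown."""

    def __init__(self):
        self._results = {}
        self.executions = 0

    def run(self, command, inputs):
        key = action_key(command, inputs)
        hit = key in self._results
        if not hit:
            self.executions += 1
            # Stand-in for a real (remote) compile of the translation unit.
            self._results[key] = f"object code for {command[-1]}"
        return self._results[key], hit


def speculate(cache, recorded_actions, current_files):
    """Replay actions recorded in a previous build against the current files."""
    for command, input_paths in recorded_actions:
        inputs = {path: current_files[path] for path in input_paths}
        cache.run(command, inputs)


# Recorded during the previous build of some element (hypothetical data).
recorded = [(["g++", "-c", "a.cpp"], ["a.cpp", "a.h"])]
# File contents as they are at the start of the new build.
files = {"a.cpp": b"int f() { return 2; }", "a.h": b"int f();"}

cache = ActionCache()
speculate(cache, recorded, files)                    # primes the cache early
_, hit = cache.run(["g++", "-c", "a.cpp"], files)    # the element's real build
```

In this toy model the real build gets a hit only when the speculative replay
used exactly the same command and input contents; a wrong guess costs one
wasted execution but never changes the result of the real build.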

In the abstract, let's assume that translation unit compilation is the
slowest part of the build.  Then if you take the slowest translation unit
per element, and build the elements in dependency order, you can project
the minimal time that you need.  Now if you could start these translation
units all at the same time, then you are looking at just the single slowest
of all translation units (plus the cache hit overhead).
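
As a rough illustration of that projection, the sketch below compares the two
bounds on a made-up graph.  The element names, timings, dependencies and the
single cache hit overhead term are all assumptions for the example, not
measurements.

```python
from functools import lru_cache

# Hypothetical: compile time in seconds of the slowest translation unit
# in each element.
slowest_tu = {"glib": 40, "gtk": 90, "webkit": 300, "app": 20}

# Hypothetical build dependencies between elements.
deps = {
    "glib": [],
    "gtk": ["glib"],
    "webkit": ["glib", "gtk"],
    "app": ["webkit"],
}


@lru_cache(maxsize=None)
def finish_time(element):
    """Earliest finish when an element waits for all its dependencies."""
    start = max((finish_time(dep) for dep in deps[element]), default=0)
    return start + slowest_tu[element]


def element_ordered_bound():
    """Lower bound when translation units wait for their element's turn."""
    return max(finish_time(element) for element in deps)


def speculative_bound(cache_hit_overhead=1):
    """Lower bound when every translation unit starts at the same time."""
    return max(slowest_tu.values()) + cache_hit_overhead


print(element_ordered_bound())  # 450: glib -> gtk -> webkit -> app
print(speculative_bound())      # 301: slowest unit plus the overhead
```

This deliberately ignores everything except translation unit compilation
(linking, configure steps, scheduling latency), so it only shows the shape of
the argument, not an expected speedup.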

> Given this, I think it might be too early to start working on this. We
> should look at how much improvement RECC-in-BuildStream brings by
> itself, before trying to optimize further. I have a draft merge
> request for freedesktop-sdk [1], and I hope to start working on
> integrating it in gnome-build-meta [2] once that MR lands.
>

That's definitely great to see.


> Another point is that I'm not sure the complexity of this proposal is
> proportionate to the performance improvement we hope to gain. Happy to
> be proven wrong on this aspect, though :-)
>

The complexity may turn out to be less than expected :-)  And the gains,
well, I would love to do a PoC, so we can test it out.


> >    - the best way to integrate speculative action scheduling with
> >    BuildStream’s element execution model
>
> I think the best way to implement this would be as an additional
> (optional) stage between source fetch and build.


Yes, that was what I was thinking as well.  A new queue in between the pull
queue and build queue.


> I'm not very familiar
> with the inner workings of the scheduler, but I think it is possible
> to have the job run when all the build dependencies of an element have
> either their sources or artifacts ready. We could potentially require
> that some elements have their artifacts, and not accept only sources
> (I'm thinking of stuff like compilers here). At this point, we can
> start scheduling the speculative actions. I feel that we could
> schedule more speculative actions as more build dependencies have
> their artifacts ready, but this means potentially scheduling the
> "cache priming" job more than once for any given element.
>
> One thing I notice is that in your proposal, generating speculative
> actions happens after the element is built. My thinking above points
> more towards having it part of the second build.


I think this would be another queue, so that the build queue is not held up
by speculative action generation.


> Either way, there is
> something important to consider: we need to store information between
> the two builds. Your proposal seems to suggest putting it in the
> artifact, but I am not sure how to retrieve the old artifact from the
> new build: the weak cache key will change if the sources of the
> elements change (which is the most important use case IMO).
>

The trick to that is the ReferencedSpeculativeActions, which backfill via
the elements that have a dependency on the changed element.

> >    - any projects you'd like to try this on
>
> gnome-build-meta seems like an ideal target. We try to build with the
> latest main/master of every GNOME project at least once per day, and
> every time we upgrade glib everything needs to be rebuilt including
> two instances of WebKitGtk.


That seems like an excellent use case.


> However it doesn't have access to a
> "potentially idle large build grid", so I'm not sure if we can
> actually implement this.
>

I'm sure we can do some experimentation by standing up some cloud resources
and tearing them down once we have our findings.  I'm curious enough about
this to put some personal $$ towards a cloud provider.

> Anyway, these are some quick thoughts (as quick as the complexity of
> the proposal allows). I'll keep thinking about it, I may have more
> ideas.
>

I've been trying to refamiliarize myself with the BuildStream internals to
figure out how to integrate the proposal.  I've put a potential
implementation approach as an issue comment:
https://github.com/apache/buildstream/issues/2083#issuecomment-3454136984.
You can potentially skip ahead to "Scheduler Queue Flow"; I'd love to
validate whether that is a sensible approach.  Note that it describes a
functionally complete implementation; however, I intend to limit it
initially to SOURCE type overlays.  I am aware that the implementation is
performance sensitive if this is to work well.

I'm happy to take suggestions for improvements or clarification, or to
answer clarifying questions on this subject.

>
> Cheers,
>
> Abderrahim
>

Thanks a lot for taking the time to read the proposal and for your
thoughtful reply.

Cheers,

Sander


>
> [1]
> https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/merge_requests/22859
> [2] https://gitlab.gnome.org/GNOME/gnome-build-meta/
>
