Hi Sander,

On Sat, 2020-11-07 at 21:45 +0100, Sander Striker wrote:
> Hi Tristan,
[...]
> 
> I wanted to be clever but coming up blank with train analogies... :-)

Better luck next time :)

> That
> said, please take the below as serious questions, not as a troll.
> 
> 
> > [...]
> > > Maybe a more dynamic way is to have functions like in Element:
> > > 
> > > def requires_artifact_metadata_for_cachekey_calculation(self),
> > >  return [] # list of elements the artifact metadata is required for
> > > 
> > > That could make the pipeline slightly more complicated, but it wouldn't
> > > come at the expense of correctness not performance I think.
> > 
> > Cache keys are derived from project input _only_, we never require the
> > output of an artifact in any way as input to a cache key.
> > 
> > Cache keys can be computed immediately as soon as the project (and
> > subprojects) are loaded, and before ever needing to download any sources or
> > build anything.
> > 
> 
> I know that is how we have been doing things - I don't think it's a bad
> idea to reiterate the why every once in a while including what the benefit
> is to the user.

I wouldn't frame it as arbitrarily as "how we have been doing things";
rather, I see this as a cornerstone rule built deep into the
foundations of the architecture, a rule from which we derive
simplicity, robustness and reliability.

That is of course the crux of why it was difficult to reply to your
original message: I see this as an immovable object, and you don't.

I do stand by the importance of this rule, but acknowledge that it is
worthwhile to sometimes take a step back and question first principles
and laws of physics.

> > Putting aside that I would really not want to change the above, I don’t
> > even think we rationally can.
> > 
> > We need to know cache keys early on of course in order to download cached
> > artifacts, and we need the *target* cache keys first, in order to know if
> > we even need to build build dependencies.
> > 
> 
> You are correct that we would then end up needing to "unconditionally" pull
> down artifact _metadata_ for those of which data is used in cache key
> calculation.  And this metadata would already be present if you've built
> you project once, or when you've done a build once of an element that would
> trigger a download of the metadata.  After that beyond loading the project
> data, we'll be loading some artifact metadata.

This means that we can no longer purge build-only dependencies of
unchanging elements from the local cache or remote caches: we become
reliant on the build-only branches of the build graph in order to
determine the cache keys of toplevel elements.

Whether this is "only metadata" or entire artifacts is not
particularly relevant; either way it makes the whole procedure much
more fragile.

At best, we would have to build algorithms which preserve the metadata
of an artifact on remote servers but allow purging of the artifact
itself, in order to mitigate this fragility to some degree, and I think
that doing so already adds unjustifiable complexity.

Worse still, such complexity would be spread over moving parts
(remote services would need to be relied upon to behave in a specific
way regarding cache expiry, rather than us only requiring that a CAS
store exist somewhere).
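
To make the difference concrete, here is a rough sketch of the two
shapes of recursion (hypothetical names and a deliberately simplified
element model, not the real BuildStream internals):

  import hashlib
  import json
  from dataclasses import dataclass, field

  # A deliberately simplified element, for illustration only
  @dataclass
  class Element:
      config: dict                      # what the .bst file says
      build_deps: list = field(default_factory=list)

  # Today: a cache key is a pure function of project input, so the
  # recursion completes as soon as the element graph is loaded
  def cache_key(element):
      payload = {
          "config": element.config,
          "build-deps": [cache_key(dep) for dep in element.build_deps],
      }
      serialized = json.dumps(payload, sort_keys=True).encode("utf-8")
      return hashlib.sha256(serialized).hexdigest()

  # Proposed: the key additionally depends on metadata taken from the
  # *artifacts* of build dependencies, so the recursion can only
  # complete if those artifacts (or at least their metadata) are still
  # retained locally or on some remote
  def cache_key_requiring_metadata(element, fetch_metadata):
      payload = {
          "config": element.config,
          "build-deps": [
              cache_key_requiring_metadata(dep, fetch_metadata)
              for dep in element.build_deps
          ],
          "dep-metadata": [fetch_metadata(dep) for dep in element.build_deps],
      }
      serialized = json.dumps(payload, sort_keys=True).encode("utf-8")
      return hashlib.sha256(serialized).hexdigest()

In the first shape, the keys can be computed from the project checkout
alone; in the second, they cannot.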

> > I don’t think there is any way that we can have build results ever
> > contribute to cache keys, and I would hope to steer away from changes to
> > this.
> > 
> 
> I think I understand the desire, but I don't think I fully understand or
> appreciate the user value.  If you could help me understand that would be
> great.

I'm going to go off on a bit of a tangent here, but it's an important
one.

<tangent>

In my mind, the number one value we can provide to users is:

  * Anyone who receives your project data can build your project on any
    machine where you can run BuildStream, provided they have a new
    enough version of BuildStream to build your project.

  * You can pick up exactly the same project without modifying it 10
    years later, and reproduce exactly the same build with a new
    BuildStream.

To my mind, this long term stability and reliability is of higher value
than any given feature, and certainly more important than the comfort
level of plugin authors.

As such, an important goal is to reach a point where the features are
all locked in, bugs are fixed to an acceptable degree, and the overall
maintenance effort of the BuildStream project can be taken care of by
a single developer in their spare time - at which point that developer
needs to be extremely defensive, ensuring that new features are
implemented in such a way that a single maintainer can still afford to
maintain the project.

I think we both know that developer resources are ephemeral, and that
is why in free software we bet on simplicity and maintainability rather
than betting on availability of resources to put out fires in the
future.

I think that the architectural change under discussion severely
sacrifices simplicity, to an extent that it jeopardizes the longevity
of the project.

This tangent is meant to emphasize that the main benefit to a user is
that we provide them with a reliable tool.

</tangent>

Setting aside the above, there are expectations which benefit the user
and the maintainer alike:

  * I expect to see the cache keys at the beginning of my log file, and
    the fact that BuildStream knows the cache keys before doing
    anything increases my overall trust in the tool - it is doing
    something that is easily comprehensible, and since I understand it,
    I trust it.

  * I expect to be able to zap my cache or a remote cache entirely,
    without relying on automated cache expiry, and I have certain
    expectations of what that means.

    I don't want to be in a situation where I locally had some metadata
    which allowed me to download a remote artifact, but after zapping
    my local cache I can no longer download that remote artifact,
    because that remote may no longer have the intermediate artifact
    metadata required for my local client to derive the cache key of
    that remote artifact (because I only had the intermediate metadata
    locally in my cache).

  * I expect that a run of `bst show` can get me the cache keys without
    requiring any network activity, and if I have all the refs
    assigned, I expect there to definitely *be* a cache key.

  * Similarly, when writing extensions and doing any scripting around
    the BuildStream CLI, I expect my initial invocation of `bst show`
    to return cache keys, which I might be using for report data or
    for computing manifests (as the collect-manifest plugin currently
    does, though that should really be done as a wrapper script), or
    which I might be using as input for later `bst` invocations in my
    script.
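
For the scripting case, I mean something as mundane as this (a sketch;
assuming the usual %{name} and %{full-key} format symbols and a
hypothetical hello.bst):

  import subprocess

  # Collect the cache keys of an element and all of its dependencies,
  # derived purely from project input, without any network activity
  result = subprocess.run(
      ["bst", "show", "--deps", "all",
       "--format", "%{name} %{full-key}", "hello.bst"],
      capture_output=True, text=True, check=True,
  )
  keys = dict(line.split() for line in result.stdout.splitlines())

  # ... feed the keys into a report, a manifest, or later bst invocations

If cache keys can change underneath such a script depending on what
artifacts happen to be reachable, scripts like this stop being
reliable.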

The invariant that cache keys are only ever based on project input
gives them intrinsic value; changing this would depreciate that value,
as we would have a diminished ability to rely on them.


> > There is certainly more thought needed here if we need element/python
> > level conditionals based on built artifacts, but for tonight my thoughts
> > are:
> > 
> > * I’m not convinced we do need such conditionals in the plugin (such
> > conditionals can happen in the sandbox)
> > 
> 
> This is the point where I am still torn.  The part of "such conditionals
> can happen in the sandbox" specifically.  It's just not so trivial as it
> sounds - and if we want to keep things relatively cohesive to plugin
> authors, it basically advocates for python in the sandbox as well.  And
> then it's still questionable how to get the required data into the sandbox,
> without the introduction of additional elements in the project graph.  I am
> wondering if we are weighing a self-imposed limitation against user
> complexity.

I don't see it this way at all; rather, it was a mistake to have ever
made public data mutable in the first place, and the ways in which
existing plugins currently mutate it, like the debian_build/debian_deploy
case, are rather trivial to fix.

While this is a different case, consider `tar_element` in the
experimental plugins:

  * This plugin was written because it seemed much easier and
    more convenient: you don't rely on anything in particular being
    in the sandbox to interpret your instructions.

  * This is one of the clearest and most obvious violations of the
    sandbox: the presumption that a variable host Python version
    will generate the same tarball twice given the same input
    is clearly false (see the sketch below).
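
To illustrate what I mean (a sketch with made-up paths, not the actual
plugin code): creating the tarball with the host Python bakes host
details into the output unless you go out of your way to normalize
them, and none of those details are reflected in the cache key.

  import tarfile

  # Naive host-side tarball creation: the archive silently absorbs
  # host details - file mtimes, uid/gid/user names, the default
  # archive format (which changed from GNU to PAX in Python 3.8) -
  # none of which appear anywhere in the element's cache key
  with tarfile.open("output.tar", "w") as tar:
      tar.add("install-root/")

  # A reproducible variant has to normalize all of that explicitly
  def normalize(tarinfo):
      tarinfo.uid = tarinfo.gid = 0
      tarinfo.uname = tarinfo.gname = ""
      tarinfo.mtime = 0
      return tarinfo

  with tarfile.open("normalized.tar", "w", format=tarfile.GNU_FORMAT) as tar:
      tar.add("install-root/", filter=normalize)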

Take the autotools element, which performs conditional processing
depending on what is in the sandbox, but without doing so in the
plugin itself.

This processing is done on sources rather than on dependencies, but
the principle is the same (it could just as well be making conditional
decisions based on other files in the sandbox):

  * The element provides an algorithm to the sandbox.

  * This algorithm dictates that different things will be done
    depending on whether there is a generated configure script
    or not.

  * This algorithm requires the presence of certain dependencies
    in the sandbox in order to run, primarily it requires /bin/sh.

  * The algorithm itself is considered as a part of the cache key.
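
Boiled down to a sketch (illustrative only, not the real element code
or sandbox API), the shape of it is: the conditional lives in shell
text which the element hashes into its own cache key and only ever
executes inside the sandbox.

  import hashlib
  import json

  # The "algorithm": plain shell text, conditional on what is present
  # in the sandbox at build time.  The plugin never evaluates it on
  # the host (these are made-up commands, not the autotools defaults)
  CONFIGURE_COMMANDS = [
      "if [ ! -x ./configure ]; then ./autogen.sh; fi",
      "./configure --prefix=/usr",
      "make",
  ]

  # The text of the algorithm contributes to the cache key...
  def configure_key():
      serialized = json.dumps(CONFIGURE_COMMANDS).encode("utf-8")
      return hashlib.sha256(serialized).hexdigest()

  # ...and is only ever *executed* inside the sandbox, which is why
  # the sandbox must provide /bin/sh (hypothetical sandbox API)
  def assemble(sandbox):
      for command in CONFIGURE_COMMANDS:
          sandbox.run(["/bin/sh", "-c", command])

The host only ever needs to know the text of the condition, never its
outcome.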

Don't get me wrong, it's not always considered "trivial" to keep the
data processing in the sandbox, and it is always viewed as an obnoxious
burden to have to depend on tooling in the sandbox in order to get
something done.

However, as highlighted in the `tar_element` example, it is certainly
always the right thing to do.

When you go on to say:

  "And then it's still questionable how to get the required data into
   the sandbox, without the introduction of additional elements in the
   project graph"

Of course you need to build a /bin/sh and install a given version of
autotools if you want to use the autotools build plugin; this is
expected, and is part and parcel of using a tool which enforces that
all filesystem data permutations happen within the controlled sandbox.

This (original) proposal tries to find the right balance of
convenience, allowing some plugins to write some simple data to files
in the sandbox, while also enforcing the rules a bit more strictly to
bring all plugins into line.

Also when you say:

  "it basically advocates for python in the sandbox as well..."

I think Python runs quite frequently in the sandbox, but I don't know
why a plugin author would choose to write Python to run in the
sandbox: what runs in the sandbox is more tightly coupled to the data
being processed, and has nothing to do with the language in which the
plugin itself was authored.

If for example, we had node.js specific project packaging elements
which needed to run dependency data conditional routines, then it would
seem rational to have the plugin stage a JS script to invoke in the
sandbox.

The most popular choice is currently shell script, as it is widely used
by BuildElement classes and the most likely thing to exist in a minimal
runtime.

> > * Any such conditionals cannot affect the cache key
> > 
> 
> If such conditionals were done outside the sandbox, they would have to as
> per the start of the thread.

If we really had to perform conditionals outside of the sandbox, it
would be best to find a way to hash the *algorithm* into the cache key
rather than depend on the result, in some way that is not host python
dependent.
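
If we did go there, I imagine it would have to look something like
this (purely hypothetical): the plugin contributes the verbatim text
of the routine it would run, never the routine's output, to its unique
key.

  import hashlib

  # The routine is carried as text (e.g. a script the plugin would
  # stage), so hashing it is stable regardless of the host python
  # version; the paths and commands here are made up
  ROUTINE = "\n".join([
      "#!/bin/sh",
      "if [ -e /usr/lib/libfoo.so ]; then echo with-foo; fi",
  ])

  def routine_key():
      return hashlib.sha256(ROUTINE.encode("utf-8")).hexdigest()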

However, I'm not convinced that we really have justification to do such
conditional processing outside of the sandbox.

Cheers,
    -Tristan

