Hi,

On Tue, Nov 3, 2020 at 10:02 AM Tristan Van Berkom <
[email protected]> wrote:

> Hi all,
>
> This proposal seeks to find a practical solution the problem I raised
> earlier this year[0], where Element authors have mistakenly been given
> permission to write directly into the sandbox.
>
> There are a few tradeoffs here and the changes proposed here are
> significant, but the changes overall should result in increased
> reliability, while at the same time granting a compromise which allows
> constructs which write into the sandbox.
>
> I should also note that this is probably one of the last major API
> changes we need to get through in order to unblock the final stages of
> 2.0, in my opinion at least (there is other ongoing work and APIs which
> need review, proposals which need to be written but not necessarily
> implemented, which are on the major blocker list).
>

Sidenote: I do think the Source plugin API change will be another fairly
major one.
https://gitlab.com/BuildStream/buildstream/-/issues/1274
https://gitlab.com/BuildStream/buildstream/-/issues/1275


> This proposal consists of a handful of changes, so I will avoid
> drilling too deeply into the finer details, but instead try to explain
> what these changes are, how they are intertwined, and how they benefit
> us.
>
>
> The proposal
> ------------
>
> The main changes are:
>
>   * Make public data immutable, removing Element.set_public_data() and
>     APIs which allow public data to be mutated.
>

+0.


>   * Make batching of commands issued in the sandbox mandatory.
>

+1.


>   * Force computation of the sandbox commands (if any) to be performed
>     early on, in advance of cache key calculation.
>

It would be good to know what can and cannot be accessed/taken into account
at the time of command computation.


>   * Expose a special utility function which allows plugin provided
>     content to be placed into the sandbox at a specified path.
>
>     Like sandbox commands, the content must be computed early on.
>

Same.


>
> A few words on each of these:
>
>
> Immutable public data
> ~~~~~~~~~~~~~~~~~~~~~
> Since public data can be modified by a build, and since the element's
> build phase is the only point at which we can be certain that all
> dependencies have been built: it is currently an invalid operation to
> read public data from dependencies before the element's build phase.
>
> This is in itself problematic as it is not entirely clear in the docs,
> and there are no assertions in place to clearly enforce this invariant.
>
> It is however desirable to be able to read public data off of
> dependency elements in order to construct sandbox commands, or to
> construct data when needs to be written into the sandbox by an Element
> plugin.
>
>
> Mandatory sandbox command batching
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> This idea has already been in discussion with regards to improving
> performance in remote execution scenarios[1], we should maximize on
> batching wherever possible.
>
> The reason why batching was never mandatory is that an element could
> theoretically need to observe the results of a command in order to
> decide what commands to run next, however we have not observed this
> kind of construct in action before, and we think it is reasonable to
> enforce this.
>
>
> Early assembly of sandbox commands
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> This will require an API change such that commands should be
> constructed early on, probably directly at Plugin.configure() time.
>
> Element will gain some API for command list to be created early on, and
> these commands will be used automatically by the Element base class to
> feed into the cache key.
>
> Given immutability of public data, public data is still a suitable
> candidate for constructing commands (as we do for integration commands
> for example).
>
> The key difference here is that:
>
>   * Plugins can use host python to construct commands in whatever way
>     they please, they should of course be careful about constructing
>     commands deterministically.
>
>   * The result of command construction feeds directly into the cache
>     key, which means that if a plugin ever fails to generate it's
>     commands deterministically, then the worst that can happen is
>     an undesirable rebuild - new cache keys are always generated
>     for any changes in the actual input.
>
>
> Utility for writing into the sandbox
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> The Element will provide a function for writing files into the sandbox,
> and the plugin can generate that content in whatever way it pleases,
> however, like sandbox commands, that content must be computed early on
> before cache key calculation, and the content itself will be
> automatically taken into account in the Element's cache key.
>

The "before cache key calculation" I can agree with, but the idea that no
access to dependencies is possible, I am not convinced on.


> This allows for a similar compromise as we do with sandbox commands
> above, the plugin is given freedom to generate content dynamically, but
> if and when the plugin fails to do this deterministically, then it will
> result in the generation of new cache keys.
>
> I.e. if for whatever reason you could not generate content
> deterministically, you will suffer rebuilts; but you will always have a
> unique cache key for each content blob you decided to write into the
> sandbox.
>

That seems fair and logical.


>
> Side effects
> ------------
> So far, there is one major side effect I can see to this whole plan,
> and while it may change the way we think about how elements interact,
> so far I think it is a good "course correction".
>
> This side effect is that generally, as far as an Element is concerned,
> an Element cannot do anything at all differently based on the results
> of a previously built artifact.
>

This seems overly restrictive.  Especially given that cache keys of
dependency elements are taken into account in calculating the element's
cache key.  And on top the generated content is taken into account as
well.  For determinism there is no difference.
Now, you could argue that this is done for performance reasons, but I would
argue that one can mitigate that.  Elements that do require access to their
dependencies could have an indicator that they need this (e.g.
BST_REQUIRES_DEPENDENCY_ARTIFACTS).  Furthermore, we could potentially
suffice with fetching the Artifact proto, and only allowing access to
metadata via APIs, ie. the directory structure, but not the content of
files.


> While we lose the idea that the python code of an element can do things
> differently based on previously built artifacts, we can still of course
> have elements generate sandbox commands which will behave differently
> depending on what has already been built (which is inevitably true
> anyway, build scripts will generally behave differently depending on
> what is already staged).
>
> This means that any interaction between elements and data which needs
> to be "carried forward" so to speak, *must* absolutely be done within
> the sandbox, which will certainly affect some elements, like dpkg_build
> and dpkg_deploy for instance. These plugins will need to be changed
> such that they no longer attempt to carry data over through mutable
> public data, but instead they should place any data they need to carry
> forward into a well known location in the resulting artifact, so that
> reverse dependencies can work with that later on.
>
>
> Any thoughts on this proposal ?
>

A little later that I desired, but hopefully clear enough to take into
account.


> Cheers,
>     -Tristan
>

Cheers,

Sander


>
>
> [0]:
> https://lists.apache.org/thread.html/re3055975198fa5115b7ce3f533e58518443976b6ae2b13c3be15c881%40%3Cdev.buildstream.apache.org%3E
> [1]: https://gitlab.com/BuildStream/buildstream/-/issues/1293
>
>
>

Reply via email to