Hi all,

Today, while enabling some extra logging paths, I noticed this new API
which is... quite confusing.

It looks like this is the tail end of a workstream essentially focused
on staging workspaces through CAS (if I'm not mistaken), I found this
was initially introduced via:

  https://gitlab.com/BuildStream/buildstream/-/issues/1161
  https://gitlab.com/BuildStream/buildstream/-/merge_requests/1651

If I understand correctly, the presence of buildbox-casd makes it
possible for recursive directory checksum calculations to be optimized
when performed repeatedly (otherwise I don't understand the motivation
to replace the checksumming with CAS digests here).

In any case, this has resulted in the following API on Source:

    BST_KEY_REQUIRES_STAGE = False
    """Whether the source will require staging in order to efficiently
       generate a unique key.
    """

This docstring raises some concern to me.


This is extremely confusing and under documented
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Plugins which set this flag do not implement Plugin.get_unique_key(),
which is a bit of a contradiction.

  * All plugins are expected to implement `get_unique_key()`, there is
    no exception to this explained in detail in the `get_unique_key()`
    documentation.

  * The documentation of this flag does not say anything about how
    the unique key will be resolved.

It turns out, the core does some side stepping of `get_unique_key()` in
the case this flag is specified, and unconditionally stages the source
into CAS and returns the digest from `Source._stage_into_cas()`.


This is a bad regression waiting to happen
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Reading the docstring in question sends a strange message to Source
plugin developers, which is that Sources are allowed to have their
payload present locally in order to derive a cache key.

The above would be a wildly incorrect presumption, as we know well, a
BuildStream project can only ever lack cache keys when sources have
unresolved refs (and they appear as "no reference" in `bst show`, so we
need to manually set them or `bst source track` them).

I think this is a precious invariant we don't want to risk losing.


Added cognitive complexity in source.py
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There is a large cost in terms of cognitive complexity around the
Source.BST_KEY_REQUIRES_STAGE flag.

There is a functional value to preserve here, but the complexity of
special casing this flag is not necessary to achieve this
functionality.

Let's list the branch statements below to prove the point:

  * We don't call `Source.track()` if it's set.

    However, local sources and workspaces can continue to use a no-op
    here, that is more clear.

  * We add `Source._is_trackable()` for this, just to allow skipping
    running a scheduler job for a source where there will be a no-op.

  * We treat the source specially in `Source._track()` for the case
    that it happens to be called (amongst other element sources), just
    to ensure it has a cache key.

All of the above seems completely pointless, just implement plugin
methods with noops and be done with it, it's better to have the core
treat the plugins in an unbiased fashion than to add switches.

The remaining points:

  * We leverage the CAS to calculate `Plugin.get_unique_key()`

  * We treat it specially at stage time, i.e. we stage within
    CAS because we already staged it at cache key calculation
    time.


Effectively, we half the checksumming for all local data with
this, because we would otherwise sha256sum() the files on disk
only to later add them to the CAS where they will be checksummed again.

Lets deal with only this valuable bit in the proposal below...


Proposed changes
~~~~~~~~~~~~~~~~
The aforementioned functionality should be achievable by the core
offering two utility functions to source plugins (these would be Source
methods, so they would have access to internals like Context, the CAS,
etc).

    def cache_file_or_directory(path: str) -> str:
        """Caches a file or directory in the BuildStream cache and
           returns the digest of the cached directory.
        """

    def stage_cached_directory(sandbox: Sandbox, digest: str, directory: str):
        """Stage a previously cached directory into the sandbox.
        """

In this way, we should be able to only use these in the workspace
mechanics and in the local plugin, remove the documented flag on
Source, remove all the special casing from source.py, and let the core
operate on these sources in an unbiased fashion, removing all the
additionally incurred cognitive complexity added by this flag.

To sweeten the pot further, we could even make these private, since
they are only used by core workspace mechanics and the core "local"
plugin.

Thoughts ?

Cheers,
    -Tristan


Reply via email to