Hi all, Today, while enabling some extra logging paths, I noticed this new API which is... quite confusing.
It looks like this is the tail end of a workstream essentially focused on staging workspaces through CAS (if I'm not mistaken), I found this was initially introduced via: https://gitlab.com/BuildStream/buildstream/-/issues/1161 https://gitlab.com/BuildStream/buildstream/-/merge_requests/1651 If I understand correctly, the presence of buildbox-casd makes it possible for recursive directory checksum calculations to be optimized when performed repeatedly (otherwise I don't understand the motivation to replace the checksumming with CAS digests here). In any case, this has resulted in the following API on Source: BST_KEY_REQUIRES_STAGE = False """Whether the source will require staging in order to efficiently generate a unique key. """ This docstring raises some concern to me. This is extremely confusing and under documented ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Plugins which set this flag do not implement Plugin.get_unique_key(), which is a bit of a contradiction. * All plugins are expected to implement `get_unique_key()`, there is no exception to this explained in detail in the `get_unique_key()` documentation. * The documentation of this flag does not say anything about how the unique key will be resolved. It turns out, the core does some side stepping of `get_unique_key()` in the case this flag is specified, and unconditionally stages the source into CAS and returns the digest from `Source._stage_into_cas()`. This is a bad regression waiting to happen ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Reading the docstring in question sends a strange message to Source plugin developers, which is that Sources are allowed to have their payload present locally in order to derive a cache key. The above would be a wildly incorrect presumption, as we know well, a BuildStream project can only ever lack cache keys when sources have unresolved refs (and they appear as "no reference" in `bst show`, so we need to manually set them or `bst source track` them). I think this is a precious invariant we don't want to risk losing. Added cognitive complexity in source.py ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ There is a large cost in terms of cognitive complexity around the Source.BST_KEY_REQUIRES_STAGE flag. There is a functional value to preserve here, but the complexity of special casing this flag is not necessary to achieve this functionality. Let's list the branch statements below to prove the point: * We don't call `Source.track()` if it's set. However, local sources and workspaces can continue to use a no-op here, that is more clear. * We add `Source._is_trackable()` for this, just to allow skipping running a scheduler job for a source where there will be a no-op. * We treat the source specially in `Source._track()` for the case that it happens to be called (amongst other element sources), just to ensure it has a cache key. All of the above seems completely pointless, just implement plugin methods with noops and be done with it, it's better to have the core treat the plugins in an unbiased fashion than to add switches. The remaining points: * We leverage the CAS to calculate `Plugin.get_unique_key()` * We treat it specially at stage time, i.e. we stage within CAS because we already staged it at cache key calculation time. Effectively, we half the checksumming for all local data with this, because we would otherwise sha256sum() the files on disk only to later add them to the CAS where they will be checksummed again. Lets deal with only this valuable bit in the proposal below... Proposed changes ~~~~~~~~~~~~~~~~ The aforementioned functionality should be achievable by the core offering two utility functions to source plugins (these would be Source methods, so they would have access to internals like Context, the CAS, etc). def cache_file_or_directory(path: str) -> str: """Caches a file or directory in the BuildStream cache and returns the digest of the cached directory. """ def stage_cached_directory(sandbox: Sandbox, digest: str, directory: str): """Stage a previously cached directory into the sandbox. """ In this way, we should be able to only use these in the workspace mechanics and in the local plugin, remove the documented flag on Source, remove all the special casing from source.py, and let the core operate on these sources in an unbiased fashion, removing all the additionally incurred cognitive complexity added by this flag. To sweeten the pot further, we could even make these private, since they are only used by core workspace mechanics and the core "local" plugin. Thoughts ? Cheers, -Tristan
