gtristan commented on issue #1851: URL: https://github.com/apache/buildstream/issues/1851#issuecomment-1670587778
Hi, > > Ok so at a glance, I can see some pretty bad problems with your plugins... looking right now at https://gitlab.com/carbonOS/build-meta/-/blob/35f91b6d/plugins/gen-cargo-lock.py > > I'm sorry, but you looked at the wrong plugin!! `gen-cargo-lock` is a very special case, as I'll explain below. My other plugins (`gnu` and `go-vendor`) should be a lot more normally implemented > > Also, I didn't write this one! I got it from here: https://gitlab.com/BuildStream/bst-plugins-experimental/-/merge_requests/223 Just trying to be helpful, I think that once you have an understanding of the generally simple architecture, plugin writing becomes fairly intuitive - but I think our docs don't do a great job for new users to get into the flow. > > Your source does not properly implement load_ref(), get_ref(), set_ref(), get_unique_key(), is_cached() or is_resolved() > > This is because it doesn't need to. `gen-cargo-lock` is a source that is always immediately followed by a `cargo` source. The `Cargo.lock` file this plugin generates is used for exactly one purpose: by the `cargo` source while tracking. All the data contained in the `Cargo.lock` file gets turned into buildstream refs by the `cargo` source and put into `project.refs`, so there's no need for the `gen-cargo-lock` plugin to have a unique ID. At build time, `cargo` has all of its dependencies vendored and so it doesn't need to look at `Cargo.lock`. See [here in my `project.refs`](https://gitlab.com/carbonOS/build-meta/-/blob/6f57dc2e/project.refs#L732-1020) for an example of this Yeah, reading this I'm not really confident that this is going to work well. I understand your general logic but I do think this plugin steps out of bounds and will fall into scenarios that will break. There is *definitely* the issue of race conditions around the local caching of the `Cargo.lock` file in `Source.get_mirror_directory()` even if the intention is to treat this as a transiently needed file which is later ignored, parallel tracking of `gen-cargo-lock` sources in different elements will have them all writing to the same file. I'm not sure if the `cargo` plugin will generate the `Cargo.lock` file in the build directory which will likely be relevant at build time, and we could also be intentionally *overwriting* an existing `Cargo.lock` file in the external rust source tree that differs from the one we tracked. Finally, the `gen-cargo-lock` *is* responsible for staging the `Cargo.lock`, and in this setup the `Cargo.lock` will *only* exist on a machine where tracking previously occurred. Also I should point out that with BuildStream 2, the results of `Source.stage()` are cached in the CAS, which means that the unique key of the `gen-cargo-lock` source *by itself* is significant - subsequent runs of `bst source track` + `bst build` will likely end up re-staging an old staging result from previous runs due to the source cache key being unchanged and available in the CAS. I don't think there is any safe way to do this without either: * Serializing the `Cargo.lock` as you mentioned you had previously done, that sounds like the sanest approach * There is another alternative I considered for the sake of *"prettiness"* which is for example, if you have some additional infra * At `Source.track()` time, the resulting Cargo.lock could be uploaded to an artifactory instance as a file named after the `Cargo.lock` sha256sum * The sha256sum could be used as a ref * `Source.fetch()` would download the lock from artifactory This second approach is also totally viable, but I didn't want to suggest it because it involves additional infra / moving parts, so while it may be more *"pretty"* to look at the resulting bst files, it's not a win. [...] > > but I think a fundamental understanding needs to be explained somewhere in the documentation to make plugin authoring more intuitive, I'm not sure where in the docs or how best to explain this. > > I absolutely agree. `bst`'s plugin architecture definitely felt confusing/overwhelming when I first started writing plugins. And I made lots of the mistakes you detail here. I fixed them over time as people tried to build my project in their own clean environment and failed, and that taught me how the system works. But it'd be nice if it could be documented enough to get it right the first time :) > > A lot of what you wrote here, especially the "Overall reproducibility expectations" section, would be useful in the docs. Otherwise the plugin architecture feels a bit impenetrable. Some extra docs regarding URL translation (when should it be done? how? what does it mean to mark a URL? why does this need to be done?) would be nice too; that stuff still confuses me... > > It can probably just be top-level documentation for the `Source` class itself (i.e. [here](https://docs.buildstream.build/2.0/buildstream.source.html)). Worst-case, you can add to the Additional Writings section Yeah I'm not sure where to put this kind of thing, I think probably we need a solid *"Introduction"* which should be required reading for any of the plugin authoring documentation (elements & sources alike), which should help the further reading be more intuitive. On a side note, I am really happy to hear that you have been leveraging the plugin system successfully :) > > As a side note, every source which uses gen-cargo-lock in your project shares the same Cargo.lock file, because you have put it directly at the root of the mirror directory owned by the gen-cargo-lock source kind. > > Hmm so this might be a problem if multiple tracks are happening in parallel. Again, we only need to ferry the data from one source to the very next source, and then that one will turn it into an actual ref. My concern here is that if we have multiple cargo projects being tracked in parallel, they share the one file and trample each-other. Somes like I just need to incorporate the project and element name into the local cache name to avoid a race condition Right, I pre-followed up on just that above, sorry but I have to just scan through and reply ;-) Again, I don't think this approach is really viable. > > Consequently, your plugin risks introducing different data for the same cache key, risking the following scenario: > > This might be an issue if the build chain ends up looking at the `Cargo.lock` file anyway, despite not having a need to since all dependencies are already vendored. > > It would be nice if I could tell, from `stage`, if I'm staging for a track or for a build. If it's for a track I'd stage `Cargo.lock`, but if it's for a build I'd stage nothing to ensure consistency between builds Generally speaking, we don't want to give plugin methods more context than what is strictly necessary, as plugin methods are things we can later decide to call in different ways when refactoring the core and making core improvements - it is best to keep the APIs simple and strict. That said, there is no such context; sources are staged when the core asks the source to stage the data, when the *core* decides, this happens only if the `Source.is_cached()` condition is satisfied. I should point out that the `gen-cargo-lock` lies about `Source.is_cached()` and `Source.is_resolved()` because it pretends to be cached, and pretends to know what data will be cached respectively, when it in fact does not (i.e. when you call `bst build` *without* previously having called `bst source track`, it is inevitably busted). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
