I want to point out that the RPM example is not correct. RPMs are unique in Pulp by checksum (aka pkgId in our code and createrepo_c):
https://github.com/pulp/pulp_rpm/blob/44f97560533379ad8680055edff9c3c5bd4e859f/pulp_rpm/app/models.py#L223

Therefore Pulp can store two packages with the same
name-epoch-version-release-arch (NEVRA), as would be the case when there
is a signed and an unsigned RPM with the same NEVRA.

David

On Thu, Nov 8, 2018 at 4:16 PM Simon Baatz <gmbno...@gmail.com> wrote:

> On Tue, Nov 06, 2018 at 11:40:35AM -0500, Brian Bouterse wrote:
> > These are great questions. I'll try to keep my responses short to
> > promote more discussion.
> >
> > On Mon, Nov 5, 2018 at 3:21 PM Simon Baatz <gmbno...@gmail.com>
> > wrote:
> >
> >     I apologize for the lengthy post, but I did not know where to
> >     file an issue for this. It is a generic problem affecting most
> >     Pulp 3 plugins.
> >
> >     I have been puzzled for some time now about the natural keys
> >     used for content in plugins. Examples are:
> >
> >     pulp_python: 'filename'
> >     pulp_ansible: 'version', 'role' (for role: 'namespace', 'name')
> >     pulp_rpm (RPM package): 'name', 'epoch', 'version', 'release',
> >                             'arch', 'checksum_type', 'pkgId'
> >     pulp_cookbook: 'name', 'version'
> >
> >     These look like keys that make sense for content in a single
> >     repo (version), but not necessarily for content in a per plugin
> >     pool of content. In an ideal world, these keys are globally
> >     unique, i.e. there is only a single "utils-0.9.0" Python module
> >     world-wide that refers to the same artifacts as the
> >     "utils-0.9.0" module on PyPI. But, as far as I know, the world
> >     is far from ideal, especially in an enterprise setting...
> >
> > Agreed. This uniqueness is what allows Pulp to recognize and
> > deduplicate content in its database. On the filesystem the content
> > addressable storage will store identical assets only once, but if
> > Pulp couldn't recognize "utils-0.9.0" from one repo as the same as
> > "utils-0.9.0" from another, then each sync/upload would create all
> > new content units.
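The deduplication described above can be sketched roughly as follows. This is only an illustration, not Pulp's implementation: `Unit`, `ContentPool` and `get_or_create` are hypothetical names, and the pool is keyed by the natural key alone (here `filename`, as listed for pulp_python), while the digest is deliberately left out of the key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    filename: str  # natural key, as listed for pulp_python
    sha256: str    # artifact digest; NOT part of the key in this sketch


class ContentPool:
    """Hypothetical per-plugin content pool keyed by natural key only."""

    def __init__(self):
        self._by_key = {}

    def get_or_create(self, unit):
        # Return (stored_unit, created). A hit on the natural key wins,
        # regardless of what the incoming unit's digest says.
        existing = self._by_key.get(unit.filename)
        if existing is not None:
            return existing, False
        self._by_key[unit.filename] = unit
        return unit, True


pool = ContentPool()
first, created_first = pool.get_or_create(Unit("utils-0.9.0.tar.gz", "aaa"))
# Same key, same digest: recognized and deduplicated.
again, created_again = pool.get_or_create(Unit("utils-0.9.0.tar.gz", "aaa"))
# Same key, different digest: also treated as "already present".
other, created_other = pool.get_or_create(Unit("utils-0.9.0.tar.gz", "bbb"))
```

The second call is the deduplication Brian describes. The third call is the flip side: an unrelated artifact that happens to share the key is silently mapped to the unit stored first.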
> >
> >     With the current implementation, the following scenarios could
> >     happen if I got it right:
> >
> >     1. In Acme Corp, a team develops a Python module/Ansible
> >        role/Chef cookbook called "acme_utils", which is part of a
> >        repo on a Pulp instance. Another team using different repos
> >        happens to choose the same name for their unrelated utility
> >        package. They may not be able to create a content unit if
> >        they use e.g. the same version or file name.
> >
> > I agree this is an issue
> >
> >     2. A team happens to choose a name that is already known in
> >        PyPi/Galaxy/Supermarket. (Or, someone posts a new name on
> >        PyPi/Galaxy/Supermarket that happens to be in use in the
> >        company for years). Then, as above, the team may not be able
> >        to create content units for their own artifacts.
> >
> > I agree this is an issue
> >
> >     Additionally, *very ugly* things may happen during a sync. The
> >     current QueryExistingContentUnits stage may decide that, based
> >     on the natural key, completely unrelated content units are
> >     already present. The stage just puts them into the new repo
> >     version.
> >
> > I agree this is an issue
> >
> >     Example for pulp_python:
> >
> >     Somebody does something very stupid (or very sinister):
> >
> >     (The files "Django-1.11.16-py2.py3-none-any.whl" and
> >     "Django-1.11.16.tar.gz" need to be in the current directory.)
> >
> >     export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/ file@./Django-1.11.16-py2.py3-none-any.whl | jq -r '._href')
> >     http POST :8000/pulp/api/v3/content/python/packages/ artifact=$ARTIFACT_HREF filename=Django-2.0-py3-none-any.whl
> >     export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/ file@./Django-1.11.16.tar.gz | jq -r '._href')
> >     http POST :8000/pulp/api/v3/content/python/packages/ artifact=$ARTIFACT_HREF filename=Django-2.0.tar.gz
> >
> > Yes, this is a problem, and here are some related thoughts.
> > Core provides these generic CRUD URLs so that plugin writers could
> > get away with never writing a "one-shot" viewset that receives and
> > parses a content unit via upload in one call. Using "one-shot"
> > uploaders stops receiving untrusted metadata from the user (as in
> > your example), but unless the units coming in are also signed with
> > a trusted key, the data of the file being uploaded could have been
> > altered. Also the same user likely configured that trusted key.
> >
> >     Somebody else wants to mirror Django 2.0 from PyPi
> >     (version_specifier: "==2.0"):
> >
> > I think you've gotten to the crux of the issue here ... "someone
> > else". Pulp is not currently able to handle real multi-tenancy. A
> > multi-tenancy system would isolate each user's content or provide
> > access to content via RBAC. We have gotten requests for
> > multi-tenancy from several users who list it as a must-have. I want
> > to frame this "user-to-user" sharing problem as actually a
> > multi-tenancy problem.
> >
> >     http POST :8000/pulp/api/v3/repositories/ name=foo
> >     export REPO_HREF=$(http :8000/pulp/api/v3/repositories/ | jq -r '.results[] | select(.name == "foo") | ._href')
> >     http -v POST :8000/pulp/api/v3/remotes/python/ name='bar' url='https://pypi.org/' 'includes:=[{"name": "django", "version_specifier":"==2.0"}]'
> >     export REMOTE_HREF=$(http :8000/pulp/api/v3/remotes/python/ | jq -r '.results[] | select(.name == "bar") | ._href')
> >     http POST :8000$REMOTE_HREF'sync/' repository=$REPO_HREF
> >
> >     Now the created repo version contains bogus content (Django
> >     1.11.16 instead of 2.0):
> >
> >     $ http :8000/pulp/api/v3/repositories/1/versions/1/content/ | jq '.["results"] | map(.version, .artifact)'
> >     [
> >       "1.11.16",
> >       "/pulp/api/v3/artifacts/1/",
> >       "1.11.16",
> >       "/pulp/api/v3/artifacts/2/"
> >     ]
> >
> >     A "not so dumb" version of this scenario may happen by error
> >     like this:
> >
> >     export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/ file@./Django-1.11.15-py2.py3-none-any.whl | jq -r '._href')
> >     http POST :8000/pulp/api/v3/content/python/packages/ artifact=$ARTIFACT_HREF filename=Django-1.11.15-py2.py3-none-any.whl
> >     #Forgot to do this: export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/ file@./Django-1.11.16-py2.py3-none-any.whl | jq -r '._href')
> >     http POST :8000/pulp/api/v3/content/python/packages/ artifact=$ARTIFACT_HREF filename=Django-1.11.16-py2.py3-none-any.whl
> >
> >     From now on, no synced repo version on the same Pulp instance
> >     will have a Django 1.11.16 wheel.
> >
> > Similar observation here. If Pulp were a multi-tenant system, only
> > that 1 user would have the screwed up content.
> >
> >     3. A team releases "module" version "2.0.0" by creating a new
> >        version of the "release" repo. However, packaging went wrong
> >        and the release needs to be rebuilt.
> >        Nobody wants to use version "2.0.1" for the new shiny
> >        release, it must be "2.0.0" (the version hasn't been
> >        published to the outside world yet). How does the team
> >        publish a new repo version containing the re-released
> >        module? (The best idea I have is: the team needs to create a
> >        new version without the content unit first. Then, find _all_
> >        repo versions that still reference the content unit and
> >        delete them. Delete orphan content units. Create the new
> >        content unit and add it to a new repo version).
> >
> > Yes, this is the same process we imagined users would go through.
> > If version "2.0.0" is stored in multiple repos or repo versions,
> > then to fully remove the bad one it's unavoidable to unassociate it
> > from all repos and then run orphan cleanup. This process is also
> > motivated by the use case I call "get this unit out of here" which
> > is a situation like shellshock where: "we know this unit has a CVE
> > in it, it's not safe to store in Pulp anymore". In this area I
> > can't think of a better way since removing-republishing a unit in a
> > fully-automated way could have significant unexpected consequences
> > on published content. It's probably do-able but we would need to be
> > careful.
> >
> >     4. A Pulp instance contains unsigned RPM content that will be
> >        signed for release. It is not possible to store the signed
> >        RPMs on the same instance. (Or alternatively, someone just
> >        forgot to sign the RPMs when importing/syncing. They will
> >        remain unsigned on subsequent syncs even if the remote repo
> >        has been fixed.)
> >
> > I agree this is an issue, and we absolutely need to support the
> > workflow.
> >
> >     (I did not check the behavior in Pulp 2, but most content types
> >     have fields like checksum/commit/repo_id/digest in their unit
> >     key.)
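To make scenario 4 concrete in light of the correction at the top of this message (RPMs are unique by checksum, i.e. pkgId), here is a small sketch. The `Rpm` tuple and both key functions are illustrative only, and the package values are made up; it is not the pulp_rpm model.

```python
from collections import namedtuple

# Field names follow the pulp_rpm natural key quoted earlier in the thread.
Rpm = namedtuple("Rpm", "name epoch version release arch checksum_type pkgId")


def nevra_key(rpm):
    """Key based on NEVRA only (hypothetical, for comparison)."""
    return (rpm.name, rpm.epoch, rpm.version, rpm.release, rpm.arch)


def checksum_key(rpm):
    """Key based on the package checksum (pkgId)."""
    return (rpm.checksum_type, rpm.pkgId)


# Signing changes the file, hence the checksum, but not the NEVRA.
unsigned = Rpm("acme-utils", "0", "2.0.0", "1", "noarch", "sha256", "1111")
signed = Rpm("acme-utils", "0", "2.0.0", "1", "noarch", "sha256", "2222")

same_nevra = nevra_key(unsigned) == nevra_key(signed)
distinct_units = len({checksum_key(unsigned), checksum_key(signed)})
```

With a NEVRA-only key the two packages would collide; with the checksum in the key they are two distinct units and can live in the same pool.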
> >
> >     Before discussing implementation options (changing key, adapt
> >     sync), I have the following questions:
> >
> >     - Is the assessment of the scenarios outlined above correct?
> >
> > Yes. The thing to keep in mind through all this is that Pulp needs
> > to compose repos which when presented to a client, e.g. pip, dnf,
> > etc don't contain the same package twice. So in many ways the
> > uniqueness is about playing that game up front during upload/sync
> > and not on the backend during publish time. If it's important to do
> > then doing it early I think is key.
>
> Yes, it is. But the Pulp plugins mentioned above play this game with
> tougher rules than actually required. They enforce that any subset
> of the entire content pool (for a plugin) plays along these rules,
> not just the content I am syncing or putting into a repo version.
>
> This makes it simpler for Pulp (plugins), as they do not need to
> ensure constraints when building a new repo version (depending on the
> content type there may be constraints outside of the data model that
> need to be ensured). But from the perspective of a user this may
> lead to very surprising behavior across repositories and
> repository versions (e.g. although repo A and D are perfectly
> consistent on a per repo view, repo A does not sync anymore because
> repo D happened to have a python module with the same filename in a
> version from two months ago).
>
> (Interestingly, pulp_file plays with relaxed rules that do not ensure
> that a repo version can actually be published without clashing
> filenames. OTOH, cross repo effects cannot happen there)
>
> >
> >     - Do you think it makes sense to support (some of) these use
> >       cases?
> >
> > Yes
> >
> >     - If so, are there plans to do so that I am not aware of?
> >
> > No, except that I believe we need to consider multi-tenancy as a
> > possible solution. There are no plans or discussion on that yet
> > (until now!).
> >
> > I hope the plugin API post GA introduces some signing feature
> > allowing users to integrate pulp w/ local and message based signing
> > options. This is related to your RPM signing point above
>
> Although the scenarios outlined above partly are in a multi-tenancy
> setting, I don't think that missing support for multi-tenancy is at
> the core of the problem. You are right in saying that multi-tenancy
> requires isolation on content level. But even without multi-tenancy
> (i.e. with full access to all content), I expect a repo manager like
> Pulp to provide isolation on repo level:
>
> 1. Unrelated content from other repos must not become visible in a
>    repo
>
>    If I sync two repos from different remotes, I expect the local
>    repo versions to be mirrors of the respective upstreams. I don't
>    expect to find content from repo 1 in the mirror of repo 2 just
>    because it resembles the actual content of repo 2 on a meta-data
>    level.
>
>    Especially, if the remotes provide cryptographic checksums in
>    their meta data, I can't find any good justification why Pulp
>    should just decide to ignore it and add unrelated content to a
>    synced repo version.
>
> 2. Content of other repo versions does not impact my ability to
>    create a repo version (as long as the created repo version is
>    consistent)
>
> Basically, there are constraints on two levels:
>
> 1. Uniqueness constraints on overall content
> 2. Uniqueness constraints on content of a repo version
>
> Pulp core currently has no direct support for 2 AFAIK. Some plugins
> seem to enforce these constraints on level 1, possibly affecting all
> repo versions. 'pulp_file' avoids the latter by having more lenient
> constraints on level 1, but it has no level 2 constraints and, thus,
> does not ensure that a repo version is publishable.
>
> Maybe we need support for repo version constraints?
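One way to picture the two constraint levels is the sketch below. It assumes, purely for illustration, that level 1 is a digest-based uniqueness check over the whole pool and level 2 a filename check within a single repo version; the function names and the dict layout are hypothetical and not part of Pulp core or any plugin.

```python
from collections import Counter


def level1_duplicates(pool):
    """Level 1: digests that occur more than once in the overall pool."""
    counts = Counter(unit["sha256"] for unit in pool)
    return sorted(d for d, n in counts.items() if n > 1)


def level2_clashes(repo_version):
    """Level 2: filenames that occur more than once in one repo version."""
    counts = Counter(unit["filename"] for unit in repo_version)
    return sorted(f for f, n in counts.items() if n > 1)


# Two unrelated packages that happen to share a filename (scenario 1).
team_a = {"filename": "acme_utils-1.0.tar.gz", "sha256": "aaa"}
team_b = {"filename": "acme_utils-1.0.tar.gz", "sha256": "bbb"}

pool = [team_a, team_b]        # both may exist in the pool
repo_a_v1 = [team_a]           # consistent on its own
repo_d_v1 = [team_b]           # consistent on its own
bad_version = [team_a, team_b]  # not publishable: filename clash
```

Under these assumptions `level1_duplicates(pool)` is empty, so both teams can store their package, while `level2_clashes(bad_version)` reports the filename that would make a single repo version unpublishable. Repos A and D do not affect each other.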
>
> _______________________________________________
> Pulp-dev mailing list
> Pulp-dev@redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev