Updated the redmine story https://pulp.plan.io/issues/6353 with this proposal. Feel free to comment there as well. If there are no objections, I'll start on a PoC once I finish the items I'm working on right now.

Thanks,
Tanya

On Wed, Oct 21, 2020 at 8:48 PM Tatiana Tereshchenko <ttere...@redhat.com> wrote:

> TL;DR: An attempt to propose the least invasive option to solve the case
> when remote repository metadata needs to be mirrored. Please provide
> feedback if you are interested in the outcome.
>
> There have been multiple attempts and discussions to solve the
> relative_path problem in a general way which covers multiple use cases.
> They all look very invasive and only possible to be done in Pulp 4+ due to
> the amount and significance of changes that need to be made, to the data
> models and/or to the API.
>
> The following proposal solves only this use case: As a user, I can mirror
> remote repository metadata as is.
> An additional goal is to avoid backward incompatible changes and ideally
> leave a way for further improvement to solve the problem in a more general
> way.
> (The following proposal does NOT solve the use case: As a user, I can have
> the same content under different relative paths in any repository.)
>
> Proposal:
> - Have a way to distinguish between repositories with managed content and
>   with the exact mirror (e.g. create a repository with exact_mirror=True or a
>   new dedicated repository type)
> - For such repos, create a publication at sync time (includes published
>   artifacts and metadata).
> - For such repos, publish is a no-op and always returns the existing
>   publication for the requested repo version.
> - For such repos, no modifications are allowed except the sync in mirror
>   mode. (At least for the RPM plugin, I believe we can't allow discrepancies
>   between metadata and content in a repo, especially if some content is
>   removed.)
>
> Pros:
> - non-invasive, only additive model changes
> - can be implemented in a plugin which needs it or it can be moved to the
>   pulpcore if it allows plugin input at certain points.
> - leaves a way for further improvement to handle a more general case, see
>   the full proposal here
>   https://hackmd.io/02KBjCD3Q0WP7p4ALwzhJw#relative_path-in-PublishedArtifact-only
>
> Cons:
> - doesn't solve the problem of various relative paths for the same
>   content in a general way
> - a separate code path (at times) to handle this type of repository.
>
> For reference:
> - hackmd doc with all the considered proposals and the summary of the
>   potentially valid ones https://hackmd.io/02KBjCD3Q0WP7p4ALwzhJw
> - pulpcon video with the discussion of the proposals
>   https://www.youtube.com/watch?v=7IzxAQYr5-I
>
> Thanks,
> Tanya
>
> On Thu, May 7, 2020 at 2:07 PM Brian Bouterse <bmbou...@redhat.com> wrote:
>
>> I agree with that problem statement. pulp_file may want to have the same
>> Content at two different paths in different RepositoryVersions (or even the
>> same RepositoryVersion). Without this capability a user could never "move"
>> where content lives in a RepositoryVersion if it's already been placed in
>> any other RepositoryVersion.
>>
>> Additionally pulp_maven may need to sync two repositories in the wild
>> that already contain the same content in two locations. I offer this as an
>> example not to pile on, but because it's a multi-content artifact which I
>> believe we will need to consider also as we work towards a solution.
>>
>> I've been spending time on developing a solution, but it needs more work
>> so it's not ready yet. Also other katello and galaxy_ng work continues to
>> pre-empt this, so it could take a while.
>>
>> On Thu, May 7, 2020 at 3:39 AM Matthias Dellweg <mdell...@redhat.com> wrote:
>>
>>> > Users need to be able to store the same content unit at different
>>> > relative paths in different repository versions. This problem is not unique
>>> > to the RPM plugin. Do we agree about that?
>>> Yes, we agree. In pulp_deb relative_path is part of the content's
>>> natural_key to circumvent this problem.
>>> So this creates two content units that only differ in relative_path.
>>> At least they share the artifact.
>>>
>>> On Thu, May 7, 2020 at 2:06 AM Dennis Kliban <dkli...@redhat.com> wrote:
>>>
>>>> I'd like to provide a little bit more context for my previous email by
>>>> going back to the original problem statement:
>>>>
>>>> On Wed, Apr 1, 2020 at 9:23 AM Daniel Alley <dal...@redhat.com> wrote:
>>>>
>>>>> Problem:
>>>>>
>>>>> Currently, a relative_path is tied to content in Pulp. This means that
>>>>> if a content unit exists in two places within a repository or across
>>>>> repositories, it has to be stored as two separate content units. This
>>>>> creates redundant data and potential confusion for users.
>>>>>
>>>>> As a specific example, we need to support mirroring content in
>>>>> pulp_rpm <https://pulp.plan.io/issues/6353>. Currently, for each
>>>>> location at which a single package is stored, we’ll need to create a
>>>>> content unit. We could end up with several records representing a single
>>>>> package. Users may be confused about why they see multiple records for a
>>>>> package and they may have trouble for example deciding which content unit
>>>>> to copy.
>>>>>
>>>> Users need to be able to store the same content unit at different
>>>> relative paths in different repository versions. This problem is not unique
>>>> to the RPM plugin. Do we agree about that?
>>>>
>>>> I've been working on a potential solution that solves this problem in a
>>>> document[0]. It is a complicated change and the document does not fully
>>>> capture the plan yet. Feedback and help on the design are welcome.
>>>>
>>>> [0] https://hackmd.io/02KBjCD3Q0WP7p4ALwzhJw?edit
>>>>
>>>>
>>>> On Mon, May 4, 2020 at 4:11 PM Dennis Kliban <dkli...@redhat.com> wrote:
>>>>
>>>>> I've reached two conclusions while trying to formulate a solution:
>>>>>
>>>>> This problem needs to be solved at the repository version level.
>>>>> Repository membership needs to be tracked at the artifact level, and not
>>>>> content level as it is now.
>>>>>
>>>>> On Thu, Apr 30, 2020 at 1:11 PM Daniel Alley <dal...@redhat.com> wrote:
>>>>>
>>>>>> Cool, so the only difference is whether to try to store the
>>>>>> relationship in the DB, or leverage the fact that we already have the
>>>>>> metadata and just re-parse it.
>>>>>>
>>>>>> I know the latter approach has yet to be written up, but my concern
>>>>>> there is that adding another layer of indirection between "repository
>>>>>> version" and "content" is going to have an adverse impact on performance,
>>>>>> since it is already the most complex and demanding query we issue to the
>>>>>> DB and one of the most common and important.
>>>>>>
>>>>>> On Thu, Apr 30, 2020 at 12:50 PM David Davis <davidda...@redhat.com> wrote:
>>>>>>
>>>>>>> Yes but I was imagining the mapping would be stored not as Content
>>>>>>> but as a separate object. So we wouldn't use filename for the mapping
>>>>>>> (rather we'd use ContentArtifact pk) and we wouldn't need to change
>>>>>>> ContentArtifact's relative_path at all. That said, I think your solution
>>>>>>> captures the idea though and is better in some ways.
>>>>>>>
>>>>>>> Changing the RepositoryContent model to point to ContentArtifacts
>>>>>>> and store relative_paths is probably the best and most correct solution
>>>>>>> in theory. However, it's going to be painful to implement for both core and
>>>>>>> plugins.
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 30, 2020 at 12:33 PM Daniel Alley <dal...@redhat.com> wrote:
>>>>>>>
>>>>>>>> @David Davis <davidda...@redhat.com> so this proposal would go
>>>>>>>> something like this, correct?:
>>>>>>>>
>>>>>>>> * For the signed metadata / exact mirror use-case we need to store
>>>>>>>>   the repository metadata itself as a content unit inside the
>>>>>>>>   RepositoryVersion anyway (because the hash must be equal)
>>>>>>>> * Because we have this metadata lying around, we can reference it
>>>>>>>>   at publish time to discover the appropriate
>>>>>>>>   PublishedArtifact.relative_path
>>>>>>>> * Create a map of "filename" -> "location_href" and look up the
>>>>>>>>   filename of each RPM package to find the appropriate path
>>>>>>>> * This should be pretty fast for the RPM plugin since
>>>>>>>>   createrepo_c is doing all the hard work
>>>>>>>> * Data migration to ensure ContentArtifact.relative_path is only
>>>>>>>>   storing the filename (and I would suggest we also change the name to
>>>>>>>>   "filename")
>>>>>>>> * If metadata isn't present in the RepositoryVersion, then just
>>>>>>>>   tweak the PublishedArtifact.relative_path so that it uses whichever our
>>>>>>>>   default repo layout is
>>>>>>>>
>>>>>>>> On Tue, Apr 28, 2020 at 11:41 AM David Davis <davidda...@redhat.com> wrote:
>>>>>>>>
>>>>>>>>> Yes, that's correct. During our meeting we discussed two options:
>>>>>>>>> the first was to extend RepositoryContent to store relative path per
>>>>>>>>> ContentArtifact as storing a relative_path per Content won't work for
>>>>>>>>> multi-Artifact Content units.
>>>>>>>>>
>>>>>>>>> An alternative that I pitched was to have plugins (or maybe even
>>>>>>>>> core someday) store this information outside RepositoryContent and
>>>>>>>>> then use this information during publishing to set relative_path on
>>>>>>>>> PublishedArtifacts.
>>>>>>>>> We'd have to modify the content app if we wanted to
>>>>>>>>> support pass through publications but I think asking plugins to use
>>>>>>>>> published artifacts in this case is warranted. That said, I don't
>>>>>>>>> think anyone else was keen on this idea.
>>>>>>>>>
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Apr 28, 2020 at 10:30 AM Matthias Dellweg <mdell...@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>> That is only used for passthrough publication afaik. If you
>>>>>>>>>> publish each content unit "by hand", you create a new relative path
>>>>>>>>>> for each published artifact. That is why it can be empty and the
>>>>>>>>>> content can still be published.
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 28, 2020 at 4:09 PM Daniel Alley <dal...@redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> We realized in our discussion that the original proposal
>>>>>>>>>>> described in my email will not work, because "relative_path"
>>>>>>>>>>> ultimately describes the path of the published *artifacts* (not content),
>>>>>>>>>>> and for content types with multiple artifacts, storing this
>>>>>>>>>>> information in a field on RepositoryContent would not be possible.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Apr 27, 2020 at 6:08 PM Daniel Alley <dal...@redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> There is a video call scheduled to discuss this issue tomorrow
>>>>>>>>>>>> (Tuesday April 28th) at 13:30 UTC (please convert to your local
>>>>>>>>>>>> time).
>>>>>>>>>>>> https://meet.google.com/scy-csbx-qiu
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Apr 25, 2020 at 7:02 AM David Davis <davidda...@redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I had a chance to think about this some more yesterday and
>>>>>>>>>>>>> wanted to email out my thoughts.
>>>>>>>>>>>>> I also think that this change sounds scary
>>>>>>>>>>>>> and will have a big impact on plugin writers so I thought of a
>>>>>>>>>>>>> couple of alternatives:
>>>>>>>>>>>>>
>>>>>>>>>>>>> First, we could add a relative_path field to RepositoryContent
>>>>>>>>>>>>> instead of moving it there. This would be an optional field. It
>>>>>>>>>>>>> would be up to plugins to manage this field and they would still need to
>>>>>>>>>>>>> populate the relative_path field on ContentArtifact. But plugins could use
>>>>>>>>>>>>> this optional field to store relative paths per repository and then use this
>>>>>>>>>>>>> field when generating publications.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The second alternative is one that is already laid out in the
>>>>>>>>>>>>> original email but to call it out again: it would be to not solve
>>>>>>>>>>>>> this in pulpcore. RPM would create its own object that would map content
>>>>>>>>>>>>> in a repository to relative_paths.
>>>>>>>>>>>>>
>>>>>>>>>>>>> David
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Apr 21, 2020 at 9:22 AM Quirin Pamp <p...@atix.de> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am not currently very well versed in the classes involved,
>>>>>>>>>>>>>> but moving relative_path around sounds slightly scary with the
>>>>>>>>>>>>>> potential to break things.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As such, I would be interested to be kept in the loop as this
>>>>>>>>>>>>>> moves forward.
>>>>>>>>>>>>>> (Mailing list once there is some movement is entirely
>>>>>>>>>>>>>> sufficient 😉)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Quirin Pamp
>>>>>>>>>>>>>> ------------------------------
>>>>>>>>>>>>>> *From:* pulp-dev-boun...@redhat.com <pulp-dev-boun...@redhat.com> on behalf of Ina Panova <ipan...@redhat.com>
>>>>>>>>>>>>>> *Sent:* 21 April 2020 14:07:13
>>>>>>>>>>>>>> *To:* Daniel Alley <dal...@redhat.com>
>>>>>>>>>>>>>> *Cc:* Pulp-dev <pulp-dev@redhat.com>
>>>>>>>>>>>>>> *Subject:* Re: [Pulp-dev] the "relative path" problem
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Daniel,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> how about setting up a meeting and brainstorming the
>>>>>>>>>>>>>> alternatives, pros/cons there?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --------
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ina Panova
>>>>>>>>>>>>>> Senior Software Engineer| Pulp| Red Hat Inc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> "Do not go where the path may lead,
>>>>>>>>>>>>>> go instead where there is no path and leave a trail."
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Apr 17, 2020 at 5:57 PM Daniel Alley <dal...@redhat.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Bump, this item needs to move forwards soon. Does anyone
>>>>>>>>>>>>>> have any thoughts?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Apr 1, 2020 at 9:40 AM Pavel Picka <ppi...@redhat.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> I'd like to add one more question to this topic. Do you think
>>>>>>>>>>>>>> it is a blocker for PRs [0] & [1]? While testing [2] these features I haven't
>>>>>>>>>>>>>> run into a real-world example where two packages with exactly the same
>>>>>>>>>>>>>> name appear.
>>>>>>>>>>>>>> I think this is a 'must have' feature, but until we
>>>>>>>>>>>>>> solve/decide it we could have the two features working, maybe with a
>>>>>>>>>>>>>> warning in the docs for users that this can happen in some 'special'
>>>>>>>>>>>>>> repositories.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To follow the topic directly: I like the proposed move to
>>>>>>>>>>>>>> 'RepositoryContent' and adding it to its uniqueness constraint (if I
>>>>>>>>>>>>>> understand correctly).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [0] https://github.com/pulp/pulp_rpm/pull/1657
>>>>>>>>>>>>>> [1] https://github.com/pulp/pulp_rpm/pull/1642
>>>>>>>>>>>>>> [2] tested with centos 7, 8, opensuse and SLE repositories
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Apr 1, 2020 at 3:22 PM Daniel Alley <dal...@redhat.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We'd like to start a discussion on the "relative path
>>>>>>>>>>>>>> problem" identified recently.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Problem:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Currently, a relative_path is tied to content in Pulp. This
>>>>>>>>>>>>>> means that if a content unit exists in two places within a
>>>>>>>>>>>>>> repository or across repositories, it has to be stored as two
>>>>>>>>>>>>>> separate content units. This creates redundant data and potential
>>>>>>>>>>>>>> confusion for users.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As a specific example, we need to support mirroring content
>>>>>>>>>>>>>> in pulp_rpm <https://pulp.plan.io/issues/6353>. Currently,
>>>>>>>>>>>>>> for each location at which a single package is stored, we’ll
>>>>>>>>>>>>>> need to create a content unit. We could end up with several records
>>>>>>>>>>>>>> representing a single package. Users may be confused about why they
>>>>>>>>>>>>>> see multiple records for a package and they may have trouble for
>>>>>>>>>>>>>> example deciding which content unit to copy.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Proposed Solution:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Move “relative_path” from its current location on
>>>>>>>>>>>>>> ContentArtifact, to RepositoryContent. This will require a
>>>>>>>>>>>>>> sizable data migration. It is possible that, in rare cases,
>>>>>>>>>>>>>> repository versions may change slightly due to deduplication.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A repository-version-wide uniqueness constraint will be
>>>>>>>>>>>>>> present on “relative_path”, independently of any other
>>>>>>>>>>>>>> repository uniqueness constraints (repo_key_fields) defined by the
>>>>>>>>>>>>>> plugin writer.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Modify the Stages API so that the relative_path can be
>>>>>>>>>>>>>> processed in the correct location – instead of
>>>>>>>>>>>>>> “DeclarativeArtifact” it will likely need to go on
>>>>>>>>>>>>>> “DeclarativeContent”
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Remove “location_href” from the RPM Package content model –
>>>>>>>>>>>>>> it was never a true part of the RPM (file) metadata, it is
>>>>>>>>>>>>>> derived from the repository metadata. So storing it as a part of
>>>>>>>>>>>>>> the Content unit doesn’t entirely make sense.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Alternatives
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In most cases, a content unit will have a single relative
>>>>>>>>>>>>>> path. Creating a general solution to solve a one-off
>>>>>>>>>>>>>> problem is usually not a good idea. As an alternative, we could
>>>>>>>>>>>>>> look at another solution for mirroring content. One example might
>>>>>>>>>>>>>> be to create a new object (e.g. RpmRepoMirrorContentMapping) that
>>>>>>>>>>>>>> maps content to specific paths within a repo or repo version.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Questions
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - How do we handle this in pulp_file? How are content
>>>>>>>>>>>>>>   units identified in pulp_file without relative_path?
>>>>>>>>>>>>>>   - Checksum?
>>>>>>>>>>>>>> - How was this problem handled in Pulp 2?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please weigh in if you have any input on potential problems
>>>>>>>>>>>>>> with the proposal, potential alternate solutions, or other
>>>>>>>>>>>>>> insights or questions!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Pavel Picka
>>>>>>>>>>>>>> Red Hat
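
For anyone skimming the thread, here is a rough sketch of how the four points of the exact-mirror proposal quoted above could fit together. It is a toy model in plain Python, not pulpcore or pulp_rpm code: the Repository and Publication classes, the exact_mirror flag as a dataclass field, and the sync/modify/publish signatures are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class Publication:
    """Hypothetical stand-in for a publication: relative_path -> checksum."""
    version: int
    paths: Dict[str, str]


@dataclass
class Repository:
    """Toy repository; `exact_mirror` is the flag floated in the proposal."""
    name: str
    exact_mirror: bool = False
    versions: Dict[int, Dict[str, str]] = field(default_factory=dict)
    publications: Dict[int, Publication] = field(default_factory=dict)

    def _new_version(self, content: Dict[str, str]) -> int:
        number = len(self.versions) + 1
        self.versions[number] = dict(content)
        return number

    def sync(self, remote_files: Dict[str, str]) -> int:
        """Mirror the remote; exact mirrors get their publication right away."""
        number = self._new_version(remote_files)
        if self.exact_mirror:
            # Paths (including repodata) are taken from the remote as is.
            self.publications[number] = Publication(number, dict(remote_files))
        return number

    def modify(self, add: Optional[Dict[str, str]] = None,
               remove: Iterable[str] = ()) -> int:
        """Add/remove content in a new version; refused for exact mirrors."""
        if self.exact_mirror:
            raise PermissionError("exact mirrors only change through sync")
        removed = set(remove)
        latest = self.versions.get(len(self.versions), {})
        content = {p: c for p, c in latest.items() if p not in removed}
        content.update(add or {})
        return self._new_version(content)

    def publish(self, number: int) -> Publication:
        if self.exact_mirror:
            # No-op: hand back the publication created at sync time.
            return self.publications[number]
        # A managed repo would generate fresh metadata here.
        return Publication(number, dict(self.versions[number]))
```

In this sketch a managed repository keeps today's behaviour (publish builds something new each time), while an exact mirror never regenerates anything, so the published paths cannot drift from the mirrored metadata.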
_______________________________________________
Pulp-dev mailing list
Pulp-dev@redhat.com
https://www.redhat.com/mailman/listinfo/pulp-dev