Updated the redmine story https://pulp.plan.io/issues/6353 with this
proposal. Feel free to comment there as well.
If there are no objections, I'll start on a PoC once I finish the items I'm
working on right now.

Thanks,
Tanya

On Wed, Oct 21, 2020 at 8:48 PM Tatiana Tereshchenko <ttere...@redhat.com>
wrote:

> TL;DR: An attempt to propose the least invasive option to solve the case
> when remote repository metadata needs to be mirrored. Please provide
> feedback if you are interested in the outcome.
>
> There have been multiple attempts and discussions to solve the
> relative_path problem in a general way which covers multiple use cases.
> They all look very invasive and only possible to be done in Pulp 4+ due to
> the amount and significance of changes that need to be made to the data
> models and/or to the API.
>
> The following proposal solves only this use case: As a user, I can mirror
> remote repository metadata as is.
> An additional goal is to avoid backward incompatible changes and ideally
> leave a way for further improvement to solve the problem in a more general
> way.
> (The following proposal does NOT solve the use case: As a user, I can have
> the same content under different relative paths in any repository.)
>
> Proposal:
> - Have a way to distinguish between repositories with managed content and
> exact mirrors (e.g. create a repository with exact_mirror=True or a
> new dedicated repository type)
>  - For such repos, create a publication at sync time (includes published
> artifacts and metadata).
>  - For such repos, publish is a no-op and always returns the existing
> publication for the requested repo version.
>  - For such repos, no modifications are allowed except the sync in mirror
> mode. (At least for RPM plugin, I believe we can't allow discrepancies
> between metadata and content in a repo, especially if some content is
> removed.)
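
A minimal sketch of the four proposal points above, in plain Python rather
than pulpcore models. The names Repository, Publication, exact_mirror, sync,
modify and publish are hypothetical stand-ins chosen for illustration, not
existing pulpcore or pulp_rpm API, and a real implementation may well look
different.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Publication:
    """Hypothetical stand-in for a publication of one repository version."""
    repo_version: int
    published_artifacts: Dict[str, str]  # relative_path -> artifact checksum
    metadata: Dict[str, bytes]           # metadata path -> bytes, kept as downloaded


@dataclass
class Repository:
    """Hypothetical repository; exact_mirror=True marks an exact mirror."""
    name: str
    exact_mirror: bool = False
    latest_version: int = 0
    publications: Dict[int, Publication] = field(default_factory=dict)

    def sync(self, remote_files: Dict[str, str],
             remote_metadata: Dict[str, bytes], mirror: bool = True) -> int:
        if self.exact_mirror and not mirror:
            raise ValueError("exact mirrors can only be synced in mirror mode")
        self.latest_version += 1
        if self.exact_mirror:
            # The publication is created at sync time, paths and metadata as is.
            self.publications[self.latest_version] = Publication(
                self.latest_version, dict(remote_files), dict(remote_metadata)
            )
        return self.latest_version

    def modify(self) -> int:
        if self.exact_mirror:
            # Any change other than a mirror sync could make metadata and
            # content disagree, so it is rejected.
            raise PermissionError("exact mirrors cannot be modified")
        self.latest_version += 1
        return self.latest_version

    def publish(self, version: Optional[int] = None) -> Publication:
        version = self.latest_version if version is None else version
        if self.exact_mirror:
            # No-op: return the publication that sync already created.
            return self.publications[version]
        # Managed repositories would generate fresh metadata here (omitted).
        publication = Publication(version, {}, {})
        self.publications[version] = publication
        return publication
```

In this sketch, calling publish() twice on an exact mirror returns the very
same Publication object, while modify() raises, which mirrors the "no
modifications except the sync in mirror mode" rule.
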
>
> Pros:
>  - non-invasive, only additive model changes
>  - can be implemented in a plugin which needs it, or it can be moved to
> pulpcore if it allows plugin input at certain points.
>  - leaves a way for further improvement to handle a more general case, see
> the full proposal here
> https://hackmd.io/02KBjCD3Q0WP7p4ALwzhJw#relative_path-in-PublishedArtifact-only
>
> Cons:
>  - doesn't solve the problem of various relative paths for the same
> content in a general way
>  - a separate code path (at times) to handle this type of repository.
>
> For reference:
>  - hackmd doc with all the considered proposals and the summary of the
> potentially valid ones https://hackmd.io/02KBjCD3Q0WP7p4ALwzhJw
>  - pulpcon video with the discussion of the proposals
> https://www.youtube.com/watch?v=7IzxAQYr5-I
>
> Thanks,
> Tanya
>
> On Thu, May 7, 2020 at 2:07 PM Brian Bouterse <bmbou...@redhat.com> wrote:
>
>> I agree with that problem statement. pulp_file may want to have the same
>> Content at two different paths in different RepositoryVersions (or even the
>> same RepositoryVersion). Without this capability a user could never "move"
>> where content lives in a RepositoryVersion if it's already been placed in
>> any other RepositoryVersion.
>>
>> Additionally pulp_maven may need to sync two repositories in the wild
>> that already contain the same content in two locations. I offer this as an
>> example not to pile on, but because it's a multi-content artifact which I
>> believe we will need to consider also as we work towards a solution.
>>
>> I've been spending time on developing a solution, but it needs more work
>> so it's not ready yet. Also other katello and galaxy_ng work continues to
>> pre-empt this, so it could take a while.
>>
>> On Thu, May 7, 2020 at 3:39 AM Matthias Dellweg <mdell...@redhat.com>
>> wrote:
>>
>>> > Users need to be able to store the same content unit at different
>>> relative paths in different repository versions. This problem is not unique
>>> to the RPM plugin. Do we agree about that?
>>> Yes, we agree. In pulp_deb, relative_path is part of the content's
>>> natural_key to circumvent this problem. So this creates two content units
>>> that only differ in relative_path. At least they share the artifact.
>>>
>>> On Thu, May 7, 2020 at 2:06 AM Dennis Kliban <dkli...@redhat.com> wrote:
>>>
>>>> I'd like to provide a little bit more context for my previous email by
>>>> going back to the original problem statement:
>>>>
>>>> On Wed, Apr 1, 2020 at 9:23 AM Daniel Alley <dal...@redhat.com> wrote:
>>>>
>>>>> Problem:
>>>>>
>>>>> Currently, a relative_path is tied to content in Pulp. This means that
>>>>> if a content unit exists in two places within a repository or across
>>>>> repositories, it has to be stored as two separate content units. This
>>>>> creates redundant data and potential confusion for users.
>>>>>
>>>>> As a specific example, we need to support mirroring content in
>>>>> pulp_rpm <https://pulp.plan.io/issues/6353>. Currently, for each
>>>>> location at which a single package is stored, we’ll need to create a
>>>>> content unit. We could end up with several records representing a single
>>>>> package. Users may be confused about why they see multiple records for a
>>>>> package and they may have trouble for example deciding which content unit
>>>>> to copy.
>>>>>
>>>> Users need to be able to store the same content unit at different
>>>> relative paths in different repository versions. This problem is not unique
>>>> to the RPM plugin. Do we agree about that?
>>>>
>>>> I've been working on a potential solution that solves this problem in a
>>>> document[0]. It is a complicated change and the document does not fully
>>>> capture the plan yet. Feedback and help on the design is welcome.
>>>>
>>>> [0] https://hackmd.io/02KBjCD3Q0WP7p4ALwzhJw?edit
>>>>
>>>>
>>>> On Mon, May 4, 2020 at 4:11 PM Dennis Kliban <dkli...@redhat.com>
>>>> wrote:
>>>>
>>>>> I've reached two conclusions while trying to formulate a solution:
>>>>>
>>>>> This problem needs to be solved at the repository version level.
>>>>> Repository membership needs to be tracked at the artifact level, and not
>>>>> content level as it is now.
>>>>>
>>>>> On Thu, Apr 30, 2020 at 1:11 PM Daniel Alley <dal...@redhat.com>
>>>>> wrote:
>>>>>
>>>>>> Cool, so the only difference is whether to try to store the
>>>>>> relationship in the DB, or leverage the fact that we already have the
>>>>>> metadata and just re-parse it.
>>>>>>
>>>>>> I know the latter approach has yet to be written up, but my concern
>>>>>> there is that adding another layer of indirection between "repository
>>>>>> version" and "content" is going to have an adverse impact on performance,
>>>>>> since it is already the most complex and demanding query we issue to the
>>>>>> DB and one of the most common and important.
>>>>>>
>>>>>> On Thu, Apr 30, 2020 at 12:50 PM David Davis <davidda...@redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes but I was imagining the mapping would be stored not as Content
>>>>>>> but as a separate object. So we wouldn't use filename for the mapping
>>>>>>> (rather we'd use ContentArtifact pk) and we wouldn't need to change
>>>>>>> ContentArtifact's relative_path at all. That said, I think your solution
>>>>>>> captures the idea and is better in some ways.
>>>>>>>
>>>>>>> Changing the RepositoryContent model to point to ContentArtifacts
>>>>>>> and store relative_paths is probably the best and most correct solution
>>>>>>> in theory. However, it's going to be painful to implement for both core
>>>>>>> and plugins.
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 30, 2020 at 12:33 PM Daniel Alley <dal...@redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> @David Davis <davidda...@redhat.com>  so this proposal would go
>>>>>>>> something like this, correct?:
>>>>>>>>
>>>>>>>> * For the signed metadata / exact mirror use-case we need to store
>>>>>>>> the repository metadata itself as a content unit inside the
>>>>>>>> RepositoryVersion anyway (because the hash must be equal)
>>>>>>>> * Because we have this metadata lying around, we can reference it
>>>>>>>> at publish time to discover the appropriate 
>>>>>>>> PublishedArtifact.relative_path
>>>>>>>>    * Create a map of "filename" -> "location_href" and look up the
>>>>>>>> filename of each RPM package to find the appropriate path
>>>>>>>>    * This should be pretty fast for the RPM plugin since
>>>>>>>> createrepo_c is doing all the hard work
>>>>>>>> * Data migration to ensure ContentArtifact.relative_path is only
>>>>>>>> storing the filename (and I would suggest we also change the name to
>>>>>>>> "filename")
>>>>>>>> * If metadata isn't present in the RepositoryVersion, then just
>>>>>>>> tweak the PublishedArtifact.relative_path so that it uses whichever our
>>>>>>>> default repo layout is
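
A rough sketch of the lookup described in the bullets above, in plain
Python. The function names and the fallback layout
("Packages/<first letter>/<filename>") are assumptions made for
illustration; the thread does not say what the default repo layout is, and
nothing here is createrepo_c or pulpcore API.

```python
import posixpath
from typing import Dict, Iterable


def build_location_map(location_hrefs: Iterable[str]) -> Dict[str, str]:
    """Map bare filename -> location_href as listed in the repo metadata.

    If two hrefs share a filename, the last one wins; a real implementation
    would have to decide how to handle that case.
    """
    return {posixpath.basename(href): href for href in location_hrefs}


def published_relative_path(filename: str, location_map: Dict[str, str],
                            default_dir: str = "Packages") -> str:
    """Pick the path to publish a package under."""
    if filename in location_map:
        # Metadata is present: publish at the mirrored location.
        return location_map[filename]
    # No metadata entry: fall back to an assumed default layout.
    return posixpath.join(default_dir, filename[0].lower(), filename)


hrefs = ["x/y/foo-1.0-1.noarch.rpm"]
location_map = build_location_map(hrefs)
mirrored = published_relative_path("foo-1.0-1.noarch.rpm", location_map)
fallback = published_relative_path("Bar-2.0-1.noarch.rpm", location_map)
```

With this shape, the stored value on the artifact side is only the filename,
and the full path is decided at publish time.
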
>>>>>>>>
>>>>>>>> On Tue, Apr 28, 2020 at 11:41 AM David Davis <davidda...@redhat.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yes, that's correct. During our meeting we discussed two options:
>>>>>>>>> the first was to extend RepositoryContent to store relative path per
>>>>>>>>> ContentArtifact as storing a relative_path per Content won't work for
>>>>>>>>> multi-Artifact Content units.
>>>>>>>>>
>>>>>>>>> An alternative that I pitched was to have plugins (or maybe even
>>>>>>>>> core someday) store this information outside RepositoryContent and
>>>>>>>>> then use this information during publishing to set relative_path on
>>>>>>>>> PublishedArtifacts. We'd have to modify the content app if we wanted
>>>>>>>>> to support pass-through publications, but I think asking plugins to
>>>>>>>>> use published artifacts in this case is warranted. That said, I don't
>>>>>>>>> think anyone else was keen on this idea.
>>>>>>>>>
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Apr 28, 2020 at 10:30 AM Matthias Dellweg <
>>>>>>>>> mdell...@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>> That is only used for passthrough publication afaik. If you
>>>>>>>>>> publish each content unit "by hand", you create a new relative path
>>>>>>>>>> for each published artifact. That is why it can be empty and the
>>>>>>>>>> content can still be published.
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 28, 2020 at 4:09 PM Daniel Alley <dal...@redhat.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> We realized in our discussion that the original proposal
>>>>>>>>>>> described in my email will not work, because "relative_path"
>>>>>>>>>>> ultimately describes the path of the published *artifacts* (not
>>>>>>>>>>> content), and for content types with multiple artifacts, storing
>>>>>>>>>>> this information in a field on RepositoryContent would not be
>>>>>>>>>>> possible.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Apr 27, 2020 at 6:08 PM Daniel Alley <dal...@redhat.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> There is a video call scheduled to discuss this issue tomorrow
>>>>>>>>>>>> (Tuesday April 28th) at 13:30 UTC (please convert to your local
>>>>>>>>>>>> time).
>>>>>>>>>>>> https://meet.google.com/scy-csbx-qiu
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Apr 25, 2020 at 7:02 AM David Davis <
>>>>>>>>>>>> davidda...@redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I had a chance to think about this some more yesterday and
>>>>>>>>>>>>> wanted to email out my thoughts. I also think that this change
>>>>>>>>>>>>> sounds scary and will have a big impact on plugin writers, so I
>>>>>>>>>>>>> thought of a couple of alternatives:
>>>>>>>>>>>>>
>>>>>>>>>>>>> First, we could add a relative_path field to RepositoryContent
>>>>>>>>>>>>> instead of moving it there. This would be an optional field. It
>>>>>>>>>>>>> would be up to plugins to manage this field and they would still
>>>>>>>>>>>>> need to populate the relative_path field on ContentArtifact. But
>>>>>>>>>>>>> plugins could use this optional field to store relative paths per
>>>>>>>>>>>>> repository and then use this field when generating publications.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The second alternative is one that is already laid out in the
>>>>>>>>>>>>> original email, but to call it out again: it would be to not solve
>>>>>>>>>>>>> this in pulpcore. RPM would create its own object that would map
>>>>>>>>>>>>> content in a repository to relative_paths.
>>>>>>>>>>>>>
>>>>>>>>>>>>> David
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Apr 21, 2020 at 9:22 AM Quirin Pamp <p...@atix.de>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am not currently very well versed in the classes involved,
>>>>>>>>>>>>>> but moving relative_path around sounds slightly scary with the
>>>>>>>>>>>>>> potential to break things.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As such, I would be interested to be kept in the loop as this
>>>>>>>>>>>>>> moves forward. (Mailing list once there is some movement is
>>>>>>>>>>>>>> entirely sufficient 😉)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Quirin Pamp
>>>>>>>>>>>>>> ------------------------------
>>>>>>>>>>>>>> *From:* pulp-dev-boun...@redhat.com <
>>>>>>>>>>>>>> pulp-dev-boun...@redhat.com> on behalf of Ina Panova <
>>>>>>>>>>>>>> ipan...@redhat.com>
>>>>>>>>>>>>>> *Sent:* 21 April 2020 14:07:13
>>>>>>>>>>>>>> *To:* Daniel Alley <dal...@redhat.com>
>>>>>>>>>>>>>> *Cc:* Pulp-dev <pulp-dev@redhat.com>
>>>>>>>>>>>>>> *Subject:* Re: [Pulp-dev] the "relative path" problem
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Daniel,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> how about setting up a meeting and brainstorming the
>>>>>>>>>>>>>> alternatives, pros/cons there?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --------
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ina Panova
>>>>>>>>>>>>>> Senior Software Engineer| Pulp| Red Hat Inc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> "Do not go where the path may lead,
>>>>>>>>>>>>>>  go instead where there is no path and leave a trail."
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Apr 17, 2020 at 5:57 PM Daniel Alley <
>>>>>>>>>>>>>> dal...@redhat.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Bump, this item needs to move forwards soon.  Does anyone
>>>>>>>>>>>>>> have any thoughts?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Apr 1, 2020 at 9:40 AM Pavel Picka <ppi...@redhat.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> I'd like to add one more question to this topic. Do you think
>>>>>>>>>>>>>> it is a blocker for PRs [0] & [1]? While testing [2] these
>>>>>>>>>>>>>> features, I haven't run into a real-world example where two
>>>>>>>>>>>>>> packages with exactly the same name appear.
>>>>>>>>>>>>>> I think this is a 'must have' feature, but until we
>>>>>>>>>>>>>> solve/decide it we can have the two features working, maybe with
>>>>>>>>>>>>>> a warning in the docs for users that this can happen in some
>>>>>>>>>>>>>> 'special' repositories.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To follow the topic directly: I like the proposed move to
>>>>>>>>>>>>>> 'RepositoryContent' and adding it to its uniqueness constraint (if
>>>>>>>>>>>>>> I understand correctly).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [0] https://github.com/pulp/pulp_rpm/pull/1657
>>>>>>>>>>>>>> [1] https://github.com/pulp/pulp_rpm/pull/1642
>>>>>>>>>>>>>> [2] tested with centos 7, 8, opensuse and SLE repositories
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Apr 1, 2020 at 3:22 PM Daniel Alley <
>>>>>>>>>>>>>> dal...@redhat.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We'd like to start a discussion on the "relative path
>>>>>>>>>>>>>> problem" identified recently.
>>>>>>>>>>>>>> Problem:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Currently, a relative_path is tied to content in Pulp. This
>>>>>>>>>>>>>> means that if a content unit exists in two places within a
>>>>>>>>>>>>>> repository or across repositories, it has to be stored as two
>>>>>>>>>>>>>> separate content units. This creates redundant data and potential
>>>>>>>>>>>>>> confusion for users.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As a specific example, we need to support mirroring content
>>>>>>>>>>>>>> in pulp_rpm <https://pulp.plan.io/issues/6353>. Currently,
>>>>>>>>>>>>>> for each location at which a single package is stored, we’ll
>>>>>>>>>>>>>> need to create a content unit. We could end up with several
>>>>>>>>>>>>>> records representing a single package. Users may be confused
>>>>>>>>>>>>>> about why they see multiple records for a package and they may
>>>>>>>>>>>>>> have trouble, for example, deciding which content unit to copy.
>>>>>>>>>>>>>> Proposed Solution:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Move “relative_path” from its current location on
>>>>>>>>>>>>>> ContentArtifact to RepositoryContent. This will require a
>>>>>>>>>>>>>> sizable data migration. It is possible that, in rare cases,
>>>>>>>>>>>>>> repository versions may change slightly due to deduplication.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A repository-version-wide uniqueness constraint will be
>>>>>>>>>>>>>> present on “relative_path”, independently of any other
>>>>>>>>>>>>>> repository uniqueness constraints (repo_key_fields) defined by the
>>>>>>>>>>>>>> plugin writer.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Modify the Stages API so that the relative_path can be
>>>>>>>>>>>>>> processed in the correct location – instead of
>>>>>>>>>>>>>> “DeclarativeArtifact” it will likely need to go on
>>>>>>>>>>>>>> “DeclarativeContent”.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Remove “location_href” from the RPM Package content model –
>>>>>>>>>>>>>> it was never a true part of the RPM (file) metadata; it is
>>>>>>>>>>>>>> derived from the repository metadata. So storing it as a part of
>>>>>>>>>>>>>> the Content unit doesn’t entirely make sense.
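
To make the proposed solution concrete, here is a small plain-Python
sketch of repository membership that carries the relative_path, with a
repository-version-wide uniqueness check on that path. The class
RepositoryVersionSketch and its methods are hypothetical and only model the
idea; they are not the pulpcore RepositoryContent model, and the content
ids are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class RepositoryVersionSketch:
    """Hypothetical stand-in: membership rows carry the relative_path."""
    # relative_path -> content id; one entry per path, so a path can only
    # ever point at a single content unit within this version.
    _by_path: Dict[str, str] = field(default_factory=dict)

    def add(self, content_id: str, relative_path: str) -> None:
        existing = self._by_path.get(relative_path)
        if existing is not None and existing != content_id:
            # Version-wide uniqueness constraint on relative_path.
            raise ValueError(
                f"{relative_path!r} is already used by {existing!r}"
            )
        self._by_path[relative_path] = content_id

    def paths_for(self, content_id: str) -> List[str]:
        """All paths at which one content unit appears in this version."""
        return sorted(p for p, c in self._by_path.items() if c == content_id)

    def content_ids(self) -> Set[str]:
        """Distinct content units, regardless of how many paths they have."""
        return set(self._by_path.values())


version = RepositoryVersionSketch()
version.add("pkg-1", "Packages/f/foo.rpm")
version.add("pkg-1", "updates/foo.rpm")  # same content, second location
```

Here the package stored at two locations stays a single content unit, which
is the deduplication the proposal is after.
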
>>>>>>>>>>>>>> Alternatives
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In most cases, a content unit will have a single relative
>>>>>>>>>>>>>> path. Creating a general solution to solve a one-off problem is
>>>>>>>>>>>>>> usually not a good idea. As an alternative, we could look at
>>>>>>>>>>>>>> another solution for mirroring content. One example might be to
>>>>>>>>>>>>>> create a new object (e.g. RpmRepoMirrorContentMapping) that maps
>>>>>>>>>>>>>> content to specific paths within a repo or repo version.
>>>>>>>>>>>>>> Questions
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - How do we handle this in pulp_file? How are content
>>>>>>>>>>>>>>    units identified in pulp_file without relative_path?
>>>>>>>>>>>>>>       - Checksum?
>>>>>>>>>>>>>>       - How was this problem handled in Pulp 2?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please weigh in if you have any input on potential problems
>>>>>>>>>>>>>> with the proposal, potential alternate solutions, or other
>>>>>>>>>>>>>> insights or questions!
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> Pulp-dev mailing list
>>>>>>>>>>>>>> Pulp-dev@redhat.com
>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Pavel Picka
>>>>>>>>>>>>>> Red Hat
>>>>>>>>>>>>>>
