On 06/28/2017 02:53 PM, Michael Hrivnak wrote:
> I'm generally a big believer in this direction, as many of you know. :) I 
> think it is achievable, and from a
> plugin writer perspective, would be very similar to what they do today. 
> Whereas in Pulp 2 a plugin creates a
> symlink on disk, in Pulp 3 it would add an entry to a database table with 
> nearly the same information.
> 
> More thoughts in-line.
> 
> On Wed, Jun 28, 2017 at 2:27 PM, Jeff Ortel <[email protected]> wrote:
> 
>     I have been doing some thinking about pulp3 publishing with the following 
> goals in mind:
> 
>     - Eliminate symlinks.
>     - Eliminate need for each plugin to have its own Apache conf.
>     - Prevent orphaned content that is still published from being deleted.
> 
>     The main concept is to store the relationship between an artifact and a 
> URL in the DB instead of using the
>     filesystem.  A `Publication` is created (and owned) by a publisher.  Each 
> `Publication` is composed of (linked
>     to) many `artifacts`.  The linkage contains the path component of the URL 
> which is used to locate the artifact
>     referenced by a URL.
> 
> 
> A Publication should also be associated with a repo version. This is how 
> we'll be able to know at publish time:
> 
> - did any content change? If not, skip the publish unless publisher config 
> changed...
> - if content did change, what changes happened? When possible, do an 
> incremental publish based on this info.
> 
> I think this is also the most natural way for a user to reason about whether 
> any given publication is current,
> and if not, what differences it has from the repo contents. It gives the best 
> visibility into what content
> they have, and what content is available to clients.

Makes sense.
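
To make the skip / incremental / full decision concrete, here is a rough sketch. Everything in it is an assumption for illustration: `PublicationRecord`, `plan_publish`, and the idea of storing a digest of the publisher config are hypothetical names, not an existing Pulp API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PublicationRecord:
    """Hypothetical record of what the last publication was built from."""
    repo_version: int    # repo version the publication was created from
    config_digest: str   # digest of the publisher config used at that time


def plan_publish(last: Optional[PublicationRecord],
                 repo_version: int,
                 config_digest: str) -> str:
    """Return "skip", "incremental" or "full" for the next publish."""
    if last is None:
        return "full"          # nothing published yet
    if last.config_digest != config_digest:
        return "full"          # publisher config changed, start over
    if last.repo_version == repo_version:
        return "skip"          # same content, same config
    return "incremental"       # content changed, config did not
```

Whether an incremental publish is actually possible would still be up to each plugin; the sketch only shows where the repo version comes into the decision.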

>  
> 
> 
>     This covers artifacts as we know them today.  But what about files 
> generated during publishing, a.k.a.
>     metadata?  I propose that these files be stored as artifacts as well.  
> This requires an `Artifact` to be
>     redefined slightly.  The definition would read more like:
> 
>       "A file associated with either stored or published content".
> 
>     Or, it would be even more generic, like:
> 
>       "A file contained within the pulp inventory that may be associated with 
> a content (unit) or publication."
> 
> 
> There is enough difference between a file that's part of a unit vs. a file 
> that Pulp created during a publish
> that I think they should be stored separately. I recognize that the tables 
> would be very similar, if not
> identical, but I don't think we gain much from combining them.
> 
> In practice I don't think we expect that a file would ever appear both as 
> part of a Content unit and as
> something created by a publish task. They come from two very different 
> places, which gives them different
> properties. Content likely has catalog entries, so those artifacts can be 
> re-retrieved at any point, even
> transparently from the client perspective. Publication artifacts must be 
> created by a publish task; if one is
> deleted, the whole publication should be re-created. These differences impact 
> how users may back up their data,
> how replication may occur from one Pulp to another, caching behavior, etc.

Good points.

I'd considered separate tables but wasn't convinced until now.

> 
> Signing is interesting to consider. We don't have a good plan yet for 
> supporting that, but we'll need it
> sooner rather than later. A user will want to sign a specific publication, usually 
> by signing the primary metadata
> file. PULP_MANIFEST is a good example where the same one could easily be 
> produced by multiple different
> publishers and repos that happen to contain the same files. Think about 
> katello's multi-org use cases for
> example. If that manifest gets signed, we want the signature associated with 
> this repo and this publication
> only, and never to appear with a different repo that happens to have the same 
> content. So this signature needs
> an association with the publication itself in addition to an association with 
> the file being signed. Maybe the
> signature itself is just another file associated with the publication.
> 
> Here is another small detail, but an important one. If we decide that an 
> artifact can be shared by multiple
> content units, we're already getting into territory where deleting that 
> artifact must be done with care, and only
> if it is not associated with any other content. There's a race here that 
> maybe we can overcome, but is very
> important to stay on top of. If we also must check a second association type 
> to see if an artifact is
> associated with content OR a publication, that makes the race more complex.
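
One way to picture the two-association check is to fold both tests into the delete statement itself, so that the check and the delete are not separate round trips. This is only a sketch using SQLite: the table names `content_artifact` and `linked_artifact` and the helper `delete_if_orphaned` are made up for illustration. It does not settle the race for a multi-writer database, which would still depend on the isolation level and on foreign-key constraints.

```python
import sqlite3

# Hypothetical, simplified tables for illustration only.
SCHEMA = """
CREATE TABLE artifact (id INTEGER PRIMARY KEY, path TEXT NOT NULL);
CREATE TABLE content_artifact (
    content_id INTEGER NOT NULL,
    artifact_id INTEGER NOT NULL REFERENCES artifact(id)
);
CREATE TABLE linked_artifact (
    publication_id INTEGER NOT NULL,
    artifact_id INTEGER NOT NULL REFERENCES artifact(id)
);
"""


def delete_if_orphaned(db: sqlite3.Connection, artifact_id: int) -> bool:
    """Delete the artifact row only if no content unit and no publication
    refers to it.  Returns True when a row was actually deleted."""
    with db:  # commit on success, roll back on error
        cur = db.execute(
            """
            DELETE FROM artifact
            WHERE id = ?
              AND NOT EXISTS (SELECT 1 FROM content_artifact
                              WHERE artifact_id = artifact.id)
              AND NOT EXISTS (SELECT 1 FROM linked_artifact
                              WHERE artifact_id = artifact.id)
            """,
            (artifact_id,),
        )
    return cur.rowcount == 1
```

With separate tables for content artifacts and publication files, the second `NOT EXISTS` would not be needed at all, which is part of the argument above.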
> 
>  
> 
> 
>     In any case, the relationship to a content (unit) becomes optional.
> 
>     Publications are not user facing.  I think we can keep this as an 
> internal core concept.  At least for the
>     MVP.
> 
>     The /var/lib/pulp/published directory goes away.
> 
>     General Flows:
> 
>     Publishing: "The publisher will compose a publication"
> 
>     1. Publisher creates a publication using the plugin API.
> 
> 
> Does the publication have information about its base path, authorization, 
> etc? We've relied on the publisher
> for that sort of thing previously, but maybe the publisher should use those 
> settings as the defaults to impose
> onto a publication. Wouldn't it be slick to promote a publication just by 
> changing the path it's made
> available at? Or to add a second path it's made available at...

I considered storing the base path in the Publication. But I don't see how the 
query using the /path/
component of the URL could be indexed if the path is split between the 
Publication and the LinkedArtifact.

Adding authorization information to the Publication sounds like a good idea.

> 
> Speaking of which, at some point I really want to disconnect the production 
> of a publication from the serving
> of it. A publication could be made available several different ways via http 
> (maybe several at the same time),
> written to an ISO, rsync'd somewhere, torrented, actively pushed to some 
> other service, etc. There's already a
> huge demand for the ability to publish once, and promote or otherwise 
> interact with that published thing. See
> for example the clone distributor that katello made for yum repos.
> 
> I'm worried about biting all of this off now. As you said, if it's possible 
> to just not expose this during the
> MVP, that might be best for us to add on all the additional concepts later. 
> We should think through them
> up front, though, to make sure we don't paint ourselves into a corner.
>  
> 
>     2. Publisher adds content artifacts to the publication.
>     3. Publisher generates some metadata files in the working dir.
>     4. Publisher adds the metadata files to the publication using the plugin 
> API.  The artifacts can likely be
>     created behind the scenes by the plugin API.
>     5. Publisher commits (publishes) the publication.  The plugin API ensures 
> this is atomic.
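
The publishing steps above could look roughly like this from the plugin side. It is a sketch under stated assumptions: `PublicationBuilder`, its `add()` / `commit()` methods, and the SQLite tables are hypothetical stand-ins for whatever the plugin API ends up being. The point is only that links are staged first and then written in one transaction, so a failed publish leaves nothing behind.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical, simplified tables for illustration only.
SCHEMA = """
CREATE TABLE publication (
    id INTEGER PRIMARY KEY,
    publisher_id TEXT NOT NULL,
    created TEXT NOT NULL
);
CREATE TABLE linked_artifact (
    id INTEGER PRIMARY KEY,
    publication_id INTEGER NOT NULL REFERENCES publication(id),
    artifact_id TEXT NOT NULL,
    path TEXT NOT NULL,
    UNIQUE (publication_id, path)
);
"""


class PublicationBuilder:
    """Collects (artifact, path) links, then writes them all at once."""

    def __init__(self, db: sqlite3.Connection, publisher_id: str):
        self.db = db
        self.publisher_id = publisher_id
        self.links = []  # staged in memory until commit()

    def add(self, artifact_id: str, path: str) -> None:
        """Steps 2 and 4: link a content or metadata artifact to a path."""
        self.links.append((artifact_id, path))

    def commit(self) -> int:
        """Step 5: create the publication and its links in one transaction."""
        with self.db:  # commits on success, rolls back on any exception
            cur = self.db.execute(
                "INSERT INTO publication (publisher_id, created) VALUES (?, ?)",
                (self.publisher_id, datetime.now(timezone.utc).isoformat()),
            )
            publication_id = cur.lastrowid
            self.db.executemany(
                "INSERT INTO linked_artifact (publication_id, artifact_id, path)"
                " VALUES (?, ?, ?)",
                [(publication_id, a, p) for a, p in self.links],
            )
        return publication_id
```

A publisher would call `add()` once per content artifact and once per generated metadata file, then `commit()` at the end.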
> 
>     Client makes a GET request for content (or metadata):
> 
> 
>     1. Request is routed to the content (WSGI) application (just like in 
> pulp2 for RPM).
>     2. Query the `LinkedArtifact` table by URL path component to get the 
> artifact.
>     3. Forward the artifact storage path to:
>        <not stored locally>
>            streamer
>        <stored locally>
>            x-send
> 
> 
> We may want different cache behavior. Files associated with units should not 
> change, so they can be cached for
> a long time. Files produced by a publish (PULP_MANIFEST, repomd.xml, etc.) 
> can change at any time and should
> perhaps not be cached at all. It'll be important to differentiate what type 
> of file is being returned.
>  
> 
>     4. Done.
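
A rough sketch of the GET flow, including the cache distinction raised above. The names (`Artifact`, `route`, `cache_control`) and the `generated` flag are assumptions for illustration, the dict stands in for the `LinkedArtifact` query, and the `Cache-Control` values are placeholders rather than a recommendation.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class Artifact:
    storage_path: str
    stored_locally: bool       # False until the file has been downloaded
    generated: bool = False    # True for files produced by a publish


def route(links: Mapping[str, Artifact],
          url_path: str) -> Optional[Tuple[str, str]]:
    """Steps 2 and 3: find the artifact for a URL path and pick a handler.

    Returns (handler, storage_path), or None when nothing is published
    at that path (the caller would answer 404)."""
    artifact = links.get(url_path)
    if artifact is None:
        return None
    handler = "x-send" if artifact.stored_locally else "streamer"
    return handler, artifact.storage_path


def cache_control(artifact: Artifact) -> str:
    """Unit files should not change; generated metadata can change on
    every publish, so it gets revalidated each time."""
    return "no-cache" if artifact.generated else "max-age=86400"
```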
> 
> 
>     Tables:
>     =============================
> 
>     Publication
>       id [PK]
>       publisher_id [FK]
>       created
>       schemes
> 
>     LinkedArtifact
>       id [PK]
>       publication_id [FK]
>       artifact_id [FK]
>       URL
> 
> 
> I'd call this relative_path instead of URL

This needs to be the full path component of the URL.  Agreed, URL isn't the most accurate name, but for
the purposes of conveying the idea I wanted to make it clear that it supports URL matching.
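
To show why keeping the full path in one column matters, here is a sketch of the two tables in SQLite with a plain index on the path. The column is named `path` here, which is my choice for the sketch. The `schemes` column is left out because its type isn't pinned down yet, and the `artifact` table and `find_storage_path` helper are hypothetical.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only.
SCHEMA = """
CREATE TABLE artifact (
    id TEXT PRIMARY KEY,
    storage_path TEXT NOT NULL
);
CREATE TABLE publication (
    id TEXT PRIMARY KEY,
    publisher_id TEXT NOT NULL,
    created TEXT NOT NULL
);
CREATE TABLE linked_artifact (
    id INTEGER PRIMARY KEY,
    publication_id TEXT NOT NULL REFERENCES publication(id),
    artifact_id TEXT NOT NULL REFERENCES artifact(id),
    path TEXT NOT NULL            -- full path component of the URL
);
CREATE INDEX linked_artifact_path ON linked_artifact (path);
"""


def find_storage_path(db: sqlite3.Connection, url_path: str):
    """Exact-match lookup on the indexed path; None when not published."""
    row = db.execute(
        "SELECT a.storage_path"
        " FROM linked_artifact AS l"
        " JOIN artifact AS a ON a.id = l.artifact_id"
        " WHERE l.path = ?",
        (url_path,),
    ).fetchone()
    return row[0] if row else None
```

Because the whole path lives in one column, the lookup is a single equality match. Splitting it between `Publication` and `LinkedArtifact` would turn it into a prefix match plus a join.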

>  
> 
> 
> 
>     Example Data:
>     ==============================
> 
>     Publisher:
>     ----------------
>     publisher-1, ...
> 
> 
>     Artifact:
>     ----------------
>     artifact-1, /var/lib/pulp/artifact/ff/9f373839d0/manifest
>     artifact-2, /var/lib/pulp/artifact/b1/37b64a8c83/tiger.img
> 
> 
>     Publication:
>     ----------------
>     publication-1, publisher-1, 6-1-2017,..
> 
> 
>     LinkedArtifact:
>     ----------------
>     <id>, publication-1, artifact-1, /pulp/published/http/zoo/md/manifest
>     <id>, publication-1, artifact-2, /pulp/published/http/zoo/images/tiger.img
> 
> 
>     URLs would be: /pulp/published/(http|https)/<path>
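
For illustration, the URL layout above can be matched with a small regular expression. The helper name `split_published_path` is made up, and this is just one way a content app might separate the scheme directory from the rest of the path.

```python
import re
from typing import Optional, Tuple

# Mirrors the layout /pulp/published/(http|https)/<path>
PUBLISHED_URL = re.compile(r"^/pulp/published/(http|https)/(.+)$")


def split_published_path(url_path: str) -> Optional[Tuple[str, str]]:
    """Return (scheme_dir, path) for a published URL path, else None."""
    match = PUBLISHED_URL.match(url_path)
    if match is None:
        return None
    return match.group(1), match.group(2)
```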
> 
>     I think the core can have a single Apache configuration that defines 2 
> directories.  One HTTPS protected by
>     SSL/entitlement and the other is plain HTTP.
> 
> 
> We should also have the ability to serve a publication with https but not 
> entitlement enforcement. Auth is a
> separate layer in addition to SSL, and we should also prepare ourselves to 
> think about protecting published
> data with other kinds of auth besides just client SSL certs.
>  
> 
> 
> 
>     Thoughts/Comments?
> 
> 
> Thanks for starting this conversation! 
> 
> -- 
> 
> Michael Hrivnak
> 
> Principal Software Engineer, RHCE 
> 
> Red Hat
> 

_______________________________________________
Pulp-dev mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/pulp-dev
