On 06/28/2017 02:53 PM, Michael Hrivnak wrote:
> I'm generally a big believer in this direction, as many of you know. :) I
> think it is achievable, and from a plugin writer perspective, would be very
> similar to what they do today. Whereas in Pulp 2 a plugin creates a symlink
> on disk, in Pulp 3 it would add an entry to a database table with nearly
> the same information.
>
> More thoughts in-line.
>
> On Wed, Jun 28, 2017 at 2:27 PM, Jeff Ortel <[email protected]> wrote:
>
>     I have been doing some thinking about pulp3 publishing with the
>     following goals in mind:
>
>     - Eliminate symlinks.
>     - Eliminate need for each plugin to have its own Apache conf.
>     - Prevent orphaned content that is still published from being deleted.
>
>     The main concept is to store the relationship between an artifact and a
>     URL in the DB instead of using the filesystem. A `Publication` is
>     created (and owned) by a publisher. Each `Publication` is composed of
>     (linked to) many `artifacts`. The linkage contains the path component
>     of the URL which is used to locate the artifact referenced by a URL.
>
> A Publication should also be associated with a repo version. This is how
> we'll be able to know at publish time:
>
> - did any content change? If not, skip the publish unless publisher config
>   changed...
> - if content did change, what changes happened? When possible, do an
>   incremental publish based on this info.
>
> I think this is also the most natural way for a user to reason about
> whether any given publication is current, and if not, what differences it
> has from the repo contents. It gives the best visibility into what content
> they have, and what content is available to clients.

Makes sense.

>     This covers artifacts as we know them today. But what about files
>     generated during publishing, a.k.a. metadata? I propose that these
>     files be stored as artifacts as well. This requires an `Artifact` to be
>     redefined slightly. The definition would read more like:
>
>     "A file associated with either stored or published content".
>
>     Or, it would be even more generic, like:
>
>     "A file contained within the pulp inventory that may be associated with
>     a content (unit) or publication."
>
> There is enough difference between a file that's part of a unit vs. a file
> that Pulp created during a publish that I think they should be stored
> separately. I recognize that the tables would be very similar, if not
> identical, but I don't think we gain much from combining them.
>
> In practice I don't think we expect that a file would ever appear both as
> part of a Content unit and as something created by a publish task. They
> come from two very different places, which gives them different properties.
> Content likely has catalog entries, so those artifacts can be re-retrieved
> at any point, even transparently from the client perspective. Publication
> artifacts must be created by a publish task; if one is deleted, the whole
> publication should be re-created. These differences impact how users may
> back up their data, how replication may occur from one Pulp to another,
> caching behavior, etc.

Good points. I'd considered separate tables but wasn't convinced until now.

> Signing is interesting to consider. We don't have a good plan yet for
> supporting that, but we'll need it sooner rather than later. A user will
> want to sign a specific publication, usually by signing the primary
> metadata file. PULP_MANIFEST is a good example where the same one could
> easily be produced by multiple different publishers and repos that happen
> to contain the same files. Think about katello's multi-org use cases for
> example.
> If that manifest gets signed, we want the signature associated with this
> repo and this publication only, and never to appear with a different repo
> that happens to have the same content. So this signature needs an
> association with the publication itself in addition to an association with
> the file being signed. Maybe the signature itself is just another file
> associated with the publication.
>
> Here is another small detail, but an important one. If we decide that an
> artifact can be shared by multiple content units, we're already getting
> into territory where deleting that artifact must be done with care only if
> it is not associated with any other content. There's a race here that maybe
> we can overcome, but is very important to stay on top of. If we also must
> check a second association type to see if an artifact is associated with
> content OR a publication, that makes the race more complex.
>
>     In any case, the relationship to a content (unit) becomes optional.
>
>     Publications are not user facing. I think we can keep this as an
>     internal core concept. At least for the MVP.
>
>     The /var/lib/pulp/published directory goes away.
>
>     General Flows:
>
>     Publishing: "The publisher will compose a publication"
>
>     1. Publisher creates a publication using the plugin API.
>
> Does the publication have information about its base path, authorization,
> etc? We've relied on the publisher for that sort of thing previously, but
> maybe the publisher should use those settings as the defaults to impose
> onto a publication. Wouldn't it be slick to promote a publication just by
> changing the path it's made available at. Or add a second path it's made
> available at...

I considered storing the base path in the Publication. But I don't see how
the query using the /path/ component of the URL could be indexed if the path
is split between the Publication and the LinkedArtifact.
Adding authorization information to the Publication sounds like a good idea.

> Speaking of which, at some point I really want to disconnect the production
> of a publication from the serving of it. A publication could be made
> available several different ways via http (maybe several at the same time),
> written to an ISO, rsync'd somewhere, torrented, actively pushed to some
> other service, etc. There's already a huge demand for the ability to
> publish once, and promote or otherwise interact with that published thing.
> See for example the clone distributor that katello made for yum repos.
>
> I'm worried about biting all of this off now. As you said, if it's possible
> to just not expose this during the MVP, that might be best for us to add on
> all the additional concepts later. We should think through them up-front
> though to make sure we don't paint ourselves in a corner.
>
>     2. Publisher adds content artifacts to the publication.
>     3. Publisher generates some metadata files in the working dir.
>     4. Publisher adds the metadata files to the publication using the
>        plugin API. The artifacts can likely be created behind the scenes by
>        the plugin API.
>     5. Publisher commits (publishes) the publication. The plugin API
>        ensures this is atomic.
>
>     Client makes a GET request for content (or metadata):
>
>     1. Request is routed to the content (WSGI) application (just like in
>        pulp2 for RPM).
>     2. Query the `LinkedArtifact` table by URL path component to get the
>        artifact.
>     3. Forward the artifact storage path to:
>        <not stored locally>
>          streamer
>        <stored locally>
>          x-send
>
> We may want different cache behavior. Files associated with units should
> not change, so they can be cached for a long time. Files produced by a
> publish (PULP_MANIFEST, repomd.xml, etc.) can change at any time and should
> perhaps not be cached at all. It'll be important to differentiate what type
> of file is being returned.
>
>     4. Done.
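
To make the GET flow above a bit more concrete, here is a rough sketch in
Python. It is only an illustration under assumptions: the dict stands in for
the `LinkedArtifact` table, and `Artifact`, `resolve`, and the `downloaded`
flag are hypothetical names, not anything in the plugin API. Which of the two
example files is stored locally is also made up for the sake of the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Artifact:
    storage_path: str  # where the file lives (or will live) on disk
    downloaded: bool   # hypothetical flag: False when not stored locally


# Hypothetical stand-in for the LinkedArtifact table: URL path -> artifact.
LINKED_ARTIFACTS: Dict[str, Artifact] = {
    "/pulp/published/http/zoo/md/manifest": Artifact(
        "/var/lib/pulp/artifact/ff/9f373839d0/manifest", True),
    "/pulp/published/http/zoo/images/tiger.img": Artifact(
        "/var/lib/pulp/artifact/b1/37b64a8c83/tiger.img", False),
}


def resolve(url_path: str) -> Optional[Tuple[str, str]]:
    """Return (handler, storage_path) for a request path, or None if unknown."""
    artifact = LINKED_ARTIFACTS.get(url_path)
    if artifact is None:
        return None  # the content app would answer 404 here
    # Stored locally -> hand off to x-send; otherwise -> hand off to the streamer.
    handler = "x-send" if artifact.downloaded else "streamer"
    return handler, artifact.storage_path
```

The point is only that the content application needs a single lookup keyed on
the full URL path before it decides where to forward the request.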
>     Tables:
>     =============================
>
>     Publication
>       id [PK]
>       publisher_id [FK]
>       created
>       schemes
>
>     LinkedArtifact
>       id [PK]
>       publication_id [FK]
>       artifact_id [FK]
>       URL
>
> I'd call this relative_path instead of URL

This needs to be the full path component of the URL. Agreed URL isn't the
most accurate name but for the purposes of conveying the idea, I wanted to be
sure it was clear that it supported URL matching.

>     Example Data:
>     ==============================
>
>     Publisher:
>     ----------------
>     publisher-1, ...
>
>     Artifact:
>     ----------------
>     artifact-1, /var/lib/pulp/artifact/ff/9f373839d0/manifest
>     artifact-2, /var/lib/pulp/artifact/b1/37b64a8c83/tiger.img
>
>     Publication:
>     ----------------
>     publication-1, publisher-1, 6-1-2017,..
>
>     LinkedArtifact:
>     ----------------
>     <id>, publication-1, artifact-1, /pulp/published/http/zoo/md/manifest
>     <id>, publication-1, artifact-2, /pulp/published/http/zoo/images/tiger.img
>
>     URLs would be: /pulp/published/(http|https)/<path>
>
>     I think the core can have a single Apache configuration that defines 2
>     directories. One HTTPS protected by SSL/entitlement and the other is
>     plain HTTP.
>
> We should also have the ability to serve a publication with https but not
> entitlement enforcement. Auth is a separate layer in addition to SSL, and
> we should also prepare ourselves to think about protecting published data
> with other kinds of auth besides just client SSL certs.
>
>     Thoughts/Comments?
>
> Thanks for starting this conversation!
>
> --
> Michael Hrivnak
> Principal Software Engineer, RHCE
> Red Hat
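
For anyone who wants to poke at the table layout, below is a minimal sketch
using Python's sqlite3 module. It is an assumption-laden illustration, not
the proposed implementation: the column types, the `url` index, the value
stored in `schemes`, and the helper names `publish` and `lookup` are all
hypothetical. It shows two things from the proposal: the lookup by full URL
path can use a single index, and a publication plus its links can be
committed in one transaction.

```python
import sqlite3

SCHEMA = """
CREATE TABLE publication (
    id INTEGER PRIMARY KEY,
    publisher_id TEXT NOT NULL,
    created TEXT NOT NULL,
    schemes TEXT NOT NULL
);
CREATE TABLE artifact (
    id INTEGER PRIMARY KEY,
    storage_path TEXT NOT NULL
);
CREATE TABLE linked_artifact (
    id INTEGER PRIMARY KEY,
    publication_id INTEGER NOT NULL REFERENCES publication(id),
    artifact_id INTEGER NOT NULL REFERENCES artifact(id),
    url TEXT NOT NULL
);
-- The full path lives in one column so that it can be indexed directly.
CREATE INDEX linked_artifact_url ON linked_artifact (url);
"""


def publish(db, publisher_id, created, schemes, links):
    """Create a publication and all of its links in a single transaction."""
    with db:  # commits on success, rolls back if any statement raises
        cur = db.execute(
            "INSERT INTO publication (publisher_id, created, schemes) "
            "VALUES (?, ?, ?)",
            (publisher_id, created, schemes))
        publication_id = cur.lastrowid
        db.executemany(
            "INSERT INTO linked_artifact (publication_id, artifact_id, url) "
            "VALUES (?, ?, ?)",
            [(publication_id, artifact_id, url) for artifact_id, url in links])
    return publication_id


def lookup(db, url_path):
    """Return the artifact storage path for a URL path, or None."""
    row = db.execute(
        "SELECT a.storage_path FROM linked_artifact la "
        "JOIN artifact a ON a.id = la.artifact_id WHERE la.url = ?",
        (url_path,)).fetchone()
    return row[0] if row else None


# Load the example data from the message above into an in-memory database.
db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.executemany(
    "INSERT INTO artifact (id, storage_path) VALUES (?, ?)",
    [(1, "/var/lib/pulp/artifact/ff/9f373839d0/manifest"),
     (2, "/var/lib/pulp/artifact/b1/37b64a8c83/tiger.img")])
db.commit()
publish(db, "publisher-1", "6-1-2017", "http", [
    (1, "/pulp/published/http/zoo/md/manifest"),
    (2, "/pulp/published/http/zoo/images/tiger.img"),
])
```

If any link fails to insert, the `with db:` block rolls the whole thing back,
so a half-built publication never becomes visible to the lookup.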
_______________________________________________
Pulp-dev mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/pulp-dev
