Irrespective of the MVP proposal, but to address one of Michael's comments: incremental repo creation is not simply a performance optimization. Because of how yum clients work (and I can only attest to this empirically, since I have not read the code), yum repositories need to preserve a few generations of the (potentially compressed) XML files referenced by previous repomd.xml files. Otherwise, a yum client with a cached copy of repomd.xml may ask for a primary.xml.gz that was removed by a new publish, and things don't look pretty after the resulting 404. I think not accounting for this possibility in the MVP will result in a *functional* regression.
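To illustrate the retention concern, here is a minimal sketch. This is not Pulp code: the namespace URI is the standard repomd namespace, but the helper and file names are hypothetical. The idea is that a new publish may only delete files that are referenced by none of the retained repomd.xml generations.

```python
# Hypothetical sketch (not Pulp API): keep every file referenced by the
# retained generations of repomd.xml, so a client holding a cached (stale)
# repomd.xml can still fetch the files it lists.
import xml.etree.ElementTree as ET

NS = {"repo": "http://linux.duke.edu/metadata/repo"}

def referenced_files(repomd_xml: str) -> set:
    """Collect the 'location href' values a repomd.xml refers to."""
    root = ET.fromstring(repomd_xml)
    return {
        loc.attrib["href"]
        for loc in root.findall(".//repo:data/repo:location", NS)
    }

# Two generations of repomd.xml: a stale cached one and the current one.
# File names are made up for illustration.
OLD = """<repomd xmlns="http://linux.duke.edu/metadata/repo">
  <data type="primary"><location href="repodata/abc-primary.xml.gz"/></data>
</repomd>"""
NEW = """<repomd xmlns="http://linux.duke.edu/metadata/repo">
  <data type="primary"><location href="repodata/def-primary.xml.gz"/></data>
</repomd>"""

# Files that must survive the new publish = union over retained generations.
keep = referenced_files(OLD) | referenced_files(NEW)
print(sorted(keep))
```

Deleting `repodata/abc-primary.xml.gz` before the cache window expires is exactly what produces the 404 described above.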
But I absolutely love the idea of versioned repositories - see my attempt to address that with the https://github.com/sassoftware/pulp-snapshot distributor.

Michael, on your point number 4 - in Pulp 2 I was under the impression that the publisher is only responsible for creating a directory representation of a Pulp repository (in the case of the yum distributor, a directory layout of a yum repository). Apache is responsible for serving it further, with or without additional authentication. Are you suggesting more than this behavior for Pulp 3?

On Mon, Apr 24, 2017 at 9:30 AM, Michael Hrivnak <[email protected]> wrote:

> For publish, a plugin writer needs the ability to:
>
> - iterate through the units being published
> - create new artifacts based on that iteration, or any other method it sees fit
> - make each unit's files available at a specific path, either via http or on a file store (for example, docker manifest files need to be served directly by crane)
> - make each newly-created artifact available at a specific path, either via http or on a file store (for example, metadata files for crane don't get served via http)
>
> Optimizations in Pulp 2 further allow a plugin writer to read artifacts created by a previous publication. For example, the rpm plugin uses this to quickly add a few entries to an XML file instead of completely re-creating it. This may not strictly be required for the MVP, but its absence would likely create a substantial performance regression. Similarly, this requires the ability to determine which units have been added and removed since the last publish. See versioned repos below...
>
> As for making copies of unit files, I think if Pulp did that for each publish, it would become effectively unusable for a lot of users. At best, it would double the required storage, but for many users it would be much worse. It would also greatly increase the time required to perform a publish.
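Michael's requirement to determine which units have been added and removed since the last publish reduces to a set difference over unit keys. A minimal sketch, with made-up unit names (the real unit keys and data model are not defined here):

```python
# Sketch: diff two publishes by unit key to find added and removed units.
# Unit names are hypothetical placeholders, not a real Pulp data model.
previous_publish = {"pkg-a-1.0", "pkg-b-2.1", "pkg-c-0.9"}
current_repo = {"pkg-a-1.0", "pkg-b-2.2", "pkg-d-1.5"}

added = current_repo - previous_publish    # new since the last publish
removed = previous_publish - current_repo  # gone since the last publish

print(sorted(added))    # ['pkg-b-2.2', 'pkg-d-1.5']
print(sorted(removed))  # ['pkg-b-2.1', 'pkg-c-0.9']
```

With versioned repositories, `previous_publish` would simply be the unit set of the repo version the last publication references, which is what makes an incremental publish cheap to compute.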
> As such, I think the MVP should continue to store just one copy of each unit, including its files, similar to Pulp 2. How those files are referenced is an area we could definitely improve, though. From a plugin writer's perspective, it should be enough to tell the platform "make file X available at location Y", and not worry about whether copies, symlinks, or any other referencing method is being employed.
>
> As for recording which units are available with a publication... If we implement versioned repositories, then each repo version would be an addressable and immutable object with references to units. A publication would naturally then reference a repo version. How exactly we model the repo versions could go several ways, but they all include a single addressable object as far as I envision it. I promise I'll cook up a specific proposal in the near future. ;)
>
> On Mon, Apr 24, 2017 at 7:31 AM, Mihai Ibanescu <[email protected]> wrote:
>
>> Jeff,
>>
>> A few comments on your strawman:
>>
>> * What is an artifact? If it is a database model, then why not call it a unit, if that's what it's called everywhere else in the code?
>> * How would you deal with metadata-only units that don't have a file representation, but show up in some kind of metadata (e.g. package groups / errata)? associate() doesn't seem to give me that.
>> * For that matter, how would you deal with files that are not representations of units, but new artifacts (e.g. repomd.xml and the like)? It feels like it could be possible by extending my commit() to write the metadata and then call the parent class's commit() (which does the atomic publish), but I think that's not pretty.
>>
>> On Fri, Apr 21, 2017 at 5:09 PM, Jeff Ortel <[email protected]> wrote:
>>
>>> I like this, Brian, and want to take it one step further. I think there is value in abstracting how a publication is composed.
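The "make file X available at location Y" idea can be sketched briefly. This is a hypothetical helper, not a real Pulp 3 API: the plugin states intent, and the platform chooses the referencing method (a symlink here, but it could equally be a copy or a database record).

```python
# Hedged sketch of "make file X available at location Y": the plugin never
# decides symlink-vs-copy; the platform does. Helper names are made up.
import os
import shutil
import tempfile

def make_available(src, publish_root, rel_path, method="symlink"):
    """Expose *src* inside *publish_root* at *rel_path*."""
    dest = os.path.join(publish_root, rel_path)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if method == "symlink":
        os.symlink(src, dest)    # one stored copy, many references
    else:
        shutil.copy2(src, dest)  # duplicates storage on every publish

# Toy demonstration with a fake unit file.
store = tempfile.mkdtemp()
root = tempfile.mkdtemp()
unit_file = os.path.join(store, "foo-1.0.rpm")
with open(unit_file, "w") as f:
    f.write("rpm bytes")

make_available(unit_file, root, "Packages/f/foo-1.0.rpm")
print(os.path.islink(os.path.join(root, "Packages/f/foo-1.0.rpm")))  # True
```

Keeping the referencing method behind this seam is what lets the platform later switch from filesystem symlinks to database records without touching plugin code.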
>>> Files like metadata need to be composed by the publisher (as needed) in the working_dir, then "added" to the publication. Artifacts could be "associated" with the publication, and the platform determines how this happens (symlinks / in the DB).
>>>
>>> Assuming the Publisher is instantiated with a 'working_dir' attribute.
>>>
>>> ---------------------------------------
>>>
>>> Something like this to kick around:
>>>
>>> class Publication:
>>>     """
>>>     The Publication provided by the plugin API.
>>>
>>>     Examples:
>>>
>>>         A crude example with lots of hand waving.
>>>
>>>         In Publisher.publish():
>>>
>>>             publication = Publication(self.working_dir)
>>>
>>>             # Artifacts
>>>             for artifact in []:  # artifacts
>>>                 path = '<determine relative path>'
>>>                 publication.associate(artifact, path)
>>>
>>>             # Metadata created in self.staging_dir <here>.
>>>
>>>             publication.add('repodata/primary.xml')
>>>             publication.add('repodata/others.xml')
>>>             publication.add('repodata/repomd.xml')
>>>
>>>             # - OR -
>>>
>>>             publication.add('repodata/')
>>>
>>>             publication.commit()
>>>     """
>>>
>>>     def __init__(self, staging_dir):
>>>         """
>>>         Args:
>>>             staging_dir: Absolute path to where the publication is staged.
>>>         """
>>>         self.staging_dir = staging_dir
>>>
>>>     def associate(self, artifact, path):
>>>         """
>>>         Associate an artifact with the publication.
>>>         This could result in creating a symlink in the staging directory
>>>         or (later) creating a record in the db.
>>>
>>>         Args:
>>>             artifact: A content artifact.
>>>             path: Relative path within the staging directory AND eventually
>>>                 within the published URL.
>>>         """
>>>
>>>     def add(self, path):
>>>         """
>>>         Add a file within the staging directory to the publication by
>>>         relative path.
>>>
>>>         Args:
>>>             path: Relative path within the staging directory AND eventually
>>>                 within the published URL. When *path* is a directory, all
>>>                 files within the directory are added.
>>>         """
>>>
>>>     def commit(self):
>>>         """
>>>         When committed, the publication is atomically published.
>>>         """
>>>         # atomic magic
>>>
>>> On 04/19/2017 10:16 AM, Brian Bouterse wrote:
>>> > I was thinking about the design here and wanted to share some thoughts.
>>> >
>>> > For the MVP, I think a publisher implemented by a plugin developer would write all files into the working directory, and the platform would "atomically publish" that data into the location configured by the repository. The "atomic publish" step would copy/stage the files in a permanent location but would use a single symlink to the top-level folder to go live with the data. This would make atomic publication the default behavior. It runs after the publish() implemented by the plugin developer returns, once it has written all of its data to the working dir.
>>> >
>>> > Note that ^ allows the plugin writer to write the actual contents of files in the working directory instead of symlinks, causing Pulp to duplicate all content on disk with every publish. That would be an incredibly inefficient way to write a plugin, but it's something the platform would not prevent in any explicit way. I'm not sure whether this is something we should improve on or not.
>>> >
>>> > At a later point, we could add in the incremental publish, maybe as a method on a Publisher called incremental_publish(), which would only be called if the previous publish only had units added.
>>> >
>>> > On Mon, Apr 17, 2017 at 4:22 PM, Brian Bouterse <[email protected]> wrote:
>>> >
>>> >     For plugin writers who are writing a publisher for Pulp 3, what do they need to handle during publishing versus platform?
>>> >     To make a comparison against sync, the "Download API" and "Changesets" [0] allow the plugin writer to tell platform about a remote piece of content. Then platform handles creating the unit, fetching it, and saving it. Will there be a similar API to support publishing, to ease the burden on a plugin writer? Also, will this allow platform to have structured knowledge of a publication with Pulp 3?
>>> >
>>> >     I wanted to try to characterize the problem statement as two separate questions:
>>> >
>>> >     1) How will units be recorded to allow platform to know which units comprise a specific publish?
>>> >     2) What are plugin writers' needs at publish time, and what repetitive tasks could be moved to platform?
>>> >
>>> >     As a quick recap of how Pulp 2 works: each publisher would write files into the working directory, and then they would get moved into their permanent home. There is also the incrementalPublisher base machinery, which allowed for an additive publication that used the previous publish and was "faster". Finally, in Pulp 2, the only record of a publication is the symlinks on the filesystem.
>>> >
>>> >     I have some of my own ideas on these things, but I'll start the conversation.
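Brian's "atomic publish" via a single top-level symlink can be sketched concretely. This is a minimal illustration assuming POSIX rename semantics, with hypothetical paths and helper names, not the proposed platform implementation:

```python
# Sketch of atomic publish: stage the new publication in its own directory,
# then swap one top-level symlink with os.rename(), which atomically
# replaces the old link, so clients never see a half-published tree.
import os
import tempfile

def atomic_publish(staged_dir, live_link):
    tmp_link = live_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.unlink(tmp_link)
    os.symlink(staged_dir, tmp_link)
    os.rename(tmp_link, live_link)   # atomic replace on POSIX filesystems

# Toy demonstration: two publication generations and one live symlink.
base = tempfile.mkdtemp()
v1 = os.path.join(base, "publish-1")
v2 = os.path.join(base, "publish-2")
os.makedirs(v1)
os.makedirs(v2)
live = os.path.join(base, "repo")

atomic_publish(v1, live)
atomic_publish(v2, live)             # flips from publish-1 to publish-2
print(os.readlink(live).endswith("publish-2"))  # True
```

Note this also dovetails with the generational-metadata point at the top of the thread: because publish-1's directory still exists after the swap, its repomd.xml-referenced files can be retained for clients with stale caches.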
>>> >     [0]: https://github.com/pulp/pulp/pull/2876
>>> >
>>> >     -Brian

> --
> Michael Hrivnak
> Principal Software Engineer, RHCE
> Red Hat
_______________________________________________
Pulp-dev mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/pulp-dev
