I think the use case you outline about adding/removing a small subset of units is compelling, as I imagine a new version will almost always add or remove only a small subset of units relative to the latest version.
The performance concerns around option 1 are worth a closer look. Katello versions its content in a manner similar to the first option; it has been dealing with hundreds of millions of associations between versioned repos and content, and PostgreSQL has never been a problem (usually it's MongoDB). However, I'd like to run some benchmarks to confirm whether it'll be a problem. I talked to @dkliban and @bmbouter about this and we came up with an outline of how to test managing one billion association records:

http://pad-theforeman.rhcloud.com/p/pulp-postgresql-benchmarks

I'm planning on coding this up tomorrow in a Django console script and seeing how scaling up from 0 to 1 billion records affects Postgres' performance.

David

On Tue, Dec 12, 2017 at 12:26 PM, Michael Hrivnak <[email protected]> wrote:

> I expect both options to have equal ease of use for plugin writers.
>
> In both cases, I would expect the RepositoryVersion object to have a "content" attribute that returns a QuerySet. That's what the PR does currently, and the other approach could easily do the same.
>
> For adding and removing content, most plugins will let the core do that for them by using changesets. Any plugins that choose the DIY approach will do one of the following depending on which option is chosen:
>
> my_version.content.add(piece_of_content)
> my_version.add_content(piece_of_content)
>
> I don't think either places a burden on the plugin writer.
>
> If option 1 is chosen, some thought will be needed around where/when all the new relationships get made between a new version and its content. Would the core create an empty version and expect the plugin to fully populate it each time? Or would the core create a new version with the same content set as its predecessor, and then let the plugin add/remove as necessary?
>
> As for comparing versions, option 2 makes that very easy. Tracking the changes directly makes it easy to report on those changes quickly and efficiently.
>
> For background, option 2 was created to accommodate the most common use case, and the one where our users have proven most performance-sensitive: after an initial large sync of a repo, additional content trickles in as a series of small changes (think bug fixes on a RHEL release). The changes need to be fast to write (during sync) and fast to read (incremental publish, incremental applicability calculation, etc). Either approach will likely work fine on a lightly-loaded system. But in a heavily-loaded environment similar to where we see Pulp 2 often running, you likely would see a meaningful difference between 10 inserts and 10,000 inserts.
>
> The other motivation was the issue of scale. Postgresql is a great database, but lots of data is lots of data. Consider a user with 10 repos, 10k content units in each, and 10 versions of each. That's a very small use case, and already would be 1M associations under option 1. As any of those numbers increase, you quickly get to hundreds of millions of associations for even a medium-sized deployment, which can have real impact on query performance, index size (you want your index in RAM when possible), index updates, not to mention the time it takes for a database backup (or restore!). So if you want to go with option 1, I encourage seeking realistic performance expectations first.
>
> I'm happy to make the last few updates to the PR for option 2, but I suppose I should wait for this discussion to come to a conclusion first. Keep me posted if you want to green-light option 2.
>
> On Tue, Dec 12, 2017 at 9:28 AM, Jeremy Audet <[email protected]> wrote:
>
>> Gotcha. So, if I want to see whether or not some given piece of content is in a repository, then I need to iterate through every RepositoryContent related to a given RepositoryVersion, and check to see if any have a non-null version_added and a null version_removed, right?
>
> That's already done and isolated in one place. You would just access myversion.content() to get a queryset, and use it like any other. There should be no need for a plugin writer to see or understand the join logic, regardless of what that logic is.
>
> --
> Michael Hrivnak
> Principal Software Engineer, RHCE
> Red Hat
>
> _______________________________________________
> Pulp-dev mailing list
> [email protected]
> https://www.redhat.com/mailman/listinfo/pulp-dev
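
For anyone following the thread, here is a rough sketch of the two models in plain Python. It is not Django and not the code from the PR. The RepositoryContent fields version_added and version_removed come from the discussion above. Everything else is an assumption made for illustration: the helper names, the rule that version_removed = N means the unit is absent starting with version N, and the figure of 10 new rows per later version. The 10 repos / 10k units / 10 versions numbers are Michael's.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class RepositoryContent:
    """Option 2 row: one record per unit, covering a span of versions."""
    content: str
    version_added: int
    # Assumption: None means "still present"; N means "gone as of version N".
    version_removed: Optional[int] = None


def content_in_version(rows: Iterable[RepositoryContent], version: int) -> Set[str]:
    """Hypothetical helper: units visible in `version` under option 2."""
    return {
        r.content
        for r in rows
        if r.version_added <= version
        and (r.version_removed is None or r.version_removed > version)
    }


def option1_rows(repos: int, units: int, versions: int) -> int:
    """Option 1: every version gets its own association per unit."""
    return repos * units * versions


def option2_rows(repos: int, units: int, versions: int, added_per_version: int) -> int:
    """Option 2: initial sync rows plus rows for units added later.

    Removals update version_removed on an existing row, so they add no rows.
    """
    return repos * (units + (versions - 1) * added_per_version)


# Version 1: initial sync of three units.
rows = [RepositoryContent(name, version_added=1) for name in ("a", "b", "c")]
# Version 2: "d" trickles in and "a" is removed.
rows.append(RepositoryContent("d", version_added=2))
rows[0].version_removed = 2

v1 = content_in_version(rows, 1)
v2 = content_in_version(rows, 2)
print(sorted(v1), sorted(v2))

# Michael's example: 10 repos, 10k units each, 10 versions each.
opt1 = option1_rows(10, 10_000, 10)
# Assumed churn of 10 newly added units per later version.
opt2 = option2_rows(10, 10_000, 10, added_per_version=10)
print(opt1, opt2)
```

Under these assumptions the membership check is a single range filter, which is the kind of logic that could sit behind the content accessor without plugin writers needing to see it. The row counts only show how the two options scale with the number of versions; they say nothing about actual query or insert times, which is what the benchmark is for.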
