Re: [openstack-dev] [trove][all][tc] A proposal to rearchitect Trove

Zane Bitter Tue, 20 Jun 2017 07:24:59 -0700

On 18/06/17 07:35, Amrith Kumar wrote:

Trove has evolved rapidly over the past several years, since integration in IceHouse when it only supported single instances of a few databases. Today it supports a dozen databases including clusters and replication.
The user survey [1] indicates that while there is strong interest in the project, there are few large production deployments that are known of (by the development team).
Recent changes in the OpenStack community at large (company realignments, acquisitions, layoffs) and the Trove community in particular, coupled with a mounting burden of technical debt have prompted me to make this proposal to re-architect Trove.
This email summarizes several of the issues that face the project, both structurally and architecturally. This email does not claim to include a detailed specification for what the new Trove would look like, merely the recommendation that the community should come together and develop one so that the project can be sustainable and useful to those who wish to use it in the future.
TL;DR
Trove, with support for a dozen or so databases today, finds itself in a bind because there are few developers, and a code-base with a significant amount of technical debt.
Some architectural choices which the team made over the years have consequences which make the project less than ideal for deployers.
Given that there are no major production deployments of Trove at present, this provides us an opportunity to reset the project, learn from our v1 and come up with a strong v2.
An important aspect of making this proposal work is that we seek to eliminate the effort (planning, and coding) involved in migrating existing Trove v1 deployments to the proposed Trove v2. Effectively, with work beginning on Trove v2 as proposed here, Trove v1 as released with Pike will be marked as deprecated and users will have to migrate to Trove v2 when it becomes available.

I'm personally fine with not having a migration path (because I'm not personally running Trove v1 ;) although Thierry's point about choosing a different name is valid and surely something the TC will want to weigh in on.

However, I am always concerned about throwing out working code and rewriting from scratch. I'd be more comfortable if I saw some value being salvaged from the existing Trove project, other than as just an extended PoC/learning exercise. Would the API be similar to the current Trove one? Can at least some tests be salvaged to rapidly increase confidence that the new code works as expected?

While I would very much like to continue to support the users on Trove v1 through this transition, the simple fact is that absent community participation this will be impossible. Furthermore, given that there are no production deployments of Trove at this time, it seems pointless to build that upgrade path from Trove v1 to Trove v2; it would be the proverbial bridge from nowhere.
This (previous) statement is, I realize, contentious. There are those who have told me that an upgrade path must be provided, and there are those who have told me of unnamed deployments of Trove that would suffer. To this, all I can say is that if an upgrade path is of value to you, then please commit the development resources to participate in the community to make that possible. But equally, preventing a v2 of Trove or delaying it will only make the v1 that we have today less valuable.
We have learned a lot from v1, and the hope is that we can address that in v2. Some of the more significant things that I have learned are:
- We should adopt a versioned front-end API from the very beginning; making the REST API versioned is not a ‘v2 feature’
- A guest agent running on a tenant instance, with connectivity to a shared management message bus is a security loophole; encrypting traffic, per-tenant-passwords, and any other scheme is merely lipstick on a security hole

Totally agree here, any component of the architecture that is accessed directly by multiple tenants needs to be natively multi-tenant. I believe this has been one of the barriers to adoption.

- Reliance on Nova for compute resources is fine, but dependence on Nova VM specific capabilities (like instance rebuild) is not; it makes things like containers or bare-metal second class citizens
- A fair portion of what Trove does is resource orchestration; don’t reinvent the wheel, there’s Heat for that. Admittedly, Heat wasn’t as far along when Trove got started but that’s not the case today and we have an opportunity to fix that now


+1, obviously ;)

Although I also think Kevin's suggestion is worthy of serious consideration.

- A similarly significant portion of what Trove does is to implement a state-machine that will perform specific workflows involved in implementing database specific operations. This makes the Trove taskmanager a stateful entity. Some of the operations could take a fair amount of time. This is a serious architectural flaw
- Tenants should not ever be able to directly interact with the underlying storage and compute used by database instances; that should be the default configuration, not an untested deployment alternative
- The CI should test all databases that are considered to be ‘supported’ without excessive use of resources in the gate; better code modularization will help determine the tests which can safely be skipped in testing changes
- Clusters should be first class citizens not an afterthought, single instance databases may be the ‘special case’, not the other way around
- The project must provide guest images (or at least complete tooling for deployers to build these); while the project can’t distribute operating systems and database software, the current deployment model merely impedes adoption
- Clusters spanning OpenStack deployments are a real thing that must be supported
This might sound harsh, that isn’t the intent. Each of these is the consequence of one or more perfectly rational decisions. Some of those decisions have had unintended consequences, and others were made knowing that we would be incurring some technical debt; debt we have not had the time or resources to address. Fixing all these is not impossible, it just takes the dedication of resources by the community.
I do not have a complete design for what the new Trove would look like. For example, I don’t know how we will interact with other projects (like Heat). Many questions remain to be explored and answered.
Would it suffice to just use the existing Heat resources and build templates around those, or will it be better to implement custom Trove resources and then orchestrate things based on those resources?


(Context: Amrith and I discussed this already)

The idea here is that there are some things that the Heat 'workflow' doesn't handle by itself - for example, quiescing a server prior to rebuilding (as opposed to replacing) it. The most obvious way to do that (discussed in Amrith's next paragraph) is to drive it from some workflow outside of Heat, with a Heat stack update to rebuild the server as one of the steps. However, an alternative might be to implement custom Heat resources that codify the required workflow.

IMHO this doesn't really improve the problem described above ("This makes the Trove taskmanager a stateful entity. Some of the operations could take a fair amount of time. This is a serious architectural flaw") so much as move it around - Heat persists state at the resource level, but isn't really well set up to handle a lot of state within a resource.

Would Trove implement the workflows required for multi-stage database operations by itself,

One option to look at here is the taskflow library that Josh and others wrote. It works well for the case where the workflow can be hard-coded in code (which I think may fit this use case). It's already used by Cinder, and perhaps other projects.
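
To make "hard-coded in code" concrete, here is a rough sketch of a linear workflow with rollback, covering the quiesce/rebuild example mentioned earlier. It is plain Python and deliberately does not use taskflow's actual API; the step functions, the REBUILD_FLOW name and the ctx dictionary are all hypothetical and only stand in for real calls to a guest and to Nova or Heat.

```python
# Sketch only: plain Python, NOT the taskflow API. All names are hypothetical.

def quiesce(ctx):
    # Stand-in for asking the database to stop accepting writes.
    ctx["quiesced"] = True

def unquiesce(ctx):
    # Undo for quiesce, also reused as the final "resume" step.
    ctx["quiesced"] = False

def rebuild(ctx):
    # Stand-in for the server rebuild; fails on demand for illustration.
    if ctx.get("fail_rebuild"):
        raise RuntimeError("rebuild failed")
    ctx["rebuilt"] = True

def noop(ctx):
    # Used where a step has nothing to undo.
    pass

# The workflow is fixed at import time: (name, execute, revert) per step.
REBUILD_FLOW = [
    ("quiesce", quiesce, unquiesce),
    ("rebuild", rebuild, noop),
    ("resume", unquiesce, noop),
]

def run_flow(steps, ctx):
    """Run steps in order; on failure, revert completed steps in reverse."""
    completed = []
    for name, execute, revert in steps:
        try:
            execute(ctx)
        except Exception:
            for _, undo in reversed(completed):
                undo(ctx)
            raise
        completed.append((name, revert))
    return [name for name, _ in completed]
```

Note that this sketch keeps the list of completed steps in memory, so it does nothing about the statefulness concern raised above; a real implementation would have to persist progress somewhere.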

or would it rely on some other project (say Mistral) for this? Is Mistral really a workflow service, or just cron on steroids? I don’t know the answer but I would like to find out.

Mistral really is a workflow service. It uses YAML rather than Python to define workflows, so it's better than taskflow for the case where the workflow needs to be generated at runtime. Obviously it also has the advantage of a multi-tenant REST API, so it can provide a pluggability point for users to customise. It's possible that neither of those advantages is relevant in this situation.

One potential advantage of Mistral is that the workflows can be set up as part of a Heat template. If all of the workflows were set up like that, it would be easy for users to use the generated templates as a private database management layer on a cloud that didn't offer it as-a-Service.

The disadvantage, obviously, is that it requires the cloud to offer Mistral as-a-Service, which currently doesn't include nearly as many clouds as I'd like.

While we don’t have the answers to these questions, I think this is a conversation that we must have, one that we must decide on, and then as a community commit the resources required to make a Trove v2 which delivers on the mission of the project; “To provide scalable and reliable Cloud Database as a Service provisioning functionality for both relational and non-relational database engines, and to continue to improve its fully-featured and extensible open source framework.”[2]


+1

cheers,
Zane.

