On 18/06/17 07:35, Amrith Kumar wrote:
Trove has evolved rapidly over the past several years, since integration
in Icehouse, when it supported only single instances of a few databases.
Today it supports a dozen databases including clusters and replication.
The user survey [1] indicates that while there is strong interest in the
project, few large production deployments are known to the development
team.
Recent changes in the OpenStack community at large (company
realignments, acquisitions, layoffs) and the Trove community in
particular, coupled with a mounting burden of technical debt, have
prompted me to make this proposal to re-architect Trove.
This email summarizes several of the issues that face the project, both
structurally and architecturally. This email does not claim to include a
detailed specification for what the new Trove would look like, merely
the recommendation that the community should come together and develop
one so that the project can be sustainable and useful to those who wish
to use it in the future.
TL;DR
Trove, with support for a dozen or so databases today, finds itself in a
bind because there are few developers, and a code-base with a
significant amount of technical debt.
Some architectural choices which the team made over the years have
consequences which make the project less than ideal for deployers.
Given that there are no major production deployments of Trove at
present, this provides us an opportunity to reset the project, learn
from our v1 and come up with a strong v2.
An important aspect of making this proposal work is that we seek to
eliminate the effort (planning and coding) involved in migrating
existing Trove v1 deployments to the proposed Trove v2. Effectively,
with work beginning on Trove v2 as proposed here, Trove v1 as released
with Pike will be marked as deprecated and users will have to migrate to
Trove v2 when it becomes available.
I'm personally fine with not having a migration path (because I'm not
personally running Trove v1 ;) although Thierry's point about choosing a
different name is valid and surely something the TC will want to weigh
in on.
However, I am always concerned about throwing out working code and
rewriting from scratch. I'd be more comfortable if I saw some value
being salvaged from the existing Trove project, other than as just an
extended PoC/learning exercise. Would the API be similar to the current
Trove one? Can at least some tests be salvaged to rapidly increase
confidence that the new code works as expected?
While I would very much like to continue to support the users on Trove
v1 through this transition, the simple fact is that absent community
participation this will be impossible. Furthermore, given that there are
no production deployments of Trove at this time, it seems pointless to
build that upgrade path from Trove v1 to Trove v2; it would be the
proverbial bridge from nowhere.
This (previous) statement is, I realize, contentious. There are those
who have told me that an upgrade path must be provided, and there are
those who have told me of unnamed deployments of Trove that would
suffer. To this, all I can say is that if an upgrade path is of value to
you, then please commit the development resources to participate in the
community to make that possible. But equally, preventing a v2 of Trove
or delaying it will only make the v1 that we have today less valuable.
We have learned a lot from v1, and the hope is that we can address that
in v2. Some of the more significant things that I have learned are:
- We should adopt a versioned front-end API from the very beginning;
making the REST API versioned is not a ‘v2 feature’
- A guest agent running on a tenant instance, with connectivity to a
shared management message bus, is a security loophole; encrypting
traffic, per-tenant-passwords, and any other scheme is merely lipstick
on a security hole
Totally agree here, any component of the architecture that is accessed
directly by multiple tenants needs to be natively multi-tenant. I
believe this has been one of the barriers to adoption.
- Reliance on Nova for compute resources is fine, but dependence on Nova
VM specific capabilities (like instance rebuild) is not; it makes things
like containers or bare-metal second-class citizens
- A fair portion of what Trove does is resource orchestration; don’t
reinvent the wheel, there’s Heat for that. Admittedly, Heat wasn’t as
far along when Trove got started but that’s not the case today and we
have an opportunity to fix that now
+1, obviously ;)
Although I also think Kevin's suggestion is worthy of serious consideration.
- A similarly significant portion of what Trove does is to implement a
state-machine that will perform specific workflows involved in
implementing database specific operations. This makes the Trove
taskmanager a stateful entity. Some of the operations could take a fair
amount of time. This is a serious architectural flaw
- Tenants should not ever be able to directly interact with the
underlying storage and compute used by database instances; that should
be the default configuration, not an untested deployment alternative
- The CI should test all databases that are considered to be ‘supported’
without excessive use of resources in the gate; better code
modularization will help determine the tests which can safely be skipped
in testing changes
- Clusters should be first-class citizens, not an afterthought; single-
instance databases may be the ‘special case’, not the other way around
- The project must provide guest images (or at least complete tooling
for deployers to build these); while the project can’t distribute
operating systems and database software, the current deployment model
merely impedes adoption
- Clusters spanning OpenStack deployments are a real thing that must be
supported
This might sound harsh; that isn’t the intent. Each of these is the
consequence of one or more perfectly rational decisions. Some of those
decisions have had unintended consequences, and others were made knowing
that we would be incurring some technical debt; debt we have not had the
time or resources to address. Fixing all these is not impossible; it
just takes the dedication of resources by the community.
I do not have a complete design for what the new Trove would look like.
For example, I don’t know how we will interact with other projects (like
Heat). Many questions remain to be explored and answered.
Would it suffice to just use the existing Heat resources and build
templates around those, or will it be better to implement custom Trove
resources and then orchestrate things based on those resources?
(Context: Amrith and I discussed this already)
The idea here is that there are some things that the Heat 'workflow'
doesn't handle by itself - for example, quiescing a server prior to
rebuilding (as opposed to replacing) it. The most obvious way to do that
(discussed in Amrith's next paragraph) is to drive it from some workflow
outside of Heat, with a Heat stack update to rebuild the server as one
of the steps. However, an alternative might be to implement custom Heat
resources that codify the required workflow.
IMHO this doesn't really improve the problem described above ("This
makes the Trove taskmanager a stateful entity. Some of the operations
could take a fair amount of time. This is a serious architectural flaw")
so much as move it around - Heat persists state at the resource level,
but isn't really well set up to handle a lot of state within a resource.
Would Trove implement the workflows required for multi-stage database
operations by itself,
One option to look at here is the taskflow library that Josh and others
wrote. It works well for the case where the workflow can be hard-coded
in code (which I think may fit this use case). It's already used by
Cinder, and perhaps other projects.
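
To make the "hard-coded in code" idea concrete, here is a minimal
stdlib-only sketch of a linear workflow that reverts completed steps when
a later one fails. It is not taskflow's API, and the step names (quiesce,
rebuild, resume) and the fail_rebuild flag are hypothetical, borrowed
from the rebuild example earlier in this mail purely for illustration.

```python
from typing import Callable, Dict, List, Tuple

# A step is (name, execute, revert). All names are hypothetical.
Step = Tuple[str, Callable[[Dict], None], Callable[[Dict], None]]


def run_linear(steps: List[Step], ctx: Dict) -> bool:
    """Run steps in order; on failure, revert completed steps in reverse."""
    done: List[Step] = []
    for step in steps:
        name, execute, _ = step
        try:
            execute(ctx)
        except Exception as exc:
            ctx["log"].append(f"{name} failed: {exc}")
            # Undo only the steps that finished, most recent first.
            for rname, _, revert in reversed(done):
                revert(ctx)
                ctx["log"].append(f"reverted {rname}")
            return False
        done.append(step)
        ctx["log"].append(f"ran {name}")
    return True


def quiesce(ctx: Dict) -> None:
    ctx["quiesced"] = True


def unquiesce(ctx: Dict) -> None:
    ctx["quiesced"] = False


def rebuild(ctx: Dict) -> None:
    # Stand-in for "Heat stack update to rebuild the server".
    if ctx.get("fail_rebuild"):
        raise RuntimeError("stack update failed")
    ctx["rebuilt"] = True


def noop(ctx: Dict) -> None:
    pass


# The workflow itself is fixed in code rather than generated at runtime.
REBUILD_FLOW: List[Step] = [
    ("quiesce", quiesce, unquiesce),
    ("rebuild", rebuild, noop),
    ("resume", unquiesce, noop),
]

if __name__ == "__main__":
    ctx: Dict = {"log": [], "fail_rebuild": True}
    print(run_linear(REBUILD_FLOW, ctx), ctx["log"])
```

Note that this sketch keeps all progress in memory, so it does nothing
about the stateful-taskmanager concern; persisting each step's outcome
would be a separate piece of work.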
or would it rely on some other project (say
Mistral) for this? Is Mistral really a workflow service, or just cron on
steroids? I don’t know the answer but I would like to find out.
Mistral really is a workflow service. It uses YAML rather than Python to
define workflows, so it's better than taskflow for the case where the
workflow needs to be generated at runtime. Obviously it also has the
advantage of a multi-tenant REST API, so it can provide a pluggability
point for users to customise. It's possible that neither of those
advantages is relevant in this situation.
One potential advantage of Mistral is that the workflows can be set up
as part of a Heat template. If all of the workflows were set up like
that, it would be easy for users to use the generated templates as a
private database management layer on a cloud that didn't offer it
as-a-Service.
The disadvantage, obviously, is that it requires the cloud to offer
Mistral as-a-Service, and currently not nearly as many clouds do as I'd
like.
While we don’t have the answers to these questions, I think this is a
conversation that we must have, one that we must decide on, and then as
a community commit the resources required to make a Trove v2 which
delivers on the mission of the project: “To provide scalable and
reliable Cloud Database as a Service provisioning functionality for both
relational and non-relational database engines, and to continue to
improve its fully-featured and extensible open source framework.”[2]
+1
cheers,
Zane.