Hi,
On 08/26/2015 01:18 PM, Erik Huelsmann wrote:
Some more info:
It seems Sqitch wants to address the issue of multiple branches
receiving the same patches and upgrading from one branch to another,
like we want to support: https://github.com/theory/sqitch/issues/200
I reviewed some of this, and the Sqitch Postgres tutorial... it
sounds to me like we're not at the level of complexity that would
pose this kind of problem. The main things we should keep in mind are
to always declare dependencies; we can then make use of a "union
merge" if we don't want to resolve merge conflicts manually.
Regards,
Erik.
On Wed, Aug 26, 2015 at 9:13 PM, Erik Huelsmann <[email protected]> wrote:
Hi,
Hi all,
There are a few issues that I have with the way we
currently handle database schema changes.
Current situation
============
We have a main schema file for the table definitions and a
number of modules with stored procedures grouped by "subject".
Next to that, there's a "Fixes" file, which contains all
incremental updates to the schema since some version.
When I want to change the schema as part of our
development, I need to change the schema definition files
*and* I need to add the schema changes to the Fixes.sql file.
I agree this is less than ideal. I think we need to think of
a comprehensive tooling approach, and there are a few reasons
we haven't done this (mainly because I think getting the
tooling right is a lot of effort, and there have in the past
been other urgent things that needed more effort at the
beginning). However, I think this is growing in urgency as we
continue development.
Right. I'm merely summarizing for those who didn't know the
current situation and to support my line of reasoning. (I don't
think you think that I am, but for the record: I'm not criticizing
the current situation.)
As usual, I'll comment on how it's done elsewhere (Drupal) and my
experience with other upgrade systems...
Drupal ends up with a dual approach to schema upgrades: each module
provides a base schema that is up to date with the current version,
plus an arbitrary series of update scripts, each tagged with a
"schema level". The update scripts are skipped for new installs. The
"system" table tracks the schema level for each module; on upgrade,
Drupal runs all scripts with a schema level greater than the one
currently installed, updating the recorded level as each script
completes successfully.
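To make that concrete, here's a rough sketch of the mechanism in SQL
(the real implementation uses PHP hook_update_N() functions; the table
and column names here are from memory and may be off):

    -- The installed schema level is stored per module.
    CREATE TABLE system (
        module text PRIMARY KEY,
        schema_version integer NOT NULL
    );

    -- On upgrade: find the installed level, run every update script
    -- numbered above it, and bump the level after each success.
    SELECT schema_version FROM system WHERE module = 'node';
    -- ... run update script 7004 for the 'node' module ...
    UPDATE system SET schema_version = 7004 WHERE module = 'node';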
There is no rollback with this system, and while you can declare some
module updates as dependent on particular schema updates in other
modules, this is cumbersome and not widely used, and so there are often
race-condition-type issues on upgrades where one update can't complete
because it's dependent on an upgrade to another module that has not yet run.
I know there was talk of switching to a hash-based system rather than
a simple integer sequence, but I have not yet dug into Drupal 8 to see
whether there are any improvements here.
I did consult briefly on a Microsoft project on Azure, Drupal on SQL
Server, and they had some interesting tools for managing schema updates
that seem comparable to what Sqitch is attempting to do.
Problems with the current approach
==========================
The current approach has (IMO) several problems:
* Loading the "Fixes.sql" file, even in a production
environment, produces a truckload of ERRORs, because the
schema may already contain some or all of the fixes in the
file, leading to aborted transactions
* Making adjustments in multiple places can lead to
incompleteness on either side
* Fixes.sql is loaded twice; before and after loading the
stored procedures, making it unsuitable for some kinds of
fixes
* In order to change types or parameter lists of stored
procedures, they need to be dropped before being
(re)created, resulting in DROP TYPE and DROP FUNCTION all
over the place.
The high number of ERRORs in the log file when creating a
new database is especially disconcerting to new users.
I myself am running into the DROP TYPE and DROP FUNCTION
issue more and more often as my refactorings for 1.5 and 1.4-mc
reach deeper into the system.
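For the record, a minimal example of that pattern (the function and
type names are illustrative, not taken from our actual schema):

    -- PostgreSQL cannot change a function's return type in place, so
    -- a changed signature forces a drop-and-recreate:
    DROP FUNCTION IF EXISTS account__list_all();
    DROP TYPE IF EXISTS account_listing CASCADE;

    CREATE TYPE account_listing AS
        (id int, accno text, description text);

    CREATE FUNCTION account__list_all()
    RETURNS SETOF account_listing LANGUAGE sql AS
    $$ SELECT id, accno, description FROM account; $$;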
Additional requirements
=================
* Next to solving the problems above, some users have
expressed the desire to extend LedgerSMB locally. However,
the way we currently upgrade the system removes all
customizations (at least from 1.2->1.3->1.4; maybe not
1.4->1.5). If our changed approach could help solve this
requirement, all the better.
* Preferably, the chosen solution works with the
branching-and-merging strategy we use in our development cycle.
Agreed on all the above.
Good. Then we have good reasons to change :-)
Possible solutions
=============
One thing that I think we should do is separate
development from production when loading the database.
That is to say: when creating a new database, that is a
different step than upgrading a production database which
is in turn again very different from working on a
development database. Our processes should provide the best
workflow for each of those.
I am not sure I understand what you mean by this. Do you mean
that we need an extra build step? I would not be opposed to
that. But if that isn't what you mean, could you be more
specific?
Well, other than the standard Perl steps "copy to blib" and
"manify POD", we don't really have build steps of our own. I'm
indeed proposing that we have a build step. I envision it to be
something like this:
* Create PostgreSQL database
* Play Sqitch scripts
* Dump schema from the database
and then consider the dumped schema a build artifact which doesn't
need version control. This step would be executed before creating
a distribution archive.
The pre-distribution step should -if we go with Sqitch- also
include compilation of a Sqitch bundle, so we don't distribute
development progress, branches, or whatever that might be in the
Sqitch repository (which is - as far as I understand it - stored
in the development tree).
I'm not quite clear on all of this. Are you suggesting two different
Sqitch steps -- one for the data schema/model, and a different one for
the stored functions? Is this a recipe for building a clean database,
which would destroy all data on reversion? And thus something that you
would only run to get a clean environment -- but then you'd have a
different set of Sqitch scripts for all the functions that change/evolve
within a major release, including non-destructive schema changes? If
that's what you mean, then sounds good -- otherwise I'm lost as to the
purpose of this.
Proposed solution to the duplicate change problem
=====================================
In the past we have been talking about adding a "build"
step to LedgerSMB. This step could be used to produce a
single schema file for quickly setting up a new database
(company). This schema file would be a build artifact and
no longer be version controlled itself. It'd be the result
of a full schema creation step, including running the
Fixes.sql file.
Additionally, this step could be used to deliver an
up-to-date doc/database/ directory with current database
documentation.
I like that idea.
While this step may feel like it inhibits development, one
thing I'm thinking is that we may not need to require this
step to be executed on a development system, except when
testing the production deployment.
And testing the build system.
Exactly. I think that the Travis-CI tests (c/sh)ould include the
generation of the schema and testing off of that.
Proposed solution to the slew of errors from Fixes.sql
=======================================
There are actually a number of solutions here, as I see
it, all of them revolving around the idea that every
schema change should be applied once. Basically, I see 2
categories of solutions:
1. Do it ourselves
2. Re-use the work of others
The benefit of (1) is that we get full control and no
further dependencies for development or production.
However, the downside is that we get to do all the
thinking and problem solving as well.
In both categories I see similar solutions available:
a. Numbered changes
b. Named changes
and in (1) I see a solution that's not available in
category (2):
c. Use PostgreSQL EXTENSIONs
As for category (2), I haven't looked too far around yet,
but I did find sqitch (http://sqitch.org/); Sqitch offers
a command line tool for the development workflow.
Additionally, it provides facilities for deploying only
the necessary changes with releases *and* it provides an
API which we can use to upgrade the database from one
(patch) release to another.
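As far as I understand it so far, a Sqitch "change" is simply a named
entry in the plan, backed by three SQL scripts (deploy, revert,
verify). A sketch of what a deploy script could look like for us (the
change and table names are invented for the example):

    -- deploy/budget_tables.sql
    BEGIN;

    CREATE TABLE budget (
        id serial PRIMARY KEY,
        description text NOT NULL,
        start_date date NOT NULL
    );

    COMMIT;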
PG Extensions are definitely in the "reuse work of others"
category but the system solves a different problem than sqitch.
This is an interesting difference in views and we should explore
it a bit.
Indeed the two together could be a very powerful combination.
PG Extensions would allow us to ship db components (table
sets, stored procedures, etc.) in a versioned way. There is likely to be a
learning curve here and it adds a major build dependency for
people installing from our source packages. But it provides
some really nice features including versioning and dependency
tracking/enforcement.
What PG Extensions would allow us to do would be to
effectively ship versioned pieces of our db schema and have an
orderly way of upgrading them.
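From the database side, that would look something like this (the
extension name and version numbers are hypothetical; the matching
.control file and upgrade script would ship with our packages):

    -- Install a versioned piece of the schema:
    CREATE EXTENSION ledgersmb_budgets VERSION '1.4.0';

    -- Later, PostgreSQL applies the shipped
    -- ledgersmb_budgets--1.4.0--1.4.1.sql upgrade script via:
    ALTER EXTENSION ledgersmb_budgets UPDATE TO '1.4.1';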
Ok. So far, I'm not sure I see the difference with Sqitch, apart
from the fact that Sqitch is something we'd have to depend on while
extensions are built into PostgreSQL. One thing where I can see a
difference -- though my current estimate is that we wouldn't be
using it -- is managing *external* dependencies. But with Sqitch we
can manage internal dependencies, so, if that's what we'd be using
extensions for, what would be your envisioned difference?
For things like stored procedure sections, we could always
drop the extension and load a new version. But it would make
it very clean. And we could even break up parts of the db
into modules that could be independently versioned.
Getting this tooling to work right is not a minor project
however. It definitely cannot be done in a minor release
because of the additional dependencies, and I think we'd need
a policy against adding extensions to core in major versions.
However it would also provide a nice platform for third party
integration and reuse as well.
Each of our modules could be an extension and separately
versioned.
Ok. I like the technical side of that, but I'm not sure we need
that at this point. I *do* see value for this in terms of optional
modules. Say the way we used to have Budgetting as an extension in
1.3 (although it solves only 20% of the problem, because hooking
up Budgetting into the web-application was actually a bigger
problem than getting the tables installed).
I would generally be in favor of splitting up modules into independent
versions, but agree that's a lot of work that probably isn't as much a
priority now -- but thinking about it now is certainly appropriate. I
wonder what lessons we could take from the Budgeting module to define
extension points so that adding other optional modules is not so painful...
For branching and merging, the extensions would have a number
specific to development head and could be re-numbered during
the build process. For modules all we really need to do is
drop extension/create extension if the version number has changed.
We'd probably want to reserve a number series for "next in
version series." I am thinking that using 999 would be a good
one, so 1.4.999 would be "whatever next comes in 1.4" and this
would allow us to install and run out of the development tree.
-1 on using 1.4.999 as "next 1.4 version". If you're on 1.4.15,
developing for 1.4.16, that interim development at 1.4.999 sorts
greater than the eventual release! I would prefer more established
semantics, mostly like we've been doing: 1.4.15 -> 1.4.16-dev ->
1.4.16-alpha1 -> 1.4.16-beta1 -> 1.4.16-rc1 -> 1.4.16.
Drupal has taken to adding "+5" to indicate 5 commits past whatever the
most recent release is -- e.g. 1.4.16-beta1+5. Don't use 1.4.999, unless
that's a dev release of 1.5.0!
Hmmm. Sqitch -- like version control systems -- does this by itself
(in this case by generating a hash, as Git and other DVCSes do).
As for sqitch, I have no experience with it yet. It's
supposed to work well with branching and merging. One
thing it *does* do is integrate with version control tools,
and it is meant to integrate with the project's repository.
Ideally when merging and branching, no additional
processing is required to integrate the changes from
branches. However, I'm not exactly sure that's what
happens (given
https://groups.google.com/forum/#!searchin/sqitch-users/merge/sqitch-users/GXqgt7nJ_1k/Vvg-r1HOEqMJ)
but I think the referenced link is about reordering
changes which already have been partially deployed. What I
like about Sqitch is that it integrates with the VC
system, but is VC-system agnostic: some of us can use it
with Git while others can keep using Hg-git as they
currently do.
Ok so if we ship our storage schema as an extension (or group
of extensions) we still need to have ordered versioning going on.
Well, it seems to me that that's what Sqitch does, *unless* you
have a database which is running some development version. From
tag to tag (which would be from release to release), ordering
doesn't seem to matter. The problem the page seems to refer to --
but I should probably subscribe to the mailing list and ask this
question -- is a database which is left in an intermediate state.
My reading from the tutorial is that each individual step may declare a
dependency on some earlier step. The discussion mentioned above is more
about handling merges from multiple branches.
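From what I can tell, you declare a dependency with "sqitch add
--requires <change>"; it lands in the plan file and gets noted in the
header of the generated deploy script. Something like this (the
change names are invented):

    -- Deploy ledgersmb:budget_lines to pg
    -- requires: budget_tables

    BEGIN;

    CREATE TABLE budget_line (
        budget_id integer NOT NULL REFERENCES budget(id),
        amount numeric NOT NULL
    );

    COMMIT;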
I am thinking that sqitch may allow us to do the grunt work of
the schema change scripts as a part of the build phase (we
have to strip out begin/commit statements since create/alter
extension adds those), but it seems like it would be very
helpful in managing that side.
I do see the value of this -- but it looks like an awful lot of grunt
work creating these schema change scripts, if you also have to carefully
create matching revert and verify scripts for every change!
Seems like a lot of discipline to pull this off -- certainly something
that's worth doing if we aim to deliver a quality package.
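To make the cost concrete, the matching scripts for the hypothetical
budget_tables change above would be roughly:

    -- revert/budget_tables.sql
    BEGIN;
    DROP TABLE budget;
    COMMIT;

    -- verify/budget_tables.sql: make the deploy fail if the table
    -- or columns are missing; WHERE false keeps it cheap.
    SELECT id, description, start_date
      FROM budget
     WHERE false;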
I am curious about some of the tools that can supposedly detect the
differences between two database schemas and generate a script to
convert one into the other. I think I did run across a couple of
modules that claimed to do this for SQL Server. I have no experience
with these, or knowledge about their reliability. But I'm lazy ;-)
If we don't have something like that, Sqitch seems like a great tool in
the arsenal, and something that seems quite approachable/understandable,
and something we should be able to easily hook up in Travis.
Yea. Maybe this too is a question to ask Sqitch's mailing list.
What I *am* sure about is that (1) will be completely
change-oriented. *Maybe* we can use sqitch with the
current module-based source code organization.
If we go with extensions we don't need to. We just bump the
version number on each change of a module and keep a catalog
of what is required (and tests to verify of course). We can
then drop/create extensions as needed for the procedural
modules, and we can generate migration scripts for the for the
storage extension(s). The big difference is that instead of
one db version we would have a bunch of components that were
versioned and likely a requirement that these match on login.
Hmm. Somehow that doesn't sound like a really attractive
outlook to me: more things that (apparently) can go wrong.
... and now do we need our own dependency manager? :-)
I'm growing quite fond of Composer in the PHP world, and ran across
Carton for managing Perl dependencies when putting together the Docker
image. Carton definitely looks helpful for managing the installation of
specific versions of CPAN modules to get to a consistent state. I wonder
if that could be extended to cover Postgres extensions...
If we're going to go with Sqitch, I don't think it's a
good idea to switch just then and there, but test it to
build up some experience and then choose a well defined
point in time to start using it.
Agreed. Maybe start on plugins?
Actually, I was thinking of starting with 1.5 *after* it's been
upgraded from 1.4? However, that would mean that *everybody* would
need to migrate through 1.5.0, because that's the moment we'll
move to Sqitch? (Or we could choose 1.6 or ...?)
So, there are 2 proposals here. What do you say?
I don't see how we can get away from doing some things
ourselves, so I guess the question is what and where to reuse
other work.
So my preference would be sqitch and pg extensions. In my
mind this would require a two-step build process for dbs:
1. Build the schema extensions using sqitch
2. Install all pg extensions using make.
Ok. Is this on releases only, or with our day-to-day work on the tree?
My view on using Sqitch for everything in our schema would be to
replace the current schema loading with a call to Sqitch's API. Of
course we have to do that ourselves, but it's not in the category
that I'd call "do something ourselves", because all we have to do
is create the code to delegate to "the other project's code". But
figuring out *how* to do that, yes, that's our work.
The first is a developer-only task (sort of like running YACC
is when building PostgreSQL -- not even required when
installing from source packages) so there isn't a user-visible
dependency there.
The second though does impose some user dependencies in that
pg_config and other pg development tools must be installed
before the build can be done. Of course for those installing
debs, these would be done when building debs rather than when
installing them.
If we use Sqitch only, we also need it as a dependency, both for
developers and for end users. However, we probably don't need to
teach our users about it if we simply use it as a library.
Not sure if you've looked at the Docker builds I put together a couple
months ago. They're at https://hub.docker.com/r/ledgersmb/ledgersmb/ .
At least for that installation path (which I think is becoming more
popular by the minute) the Sqitch approach is far easier to support, if
I understand these options... I can easily install Sqitch inside the
container and script whatever we need to do there to update the schemas
in the database. However, I explicitly did not build Postgres into those
images, assuming you would connect the LSMB container to an external
Postgres container (or server). So installing an extension in Postgres
becomes a much more challenging proposition -- I'm assuming you need
access to the Postgres environment to be able to install an extension?
If you can install a Postgres extension over a TCP connection to
Postgres, I suppose we can script that -- but at least from my
standpoint Sqitch sounds much easier to deal with.
Another potential challenge -- can you install extensions in a
cloud-provisioned Postgres server, such as Amazon RDS?
Unless I completely misunderstand how you install extensions, I think
they would cause some significant challenges to those who want to
install in a cloud or container environment.
So that is my preference and why.
Thanks for your response!
With my reaction above, I'm wondering how to get to a next step.
We probably need to be practical in some sense and just start
*somewhere*. But the question would be *where* and *with what* --
in the sense that the *what* question seems to be a choice between
(as we now know them) (a) Sqitch or (b) Sqitch+Extensions.
It does strike me that there's still a question of what schema changes
destroy data. From the Sqitch tutorial, I'm not seeing why we need a
build step as described earlier -- couldn't our setup simply run the
Sqitch deployment under the hood? Unless that's slow enough that
flattening it to a single database dump file we can quickly load is a
big enough win... that is, other than the consideration of how do we
safely provide tools to roll database schema changes backwards and
forwards on production databases, without losing data?
How does Sqitch track what has been deployed, and what has not? (Just
thinking with my Docker hat on -- is this important metadata that needs
to get preserved when lsmb containers get destroyed/recreated?)
Overall I like the looks of Sqitch and would favor that approach...
Cheers,
John Locke
http://www.freelock.com