Hi David,
On Sun, Oct 23, 2016 at 2:50 AM, David G <lsmb...@sbts.com.au> wrote:
> As of 1.5, we have a great database schema maintenance tool in place
> (sql/changes/) which remembers which changes have already been applied to
> the schema and which still have to be applied. This new tool has gotten rid
> of the truckload of warnings and errors that the repeated loading of
> Fixes.sql gave us on 1.3 and 1.4.
>
> Preparing to merge my MC branch to master for 1.6, I'm thinking that the
> sql/changes/ infrastructure works perfectly to handle schema changes.
>
> However, if such a schema change is accompanied by the need to run a data
> migration, the sql/changes/ infrastructure suddenly doesn't work so well
> anymore, because I expect that we'll want to improve on data migrations as
> we gain knowledge of the various data sets out there.
>
> Regardless of the technical limitations of the sql/changes/
> infrastructure, I'm also envisaging that we'll want to do "pre-migration
> checks" and allow users to correct data before the data migration (to a
> changed schema) is executed, if they have data that's non-compliant. While
> the pre-migration checks are definitely an easy extension to the
> sql/changes/ infrastructure, it's hard to provide feedback from the checks
> as it is. The other requirement (fixing data) can't be resolved at all.
>
> My current thinking is that schema improvement can become a continuous,
> incremental process instead of the one-off huge jumps we had on 1.3 and
> 1.4, *if* we find a way to address these requirements:
>
> Just to clarify, the following points are all targeted at checks and
> changes to DATA to ensure it's consistent as per a new schema.
> For example, there could be a case where a date that should always be set
> has, historically, sometimes been left empty.
> A new schema may now have a constraint requiring it to be set, but the
> dataset first needs to have the missing dates added.
>
That, or, in the case of the MC branch, the original data migration is
likely to overlook some scenarios in which data needs to be migrated. My expectation
is that we'll need two types of actions here:
a. Run an adjusted data migration on all new migrations
b. Run (once and only once) a migration-adjustment on migrations run prior
to the improvements
What I'm looking for is a structure / design which satisfies these
requirements so we can start coding the migrations.
> 1. Run pre-checks on schema changes
>
> 1a. generate a "pre-check" report
>
While I generally agree, we currently run all changes in a single call to
the database schema manager, which applies change after change until it
hits the first unsuccessful one; that leaves little room to pause and
report between individual changes.
> 2. Allow users to adjust / add / delete data in order for the pre-checks
> to be satisfied
>
> 2a. generate a "post-corrections" report
>
Our current changes are coded entirely in SQL, with a small Perl driver to
make sure the changes are applied in the right order and only once (the
last bit is the improvement over Fixes.sql).
From the 1.3/1.4 days, we have some pretty extensive SQL included in our
LedgerSMB::Upgrade_Tests module. Those checks run every time an upgrade is
performed and are not tied to a specific change being applied. I'd rather
not use a similar approach with the sql/changes/ infrastructure, because
for any change that has already been applied, we don't need to run its
pre-migration checks again.
> 3. Run post-schema-change data migrations
> 4. Allow data migration for the same change to be tweaked (i.e. different
> data migrations == different sha-sums)
> 5. Allow additional data migrations to be run *iff* (if-and-only-if) a
> specific data migration has been run in the past (my idea here is to allow
> the data to be "upgraded" according to progressive insight)
>
> 6. Ask the user to save the two reports and contact the team on #ledgersmb
> <https://riot.im/app/#/room/%23ledgersmb:matrix.org> to discuss what
> needed manual adjustment.
> The logs can be provided *if* required.
>
Ok. That works (for those users who don't mind sharing part of their data).
However, you skipped the complicating factor that we may want to run
"follow-ups to (3)" in later migrations, if a specific version of the
migration script has been run before. Or am I making things too complex now?
> It may be possible to come up with a way to anonymise the logs and provide
> an optional step to do so.
>
> I would really love to hear your thoughts on the topic!
>
> Other than my above insertions, I think it's a sound plan that not only
> improves our ease of data migration, but should also drastically simplify
> imports from other software such as SQL-Ledger (from which we have had
> quite a few migrations in the last 12 months) and hopefully others such as
> QuickBooks and MYOB.
>
:-) Well, the plan is there, but now I'd like us to come up with a
structure that might actually *do* this. One thing I can think of is that
we add a file next to the change file which holds one or more queries to do
the pre-migration checks. These queries would be executed, and when they
return zero rows, the schema change is good to go.
This idea doesn't address the supposedly required adjustment screens nor
the post-migration scripts for migrating the data in the various envisioned
versions. One step at a time though.
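To make that concrete, here's a minimal sketch of what such a check query
could look like, based on the missing-date example above (table and column
names are purely illustrative):

  -- hypothetical check accompanying a change that adds a NOT NULL
  -- constraint on ar.duedate; zero rows means the change is good to go,
  -- any rows returned are the records the user still needs to fix
  SELECT id, invnumber, transdate
    FROM ar
   WHERE duedate IS NULL
   ORDER BY transdate;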
I'll assume that everybody is ok with the idea of having a second file with
pre-checks. Then the next step will be: what to do if one of the queries
*does* return a number of (failing) rows? Then we need:
- A way to tell the change-caller about the failing pre-checks
- A way for the caller to send feedback to the user
I'm formulating these in abstract terms, because I don't think we should do
what older code does: assume that we're in a web-app context and simply send
a response to a presumed request. The caller is responsible for handling
user feedback. The question then becomes: what information does the user
need to be able to resolve the data issues?
My thought here is that the user needs:
- An explanation of what the change itself is trying to do
- An explanation of the condition that failed
- An explanation of the available corrective actions as the development
team sees them
- For every technical field in the query's rows, a mapping to the
associated description
- A list of failing records, with a clear indication of the column holding
the failing values; it's important to sort on this column in a way that
makes the failure (such as duplicates) immediately apparent
- Help filling out the correct value (e.g. a list to select from if the
values are restricted)
Optionally, the user should have the opportunity to delete records or add
records in the target table, if that's marked as a suitable resolution
strategy.
Once more assuming you agree with me here, I'll take the next step and
think about how to implement the above. While I can envision the
change-applying code taking a callback which receives the data required to
provide the user with the indicated feedback, no provisions for that have
been made in the "check queries" proposal so far.
I'm wondering: if we name check files as follows: <change>.check.<n>.sql,
and include one verification query in each file, then we can introduce a
header which provides:
- An explanation of the condition that failed
- An explanation of the available corrective actions
The query which produces the failing records is already there, as that was
the point of introducing check queries in the first place.
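To make the proposal a bit more tangible, such a check file could look
something like the sketch below; the header syntax (structured SQL
comments) is just one possible convention, and the file name, table and
column are again purely illustrative:

  -- ar_duedate_not_null.check.1.sql  (hypothetical file name)
  --
  -- Condition: the change adds a NOT NULL constraint on ar.duedate;
  --   the invoices listed below have no due date and would block it.
  -- Corrective actions: fill in a due date for each listed invoice,
  --   or delete invoices that were entered in error.
  SELECT id, invnumber, transdate
    FROM ar
   WHERE duedate IS NULL;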
Putting the description of what the change is achieving in each check file
doesn't seem like a good idea, yet including it in the change file doesn't
seem like a good idea either: that changes the checksum, and we don't
*ever* want to change the checksum after a change has been created, because
the checksum is what prevents double application.
Better ideas?
Proceeding to the next topic:
As we intend to get the user feedback in a web-app situation, I think the
callback needs to be able to:
- stop the changes from being applied
- indicate problem resolution in these three variations:
* delete a row
* provide an alternate value
* insert a new row with values provided
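Just to illustrate what would eventually hit the database for each of those
variations (again using the purely illustrative due-date example; how the
callback communicates the user's choice back to the change-applying code is
exactly the part that still needs designing):

  -- delete a row, e.g. when the record was entered in error
  DELETE FROM ar WHERE id = 1042;          -- id purely illustrative

  -- provide an alternate value for the offending column
  UPDATE ar SET duedate = '2016-10-31' WHERE id = 1043;

  -- insert a new row with values provided; mostly relevant when the
  -- failure is a missing record rather than a missing value
  INSERT INTO ar (invnumber, transdate, duedate)
  VALUES ('INV-0001', '2016-10-01', '2016-10-31');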
I haven't thought about this in depth yet, but it seems we have enough to
think about with the above.
Let's continue exchanging views to get the best solution possible.
BTW, this discussion isn't triggered by my desire to move MC to master, but
rather by my desire to add a few NOT NULL constraints to our database where
I *know* conflicting data exists.
--
Bye,
Erik.
http://efficito.com -- Hosted accounting and ERP.
Robust and Flexible. No vendor lock-in.