Changeset: 4d456a4a0434 for MonetDB
URL: http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=4d456a4a0434

Added Files:
        monetdb5/modules/mal/replication.mx

Modified Files:
        tools/merovingian/ChangeLog.Jul2012
        tools/merovingian/client/monetdb.1
        tools/merovingian/daemon/forkmserver.c
        tools/merovingian/daemon/merovingian.c
        tools/merovingian/utils/properties.c

Branch: replicationms
Log Message:
replicationms: initial commit (backout of 877b04706e12)
Bring back master-slave replication work in replicationms branch.

diffs (truncated from 1653 to 300 lines):

diff --git a/monetdb5/modules/mal/replication.mx b/monetdb5/modules/mal/replication.mx
new file mode 100644
--- /dev/null
+++ b/monetdb5/modules/mal/replication.mx
@@ -0,0 +1,1490 @@
+@/
+The contents of this file are subject to the MonetDB Public License
+Version 1.1 (the "License"); you may not use this file except in
+compliance with the License. You may obtain a copy of the License at
+http://www.monetdb.org/Legal/MonetDBLicense
+
+Software distributed under the License is distributed on an "AS IS"
+basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
+License for the specific language governing rights and limitations
+under the License.
+
+The Original Code is the MonetDB Database System.
+
+The Initial Developer of the Original Code is CWI.
+Portions created by CWI are Copyright (C) 1997-July 2008 CWI.
+Copyright August 2008-2012 MonetDB B.V.
+All Rights Reserved.
+@
+
+@f replication
+
+@c
+/*
+ * @a Martin Kersten
+ * @v 1.0
+ * @+ Database replication
+ * MonetDB supports a simple database replication scheme using a master-slave
+ * protocol. A master node keeps a log of all SQL updates for replay.
+ * Once a slave starts, the master establishes
+ * a MAL-client connection to the slave and starts pumping the backlog
+ * of committed transactions.
+ * The master does not take any responsibility for the integrity of a slave.
+ * The master may, however, decide to suspend
+ * forwarding updates to prepare for e.g. administration or shutdown.
+ *
+ * It is the slave's responsibility to be resilient against duplicate
+ * transmission of the MAL-update backlog. A transaction id
+ * can be given to catch up from transactions already replayed.
+ * Transaction ids before the minimum available in the log
+ * directory lead to freezing the slave.
+ * Then rebuilding from
+ * scratch is required.
+ *
+ * The replication scheme does not support SQL schema modifications.
+ * Instead, the slaves should be initialized with a complete copy
+ * of the master schema and the database.
+ *
+ * Turning an existing database into a master and creating a single
+ * slave works as follows.
+ *
+ * step 1) Turn the database into a replication master by setting its
+ * "master" property to true using monetdb(1). This property is translated
+ * by merovingian(1) into the database variable "replication_master" and is
+ * set upon database (re)start. Note that this setting cannot be added to a
+ * running database.
+ *
+ * step 2) Create a dump of the master database using the msqldump(1) tool.
+ *
+ * step 3) To initiate a slave, simply load the master snapshot.
+ *
+ * step 4) Run monetdb(1) to turn the database into a slave by setting its
+ * "slave" property to the URI of the master.
+ * The precise URI can be obtained by issuing the command
+ * 'mclient -lmal -dmaster -s"u := master.getURI(); io.printf(\"%s\n\", u);"' on the master.
+ * The slave property is translated by merovingian(1) into the database variable "replication_slave"
+ * and is set upon database (re)start. Note that this setting cannot be added to a running database.
+ *
+ * The slave starts synchronizing with the master automatically upon each session restart.
+ * A few SQL wrapper procedures and functions can be used to control it manually.
+ * For example, the slave can temporarily suspend receiving log replays using suspendSync()
+ * and reactivate it afterwards with resumeSync().
+ * A resumeSync() is also needed after creating a relation that is already known
+ * to the master, because the master may already have sent updates for it;
+ * the slave closed the log stream when the target table was unavailable.
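+ *
+ * As an illustration, the four steps above might look as follows for a
+ * master database named "master" and a slave named "slave". This is a
+ * sketch only: the exact monetdb(1)/msqldump(1) invocations and the
+ * stop/start ordering are assumptions, not verified here.
+ * @verbatim
+ * $ monetdb stop master
+ * $ monetdb set master=true master
+ * $ monetdb start master
+ * $ msqldump master > master.sql
+ * $ mclient -dslave master.sql
+ * $ monetdb stop slave
+ * $ monetdb set slave=mapi:monetdb://gio.ins.cwi.nl:50000/master slave
+ * $ monetdb start slave
+ * @end verbatim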
+ *
+ * The function freezeSlaves() removes the log files and makes sure that all
+ * existing slaves won't be able to catch up other than by re-initializing the
+ * database using e.g. a checkpoint.
+ * @verbatim
+ * CREATE PROCEDURE suspendSync() EXTERNAL NAME slave."stop";
+ * CREATE PROCEDURE resumeSync() EXTERNAL NAME slave."sync";
+ * CREATE FUNCTION synchronizing() RETURNS boolean EXTERNAL NAME slave."synchronizing";
+ *
+ * CREATE PROCEDURE freezeSlaves() EXTERNAL NAME master."freeze";
+ * CREATE PROCEDURE suspendSlaves() EXTERNAL NAME master."stop";
+ * CREATE PROCEDURE resumeSlaves() EXTERNAL NAME master."start";
+ * CREATE FUNCTION master() RETURNS string EXTERNAL NAME master."getURI";
+ * CREATE FUNCTION cutOffTag() RETURNS string EXTERNAL NAME master."getCutOffTag";
+ * @end verbatim
+ *
+ * It is possible to make a slave database also a master for descendants.
+ * In such a situation the database carries both a master and a slave property.
+ * Such a scheme allows hierarchical replication, or making
+ * additional tables available in the replication stream. Note that
+ * at this point replication from multiple masters, e.g. to combine a full
+ * set from a set of partitioned masters, is not yet possible.
+ *
+ * Beware, turning off the "master" property leads to automatic removal of all
+ * left-over log files. This renders the master database unusable for replication
+ * and the state of the slaves becomes frozen.
+ * To restore replication in such a case, both master and
+ * slaves have to be reinitialised using the aforementioned steps.
+ *
+ * @- Behind the scenes
+ * When the replication_master environment is set, an optimizer
+ * becomes active to look after updates on SQL tables and to prepare
+ * for producing the log files. The snippet below illustrates the
+ * modifications made to a query plan.
+ *
+ * @verbatim
+ * function query():void
+ * master:= "mapi:monetdb://gio.ins.cwi.nl:50000/dbmaster";
+ * fcnid:= master.open();
+ * ...
+ * sql.append("schema","table","col",b:[:oid,:int]);
+ * master.append("schema","table","col",b,fcnid);
+ * ...
+ * t := mtime.current_timestamp();
+ * master.close(fcnid,t);
+ * end query;
+ * @end verbatim
+ *
+ * At runtime this leads to buffers being filled with the statements
+ * required for the slaves to catch up.
+ * Each query block is stored in its own buffer and sent at
+ * the end of the query block. This separates the concurrent
+ * actions on the database at the master and leads to a serial
+ * execution of the replication operations within the slave.
+ *
+ * The log records are stored in a file "dbfarm/db/master/log%d-%d" with the
+ * following structure:
+ * @verbatim
+ * function slave.tag1(transactionid:int,stamp:timestamp);
+ * barrier doit:= slave.open(transactionid);
+ * sql.transaction();
+ * tag1_b := bat.new(:oid,:int);
+ * ...
+ * bat.insert(tag1_b,3:oid,232:int); #example update
+ * ...
+ * sql.append("schema","table","col",tag1_b,tag);
+ * slave.close(transactionid,stamp);
+ * sql.commit();
+ * exit doit;
+ * end tag1;
+ * slave.tag1(1,"2009-09-03 15:49:45.000":timestamp);
+ * slave.drop("tag1");
+ * @end verbatim
+ *
+ * The slave.open() simply checks the replica log administration table
+ * and ignores duplicate attempts to roll the database forward.
+ *
+ * The operations are executed in the same serial order as on the master,
+ * which should lead to the same optimistic transactional behavior.
+ * All queries are considered to run in auto-commit mode, because
+ * the SQL frontend does not provide the hook (yet) for better transaction
+ * boundary control.
+ * The transaction identifier is part of the call to the function
+ * with the transaction update details.
+ *
+ * @- Interaction protocol
+ * The master node simply waits for a slave to request the transmission of the missing log files.
+ * The request includes the URI of the slave and the user credentials needed to establish a connection.
+ * The last parameter is the last known transaction id successfully re-executed.
+ * The master forks a thread to start flushing the backlog files.
+ *
+ * Grouping the operations in temporary MAL functions
+ * makes it easy to skip their execution when we detect
+ * that they have been executed before.
+ *
+ * @- Log file management
+ * The log records are grouped into separate files.
+ * They are the units for re-submission and the scheme is set up to be idempotent.
+ * A slave always starts synchronizing using the maximal tag stored in the slave log.
+ *
+ * The log files ultimately pollute your database and have to
+ * be (re)moved. This is considered a responsibility of the DBA,
+ * for it involves making a checkpoint or securely storing the logs
+ * in an archive. It can be automated by asking all slaves
+ * for their last transaction id and purging all obsolete files.
+ *
+ * Any error recognized during the replay should freeze the slave,
+ * because the synchronization integrity might become compromised.
+ *
+ * Aside from being limited to auto-commit transactions, the current
+ * implementation scheme has a hole. The log record is written just
+ * before transaction commit, including the activation call.
+ * The call and the flush of the commit record to the SQL
+ * log should be one atomic action, which amounts to a commit
+ * sequence spanning two 'databases'. It can only be handled when
+ * the SQL commit becomes visible at the MAL layer.
+ * [ Or, inject the transaction approval record into the log file
+ * when the next query starts, checking for any transaction
+ * errors first.]
+ *
+ * COPY INTO commands cause the master to freeze the images of
+ * all slaves, because capturing the input file and forwarding it to
+ * the slaves seems overly complicated.
+ *
+ * The slave invalidation scheme is rather crude. The log directory
+ * is emptied and a new log file is created.
+ * Subsequent attempts
+ * by the slaves to access transaction ids from before the invalidation
+ * are flagged as errors.
+ *
+ * @- Wishlist
+ * After setting the slave property, the slave could initiate full synchronization
+ * by asking for a catalog dump and replaying the logs, provided they
+ * have been kept around since the start.
+ * Alternatively, we can use the infrastructure for Octopus to pull the data from the master.
+ * For both we need msqldump functionality in the SQL code base.
+ *
+ * A slave property can be set to a list of masters, which turns the
+ * slave into one serving multiple sources. It calls for splitting
+ * the slave log.
+ *
+ * The tables in the slave should be set read-only, otherwise we
+ * have to double-check integrity and bail out of replication on a violation.
+ * One solution is to store the replicated database in its own
+ * schema and grant read access to all users.
+ * [show example how to set up ]
+ *
+ * A validation script (or database diff) might be helpful to
+ * assess the database content for possible integrity violations.
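+ *
+ * A hypothetical sketch of the own-schema approach mentioned above;
+ * the schema and table names are invented for illustration, and the
+ * replication stream would then have to target the replica schema:
+ * @verbatim
+ * CREATE SCHEMA replica;
+ * CREATE TABLE replica.orders (id int, total decimal(10,2));
+ * GRANT SELECT ON replica.orders TO PUBLIC;
+ * @end verbatim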
+ */
+@mal
+module master;
+
+command open():oid
+address MASTERopen
+comment "Create a replication record";
+
+command close(tag:oid):void
+address MASTERclose
+comment "Close the replication record";
+
+command start():void
+address MASTERstart
+comment "Restart synchronisation with the slaves";
+
+command stop():void
+address MASTERstop
+comment "Stop synchronisation of the slaves";
+
+command freeze():void
+address MASTERfreeze
+comment "Invalidate all copies maintained at slaves";
+
+pattern append(mvc:ptr, s:str, t:str, c:str, :any_1, tag:oid):ptr
+address MASTERappendValue
+comment "Dump the scalar on the MAL log";
+
+pattern append(mvc:ptr, s:str, t:str, c:str, b:bat[:oid,:any_1], tag:oid):ptr
+address MASTERappend
+comment "Dump the BAT on the MAL log";
+
+pattern delete(s:str, t:str, b:bat[:oid,:any_1], tag:oid):void
+address MASTERdelete
+comment "Dump the BAT with deletions on the MAL log";
+
+pattern copy(sname:str, tname:str, tsep:str, rsep:str, ssep:str, ns:str, fname:str, nr:lng, offset:lng, tag:oid):void
+address MASTERcopy
+comment "A copy command leads to invalidation of the slave's image. A dump restore will be required.";
+
+pattern replay(uri:str, usr:str, pw:str, tag:oid):void
+address MASTERreplay
+comment "Slave calls the master to restart sending the missing transactions
+from a certain point as a named user.";
+
+command sync(uri:str, usr:str, pw:str, tag:oid):void
+address MASTERsync
+comment "Login to slave with credentials to initiate submission of the log records";
+
+command getURI():str
+address MASTERgetURI
+comment "Return the URI for the master";
+
+command getCutOffTag():oid
+address MASTERgetCutOffTag
+comment "Return the cutoff tag for transaction synchronization";
+
+command prelude():void
+address MASTERprelude
+comment "Prepare the server for the master role. Or remove any leftover log files.";
+
+module slave;
+
+command sync():void
+address SLAVEsyncDefault
+comment "Login to master with environment credentials to initiate submission of the log records";
+command sync(uri:str):void
+address SLAVEsyncURI
+comment "Login to master with admin credentials to initiate submission of the log records";
+command sync(uri:str, usr:str, pw:str, tag:oid):void
+address SLAVEsync
+comment "Login to master uri with admin credentials to initiate submission of the log records";
+
+command stop():void
+address SLAVEstop
+comment "Slave suspends synchronisation with master";

_______________________________________________
Checkin-list mailing list
[email protected]
http://mail.monetdb.org/mailman/listinfo/checkin-list
