On Fri, 24 Jul 2009 17:29:48 -0400, Michael Izioumtchenko
<[email protected]> wrote:
> Hi,
> 
> Alex Yurchenko wrote:
>> On Fri, 24 Jul 2009 10:56:20 +0200, Paul McCullagh
>> <[email protected]> wrote:
>>> On Jul 23, 2009, at 3:15 PM, Stewart Smith wrote:
>>>
>>>> On Tue, Jul 21, 2009 at 09:28:54PM -0700, MARK CALLAGHAN wrote:
>>>>> How is the serial log to be kept in sync with a storage engine given
>>>>> the Applier interface? MySQL uses two phase commit, but the Applier
>>>>> interface has one method, ::apply(). In addition to methods for
>>>>> performing 2PC, keeping a storage engine and the serial log in sync
>>>>> requires additional methods for crash recovery to support commit or
>>>>> rollback of transactions in state PREPARED in the storage engine
>>>>> depending on the outcome recorded in the serial log.
>>>> The bit that keeps banging in my head in regards to this is storing it
>>>> in the same engine as part of the transaction and so avoiding 2pc.
>>> We discussed this on Drizzle Day, and that was my recommendation.
>>>
>>> This would mean, after a transaction has committed, the replication  
>>> system asks the engine for a "list of operations" that were performed  
>>> by the transaction.
> 
> I welcome the idea of a meaningful conversation between the replication
> system and the engine. Mark's crash recovery challenge could probably be
> solved by asking the engine to store a little piece of data in the redo
> log; then, in the course of normal engine crash recovery, the engine
> reports it back to replication, so replication knows exactly what to
> replay.
> One thing I'm confident cannot be solved without it is the problem where
> the application 'optimizes' redo log flushing, as is done with
> innodb_flush_log_at_trx_commit.
> 
> otoh I hope that 'ask for a list of operations after the commit' is just
> an algorithm description, not the actual implementation. I think the
> communication could be made into something more simultaneous, especially
> for the regime where the engine is normally asked to do row by row
> operations.
> 
>>>
>>> For engines that have this information in their transaction log, it is
>>> a relatively simple task.
>> 
>> This is absolutely the way to go. From the replication perspective,
>> picking log events straight from the storage engine transactional log
>> would save quite a lot of IO and CPU. In addition those log events can be
>> in the engine native form, so a blast to apply on a slave.
> 
> not necessarily a blast, since depending on the engine workings the same
> redo wouldn't necessarily easily apply on another node. redo generation
> logic is usually optimized for the immediate task at hand, which is crash
> recovery on the same system without slowing down normal operation too
> much, so adjusting it for replication tends to lag behind. Oracle logical
> replication is an example of how complex it can become. Oracle's physical
> replication is an example of when it becomes a blast, as you say, but then
> the slave tends to have limited functionality, which takes effort to
> overcome.

Well, it is surely possible to optimise redo log records to the extent that
they are not applicable anywhere but on this one server. However, if a
storage engine vendor optimizes the records to be a blast to apply on a
slave, then it follows logically that they might be a tad slower for crash
recovery but still be a blast (by virtue of being a blast on a slave). Now
if you think how often you might be doing crash recovery and how often you
might be applying replication events on a slave... speed on a slave wins
hands down.
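
To make the point concrete, here is a minimal sketch of what I mean by
records that are applicable on any node. Everything in it is hypothetical
(the LogRecord layout, the apply_committed name, the dict standing in for a
table); it is not the format of any real engine. The assumption is that
records are logical (row level, no page or offset references), so the very
same routine serves both crash recovery and slave apply:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical logical redo record: row level, no page/offset references."""
    txn_id: int
    op: str                      # "put", "delete" or "commit"
    key: Optional[str] = None
    value: Optional[str] = None


def apply_committed(log: List[LogRecord], table: Dict[str, str]) -> None:
    """Replay, in log order, only the transactions that have a commit record."""
    committed = {r.txn_id for r in log if r.op == "commit"}
    for r in log:
        if r.txn_id not in committed:
            continue             # uncommitted work is simply skipped
        if r.op == "put":
            table[r.key] = r.value
        elif r.op == "delete":
            table.pop(r.key, None)


log = [
    LogRecord(1, "put", "a", "1"),
    LogRecord(2, "put", "b", "2"),   # txn 2 never commits
    LogRecord(1, "put", "c", "3"),
    LogRecord(1, "delete", "a"),
    LogRecord(1, "commit"),
]

recovered: Dict[str, str] = {}
apply_committed(log, recovered)      # crash recovery on the master

replica: Dict[str, str] = {}
apply_committed(log, replica)        # replication apply on the slave
```

Both calls end up with the same table contents because nothing in the
records depends on where they were generated. A physical, page-oriented
record would not have that property, which is Michael's objection.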

They just gotta stop being oblivious to replication requirements, them
bloody storage engine developers!

Regards,
Alex

> Thanks,
> Michael
> 
>> 
>>>> Then, in 99.9% of cases when there are not cross engine
>>>> transactions, we never need 2pc.
>>>>
>>>> Although I haven't given this intense deep thought as to various  
>>>> corner cases...
>>>>
>>>> -- 
>>>> Stewart Smith
>>>>
>>>> _______________________________________________
>>>> Mailing list: https://launchpad.net/~drizzle-discuss
>>>> Post to     : [email protected]
>>>> Unsubscribe : https://launchpad.net/~drizzle-discuss
>>>> More help   : https://help.launchpad.net/ListHelp
>>>
>>>
>>> --
>>> Paul McCullagh
>>> PrimeBase Technologies
>>> www.primebase.org
>>> www.blobstreaming.org
>>> pbxt.blogspot.com
>>>
>>>
>>>
>>>
>>
-- 
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

