[jira] [Updated] (HDFS-3885) QJM: optimize log sync when JN is lagging behind

Todd Lipcon (JIRA) Thu, 06 Sep 2012 19:58:10 -0700

     [ 
https://issues.apache.org/jira/browse/HDFS-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Todd Lipcon updated HDFS-3885:
------------------------------

    Attachment: hdfs-3885.txt

It wasn't easy to figure out how to write a unit test for this change, but I 
verified it manually as follows:

- Started a 3-node QJM cluster.
- Ran strace -efdatasync,write -f <pid of one JN>.
- Wrote lots of txns to the NN. strace showed a lot of fdatasync and write 
calls, mostly alternating (write a chunk, fdatasync, write a chunk, fdatasync, 
etc).
- Paused that JN with kill -STOP for 10-15 seconds.
- Resumed that JN with kill -CONT.
- Saw a bunch of write() calls with no fdatasync calls while the JN was still 
catching up. After it caught up, it started syncing again.

I also verified that it caught up much faster with this change in place.
                
> QJM: optimize log sync when JN is lagging behind
> ------------------------------------------------
>
>                 Key: HDFS-3885
>                 URL: https://issues.apache.org/jira/browse/HDFS-3885
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: QuorumJournalManager (HDFS-3077)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hdfs-3885.txt
>
>
> This is a potential optimization that we can add to the JournalNode: when one 
> of the nodes is lagging behind the others (e.g., because its local disk is 
> slower or there was a network blip), it receives edits after they've been 
> committed to a majority. It can tell this because the committed txid included 
> in the request info is higher than the highest txid in the actual batch to be 
> written. In this case, we know that this batch has already been fsynced to a 
> quorum of nodes, so we can skip the fsync() on the laggy node, helping it to 
> catch back up.
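
For illustration only, the check described above could be sketched roughly as 
follows. This is not the code in hdfs-3885.txt: the class name, method name, 
and parameter names are hypothetical, and the strict "higher than" comparison 
simply mirrors the wording of the description.

```java
// Hypothetical sketch of the lag check; names are illustrative and are not
// taken from the attached patch or from the JournalNode code.
final class LagAwareSyncPolicy {
    private LagAwareSyncPolicy() {}

    /**
     * Decides whether a batch of edits needs a local fsync.
     *
     * @param lastTxIdInBatch highest txid in the batch about to be written
     * @param committedTxId   committed txid carried in the request info
     * @return true if this node should fsync the batch, false if it may skip
     */
    static boolean shouldFsync(long lastTxIdInBatch, long committedTxId) {
        // A committed txid beyond the end of this batch means a quorum of
        // nodes has already synced these edits, so this node is lagging.
        boolean lagging = committedTxId > lastTxIdInBatch;
        return !lagging;
    }
}
```

With this sketch, a batch ending at txid 100 that arrives with committed txid 
250 would be written without an fsync, while a batch ending at txid 300 with 
the same committed txid would still be synced.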

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
