[ https://issues.apache.org/jira/browse/HBASE-10278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13943504#comment-13943504 ]

Himanshu Vashishtha commented on HBASE-10278:
---------------------------------------------

Attached is a patch based on the new model I mentioned in my last comment.

h3. Overall design:
1. During FSHLog instantiation, open a reserved writer.
2. The SyncRunners do the sync to the file system as they do now. They register themselves in an 'inflight-sync-ops' map before starting a sync, and unregister themselves when done.
3. A monitoring thread, SyncOpsMonitor, periodically iterates over the inflight-sync-ops map and feeds the start time of each sync op to the configured WALSwitchPolicy (a sketch of this interaction follows the list).
4. If the switch policy decides to make the switch, it goes through the steps listed in "WAL Switch workflow" below.
5. If there is a concurrent log roll going on, the switch request is ignored.
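
To make the monitor/policy split concrete, here is a minimal sketch of how these pieces could fit together. The names (WALSwitchPolicy, ThresholdSwitchPolicy, SyncOpsMonitor, inflightSyncOps) are illustrative only and are not necessarily the exact classes/fields in the patch:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Decides whether a slow in-flight sync warrants switching to the reserved writer. */
interface WALSwitchPolicy {
  boolean shouldSwitch(long syncStartTimeMs, long nowMs);
}

/** Simple threshold policy: switch if any in-flight sync is older than the threshold. */
class ThresholdSwitchPolicy implements WALSwitchPolicy {
  private final long thresholdMs;
  ThresholdSwitchPolicy(long thresholdMs) { this.thresholdMs = thresholdMs; }
  public boolean shouldSwitch(long syncStartTimeMs, long nowMs) {
    return (nowMs - syncStartTimeMs) > thresholdMs;
  }
}

/** Periodically scans the in-flight sync ops and asks the policy about each one. */
class SyncOpsMonitor implements Runnable {
  // SyncRunner id -> start time of the sync op it is currently executing.
  private final Map<Integer, Long> inflightSyncOps = new ConcurrentHashMap<>();
  private final WALSwitchPolicy policy;
  private final long scanIntervalMs;
  private volatile boolean running = true;

  SyncOpsMonitor(WALSwitchPolicy policy, long scanIntervalMs) {
    this.policy = policy;
    this.scanIntervalMs = scanIntervalMs;
  }

  /** Called by a SyncRunner right before it issues the file-system sync. */
  void register(int syncRunnerId) {
    inflightSyncOps.put(syncRunnerId, System.currentTimeMillis());
  }

  /** Called by a SyncRunner once its sync has completed. */
  void unregister(int syncRunnerId) {
    inflightSyncOps.remove(syncRunnerId);
  }

  public void run() {
    while (running) {
      long now = System.currentTimeMillis();
      for (long start : inflightSyncOps.values()) {
        if (policy.shouldSwitch(start, now)) {
          requestSwitch();   // ignored if a log roll is already in progress
          break;
        }
      }
      try {
        Thread.sleep(scanIntervalMs);
      } catch (InterruptedException e) {
        running = false;
      }
    }
  }

  private void requestSwitch() {
    // Kicks off the "WAL Switch workflow" described below.
  }
}
{code}

The threshold here stands in for the configurable "switch threshold" used in the test runs further down (1 sec and 100 ms).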

h4. WAL Switch workflow
A code skeleton of this path follows the numbered steps.
1. Grab the roll writer lock to ensure there is no concurrent log roll; the roll is done after switching.
2. Block the processing of the RingBufferHandler and let it reach a 'safe point'. A safe point is just a marker saying that the RingBuffer is blocked at this sequence Id. Let that sequence Id be 'X'.
3. Take the 'inflight' WALEdits (the Append ops) and SyncFutures from all the SyncRunners, and also from the RingBufferHandler (the latter could be in the process of forming a SyncFuture batch while appending WALEdits). Preserve the ordering of the Append ops. Ignore SyncFutures with sequence Id > 'X'.
4. Use the reserved writer to append-sync these inflight edits.
5. Swap the writer with the reserved writer.
6. Release all SyncFutures (to free up the handlers), recreate the SyncRunners, and interrupt the old SyncRunners.
7. Release the RingBuffer and resume normal processing.
8. Roll the old writer.
9. Release the rollWriter lock.
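
For readability, here is a compilable skeleton of the same steps. All types, fields, and helper methods are placeholders I made up to mirror the list above, not FSHLog's actual API:

{code}
import java.io.IOException;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative-only skeleton of the switch path; not the real FSHLog code. */
class WalSwitchSketch {
  /** Stand-ins for the real WAL writer and the in-flight work items. */
  interface Writer {
    void append(byte[] edit) throws IOException;
    void sync() throws IOException;
  }
  static class Append { final byte[] edit; Append(byte[] e) { edit = e; } }
  static class PendingSync { void done() { /* wakes up the blocked handler */ } }

  private final ReentrantLock rollWriterLock = new ReentrantLock();
  private Writer currentWriter = newNoOpWriter();
  private Writer reservedWriter = newNoOpWriter();

  void switchWal() throws IOException {
    rollWriterLock.lock();                                   // 1. no concurrent log roll
    try {
      long safePointSeqId = blockRingBufferAtSafePoint();    // 2. RingBuffer blocked at seq id X

      // 3. Collect in-flight work from the SyncRunners and the RingBufferHandler,
      //    preserving append order; sync futures with seq id > X are ignored.
      List<Append> inflightAppends = drainInflightAppends();
      List<PendingSync> inflightSyncs = drainInflightSyncs(safePointSeqId);

      // 4. Append-sync the in-flight edits on the reserved writer.
      for (Append a : inflightAppends) {
        reservedWriter.append(a.edit);
      }
      reservedWriter.sync();

      // 5. Swap writers: the reserved writer becomes the current one.
      //    (Presumably a fresh reserved writer is opened later for the next switch.)
      Writer oldWriter = currentWriter;
      currentWriter = reservedWriter;

      // 6. Free the blocked handlers and replace the SyncRunners.
      for (PendingSync s : inflightSyncs) {
        s.done();
      }
      recreateSyncRunners();           // old ones are interrupted and allowed to die

      // 7. + 8. Resume normal processing and roll the stuck writer out of the way.
      unblockRingBuffer();
      rollWriter(oldWriter);
    } finally {
      rollWriterLock.unlock();         // 9. release the rollWriter lock
    }
  }

  // Placeholders for the pieces of FSHLog this sketch leans on.
  private long blockRingBufferAtSafePoint() { return 0L; }
  private List<Append> drainInflightAppends() { return java.util.Collections.emptyList(); }
  private List<PendingSync> drainInflightSyncs(long upToSeqId) { return java.util.Collections.emptyList(); }
  private void recreateSyncRunners() { }
  private void unblockRingBuffer() { }
  private void rollWriter(Writer w) { }

  private static Writer newNoOpWriter() {
    return new Writer() {
      public void append(byte[] edit) { /* placeholder: the real writer appends to HDFS */ }
      public void sync() { /* placeholder: the real writer hflushes the pipeline */ }
    };
  }
}
{code}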


h3. Testing
I tested HLogPerformanceEvaluation with trunk on a 5-node cluster running Hadoop 2.2.
{code}
On trunk:
Performance counter stats for '/home/himanshu/dists/hbase-0.99.0-SNAPSHOT/bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLogPerformanceEvaluation -iterations 1000000 -threads 10':

    1891960.295558 task-clock                #    2.396 CPUs utilized
        55,076,890 context-switches          #    0.029 M/sec
         1,770,901 CPU-migrations            #    0.936 K/sec
            73,650 page-faults               #    0.039 K/sec
 2,853,602,378,588 cycles                    #    1.508 GHz                     [83.32%]
 2,126,410,331,760 stalled-cycles-frontend   #   74.52% frontend cycles idle    [83.31%]
 1,274,582,986,073 stalled-cycles-backend    #   44.67% backend  cycles idle    [66.72%]
 1,511,777,502,744 instructions              #    0.53  insns per cycle
                                             #    1.41  stalled cycles per insn [83.37%]
   264,303,859,957 branches                  #  139.698 M/sec                   [83.33%]
     7,946,652,758 branch-misses             #    3.01% of all branches         [83.33%]

     789.767027189 seconds time elapsed

Trunk + patch, with switch threshold = 1 sec:
Performance counter stats for '/home/himanshu/10278-patch/hbase-0.99.0-SNAPSHOT/bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLogPerformanceEvaluation -iterations 1000000 -threads 10':

    1937313.168376 task-clock                #    2.450 CPUs utilized
        54,774,802 context-switches          #    0.028 M/sec
         1,981,573 CPU-migrations            #    0.001 M/sec
            63,150 page-faults               #    0.033 K/sec
 2,967,414,126,620 cycles                    #    1.532 GHz                     [83.33%]
 2,198,851,794,211 stalled-cycles-frontend   #   74.10% frontend cycles idle    [83.33%]
 1,394,951,252,428 stalled-cycles-backend    #   47.01% backend  cycles idle    [66.68%]
 1,627,172,938,178 instructions              #    0.55  insns per cycle
                                             #    1.35  stalled cycles per insn [83.36%]
   279,686,885,670 branches                  #  144.368 M/sec                   [83.34%]
     8,362,175,551 branch-misses             #    2.99% of all branches         [83.32%]

     790.709682812 seconds time elapsed

Trunk + patch, with switch threshold = 100 ms:
Performance counter stats for '/home/himanshu/10278-patch/hbase-0.99.0-SNAPSHOT/bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLogPerformanceEvaluation -iterations 1000000 -threads 10':

    1926591.375141 task-clock                #    2.416 CPUs utilized
        55,231,306 context-switches          #    0.029 M/sec
         1,996,458 CPU-migrations            #    0.001 M/sec
            62,600 page-faults               #    0.032 K/sec
 2,938,081,049,913 cycles                    #    1.525 GHz                     [83.34%]
 2,174,078,968,852 stalled-cycles-frontend   #   74.00% frontend cycles idle    [83.31%]
 1,385,993,249,374 stalled-cycles-backend    #   47.17% backend  cycles idle    [66.75%]
 1,615,848,452,958 instructions              #    0.55  insns per cycle
                                             #    1.35  stalled cycles per insn [83.41%]
   277,855,085,701 branches                  #  144.221 M/sec                   [83.29%]
     8,449,913,638 branch-misses             #    3.04% of all branches         [83.31%]

     797.338722847 seconds time elapsed
{code}

With the default 1 sec threshold, there is almost zero extra cost. Note that these runs were done with no network hiccup injection; in the runs where network hiccups are injected, the WAL Switch functionality is very effective in getting past a bad HDFS pipeline.

This patch also adds metrics for WAL switching: the number of WAL switches, and the number of in-flight Append ops re-appended when switching. I also took care of the review board comments that are still applicable with this approach.

[[email protected]]:
bq.Any issues interrupting? I've found interrupting hdfs a PITA or rather, the 
variety of exceptions that can come up are many... its tricky figuring which 
can be caught and which not.
Yes, interrupting the SyncRunners causes a variety of exceptions. But I only interrupt them after the SyncFutures they were working on have been completed by the monitor and the handlers have been freed. Also, I let them die when interrupted, so there is no special handling to keep them alive.
An interesting race does arise when an old SyncRunner resumes (or tries to release SyncFutures in its 'finally' block after the interruption) and a SyncFuture it holds, which was already freed by the SyncMonitor, is also present as "not done" in the RingBuffer (or with some other SyncRunner), because the regionserver handler has since done more appends and put it back in the RingBuffer. I fixed that by comparing the last completed sequence Id of the SyncFuture with the 'offered' value of the SyncRunner; the guard is sketched below.
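
Roughly, the guard looks like the following. The types and method names are placeholders for illustration only, not the real SyncFuture/SyncRunner API:

{code}
/**
 * Illustrative stand-in for the check described above. An old (interrupted) SyncRunner
 * must not release a sync future that the SyncOpsMonitor already completed and that a
 * handler has since re-used for a newer append.
 */
class StaleSyncFutureGuard {
  /** Minimal placeholder for a sync future that remembers the last txid it completed at. */
  static class PendingSync {
    private volatile long lastCompletedTxid = -1;
    void markDone(long txid) { lastCompletedTxid = txid; }
    long lastCompletedTxid() { return lastCompletedTxid; }
  }

  /**
   * Release only the futures that still belong to this old SyncRunner's generation.
   * 'offeredTxid' is the highest txid this runner was asked to sync before being interrupted.
   */
  static void releaseUpTo(Iterable<PendingSync> pending, long offeredTxid) {
    for (PendingSync sync : pending) {
      if (sync.lastCompletedTxid() >= offeredTxid) {
        // Already completed by the monitor (and possibly live again in the RingBuffer): skip.
        continue;
      }
      sync.markDone(offeredTxid);   // free the handler blocked on this future
    }
  }
}
{code}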

bq.Set FSHLog#switching true. Every new append or is it sync must run over this 
new volatile?
No longer needed.

bq. 3) Grab their Append lists (i.e., whatever they were trying to sync). Consolidate, and sort it. These are the "in-flight" edits we need to append to the new Writer.
bq. 'sort'? We've given these items their seqid at this stage, right? Will the sort mess this up?
Yes, the sequence Ids are present at this stage.
And no, the sort cannot mess up the region sequence Id ordering: I keep a linked list of these in-flight edits in the order they were about to be appended, so their ordering is preserved. I added a test case with 20 threads inserting 5k entries each, with the switch threshold set to 10 ms. It verifies that there are no out-of-order edits once writing is done. It is a heavy test (it takes about 90 seconds on my local machine), but it is pretty good for testing the correctness of this functionality.
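
The ordering check itself boils down to verifying that, per region, sequence Ids appear in increasing order in the written WAL. A minimal sketch of that verification (made-up types, not the actual test code):

{code}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative-only ordering check, not the actual test in the patch. */
class WalOrderingCheck {
  /** A written WAL entry reduced to the two fields the check cares about. */
  static class WrittenEntry {
    final String regionName;
    final long regionSequenceId;
    WrittenEntry(String regionName, long regionSequenceId) {
      this.regionName = regionName;
      this.regionSequenceId = regionSequenceId;
    }
  }

  /** Returns true if, for every region, sequence Ids appear in strictly increasing order. */
  static boolean isInOrder(List<WrittenEntry> entriesInWriteOrder) {
    Map<String, Long> lastSeenPerRegion = new HashMap<>();
    for (WrittenEntry e : entriesInWriteOrder) {
      Long last = lastSeenPerRegion.get(e.regionName);
      if (last != null && e.regionSequenceId <= last) {
        return false;   // an out-of-order edit slipped through a switch
      }
      lastSeenPerRegion.put(e.regionName, e.regionSequenceId);
    }
    return true;
  }
}
{code}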

@jon, [~jeanmarcc]:
bq. Metrics would be great and could be done in a critical/mustdo follow-on patch
I added metrics for the number of WAL switches and the total number of in-flight edits that got re-appended while doing a WAL switch.
bq. "we have waiting an average of" before switching, in "total operation took x ms", and "in total it will have taken x ms" (based on the duration of the first thread)?
I added some metrics, but didn't quite follow the above ones.

> Provide better write predictability
> -----------------------------------
>
>                 Key: HBASE-10278
>                 URL: https://issues.apache.org/jira/browse/HBASE-10278
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Himanshu Vashishtha
>            Assignee: Himanshu Vashishtha
>         Attachments: 10278-wip-1.1.patch, Multiwaldesigndoc.pdf, 
> SwitchWriterFlow.pptx
>
>
> Currently, HBase has one WAL per region server. 
> Whenever there is any latency in the write pipeline (due to whatever reasons 
> such as n/w blip, a node in the pipeline having a bad disk, etc), the overall 
> write latency suffers. 
> Jonathan Hsieh and I analyzed various approaches to tackle this issue. We 
> also looked at HBASE-5699, which talks about adding concurrent multi WALs. 
> Along with performance numbers, we also focussed on design simplicity, 
> minimum impact on MTTR & Replication, and compatibility with 0.96 and 0.98. 
> Considering all these parameters, we propose a new HLog implementation with 
> WAL Switching functionality.
> Please find attached the design doc for the same. It introduces the WAL 
> Switching feature, and experiments/results of a prototype implementation, 
> showing the benefits of this feature.
> The second goal of this work is to serve as a building block for concurrent 
> multiple WALs feature.
> Please review the doc.


