[ https://issues.apache.org/jira/browse/HBASE-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385242#comment-15385242 ]

Allan Yang edited comment on HBASE-15213 at 7/20/16 3:12 AM:
-------------------------------------------------------------

After a little searching, I found that the idea of removing the WriteQueue 
was already raised by someone else, but was soon rejected. It seems that if 
all operations are synced to the WAL, then indeed we don't need a queue to 
ensure the monotonic increase of MVCC. But there are other operations which 
don't sync the WAL, so we still need a queue for them. Let me quote two 
comments from HBASE-8763:
{quote}
Jeffrey Zhong added a comment - 03/Dec/13 08:00
Today I had some discussion with Enis Soztutar and Ted Yu on this topic and 
found it might be possible to handle the JIRA issue in a simpler way. Below are 
the steps:
1) Memstore insert using Long.MAX_VALUE as the initial write number
2) append no sync
3) sync
4) update WriteEntry's write number to the sequence number returned from Step 2
5) CompleteMemstoreInsert. In this step, make the current read point >= the 
sequence number from Step 2. The reasoning is that once we have synced up to 
that sequence number, all changes with smaller sequence numbers are already 
synced into the WAL. Therefore, we should be able to bump the read point up 
to the last synced sequence number.
Currently, we maintain an internal queue which may defer the read-point bump 
when the transaction completion order differs from that of the MVCC internal 
write queue.
By doing the above, it's possible to remove the logic maintaining the 
writeQueue, which means we can remove two locks and one queue loop from the 
write code path. Sounds too good to be true. Let me try to write a quick 
patch and run it against the unit tests to see if the idea could fly.
{quote}
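
For readers following along, here is a self-contained toy of those five 
steps. The names are hypothetical and only loosely modeled on the real 
MultiVersionConsistencyControl/WAL classes; it is a sketch of the reasoning, 
not the actual code:
{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Self-contained toy of the five steps above (hypothetical names).
public class FiveStepSketch {
  static final AtomicLong walSeq = new AtomicLong();    // WAL sequence numbers
  static final AtomicLong readPoint = new AtomicLong(); // MVCC read point

  static void write(String cell) {
    long writeNumber = Long.MAX_VALUE;          // 1) memstore insert: MAX_VALUE
                                                //    hides the edit from readers
    long seq = walSeq.incrementAndGet();        // 2) appendNoSync -> seqnum
    // wal.sync(seq);                           // 3) sync: edits <= seq durable
    writeNumber = seq;                          // 4) swap in the real seqnum
    readPoint.accumulateAndGet(seq, Math::max); // 5) read point >= seq is safe,
                                                //    smaller seqnums synced too
    System.out.println(cell + " visible at seq " + writeNumber);
  }

  public static void main(String[] args) {
    write("row1/cf:a");
    write("row2/cf:a");
    System.out.println("read point = " + readPoint.get()); // prints 2
  }
}
{code}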

{quote}
Jeffrey Zhong added a comment - 03/Dec/13 11:55
I tried a small patch. Since we support the SKIP_WAL mode, the 
MVCC.writeQueue is still needed to maintain the write order because there is 
no WAL sync operation at all. Also, there are quite a few test cases that 
don't do appendNoSync between mvcc.beginMemstoreInsert and 
mvcc.completeMemstoreInsert, so they need to be adjusted. So far I haven't 
found blocking issues, but I still need to verify it thoroughly.
{quote}
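
In other words, the idea breaks down for writes like the following, which 
never produce a WAL sync for the read point to advance to (standard HBase 
client API; connection and table setup omitted):
{code:java}
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// A mutation that never touches the WAL: no appendNoSync, no sync, hence no
// synced sequence number to bump the read point to. The writeQueue is what
// keeps MVCC monotonic for edits like this one.
Put p = new Put(Bytes.toBytes("row1"));
p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
p.setDurability(Durability.SKIP_WAL);
// table.put(p);
{code}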



> Fix increment performance regression caused by HBASE-8763 on branch-1.0
> -----------------------------------------------------------------------
>
>                 Key: HBASE-15213
>                 URL: https://issues.apache.org/jira/browse/HBASE-15213
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Performance
>            Reporter: Junegunn Choi
>            Assignee: Junegunn Choi
>             Fix For: 1.1.4, 1.0.4
>
>         Attachments: 15157v3.branch-1.1.patch, HBASE-15213-increment.png, 
> HBASE-15213.branch-1.0.patch, HBASE-15213.v1.branch-1.0.patch
>
>
> This is an attempt to fix the increment performance regression caused by 
> HBASE-8763 on branch-1.0.
> I'm aware that hbase.increment.fast.but.narrow.consistency was added to 
> branch-1.0 (HBASE-15031) to address the issue, and that separate work is 
> ongoing on the master branch, but this is my take on the problem.
> I read through HBASE-14460 and HBASE-8763; it wasn't clear to me what 
> caused the slowdown, but I could indeed reproduce the performance regression.
> Test setup:
> - Server: 4-core Xeon 2.4GHz Linux server running mini cluster (100 handlers, 
> JDK 1.7)
> - Client: Another box of the same spec
> - Increments on random 10k records on a single-region table, recreated every 
> time
> Increment throughput (TPS):
> || Num threads || Before HBASE-8763 (d6cc2fb) || branch-1.0 || branch-1.0 (narrow-consistency) ||
> || 1            | 2661                         | 2486        | 2359  |
> || 2            | 5048                         | 5064        | 4867  |
> || 4            | 7503                         | 8071        | 8690  |
> || 8            | 10471                        | 10886       | 13980 |
> || 16           | 15515                        | 9418        | 18601 |
> || 32           | 17699                        | 5421        | 20540 |
> || 64           | 20601                        | 4038        | 25591 |
> || 96           | 19177                        | 3891        | 26017 |
> We can clearly observe that the throughput degrades as we increase the 
> number of concurrent requests, which led me to believe that there is severe 
> context-switching overhead. I could indirectly confirm that suspicion with 
> the cs (context switches) column in vmstat output: branch-1.0 shows a much 
> higher number of context switches even at much lower throughput.
> Here are the observations:
> - A WriteEntry in the writeQueue can only be removed by the very handler 
> that put it, and only when it is at the front of the queue and marked 
> complete.
> - Since a WriteEntry is marked complete after the wait-loop, only one entry 
> can be removed at a time.
> - This stringent condition causes O(N^2) context switches, where N is the 
> number of concurrent handlers processing requests.
> So what I tried here is to mark a WriteEntry complete before entering the 
> wait-loop. With this change, multiple WriteEntries can be shifted at a time 
> without context switches. I changed writeQueue to a LinkedHashSet since a 
> fast containment check is needed, as a WriteEntry can now be removed by any 
> handler.
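> Here is a minimal, self-contained sketch of the idea with hypothetical 
> names (the real patch is against MultiVersionConsistencyControl and differs 
> in detail):
> {code:java}
> import java.util.Iterator;
> import java.util.LinkedHashSet;
> 
> // Toy model of the fix (hypothetical names, not the actual branch-1.0
> // code). Marking an entry complete BEFORE waiting lets whichever handler
> // holds the lock shift the whole completed prefix at once, instead of one
> // entry per wakeup as described above.
> class ToyMvcc {
>   static final class WriteEntry {
>     final long seq;
>     boolean completed;
>     WriteEntry(long seq) { this.seq = seq; }
>   }
> 
>   // LinkedHashSet: insertion order for the prefix scan, plus an O(1)
>   // containment check so a handler can tell when its entry was removed.
>   private final LinkedHashSet<WriteEntry> writeQueue = new LinkedHashSet<>();
>   private long readPoint;
> 
>   synchronized WriteEntry begin(long seq) {
>     WriteEntry e = new WriteEntry(seq);
>     writeQueue.add(e);
>     return e;
>   }
> 
>   synchronized void complete(WriteEntry entry) throws InterruptedException {
>     entry.completed = true;                    // mark complete BEFORE waiting
>     Iterator<WriteEntry> it = writeQueue.iterator();
>     while (it.hasNext()) {                     // shift the completed prefix,
>       WriteEntry head = it.next();             // possibly including entries
>       if (!head.completed) break;              // put there by other handlers
>       readPoint = Math.max(readPoint, head.seq);
>       it.remove();
>     }
>     notifyAll();                               // wake handlers waiting below
>     while (writeQueue.contains(entry)) wait(); // wait only while still queued
>   }
> }
> {code}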
> The numbers look good; throughput is virtually identical to the 
> pre-HBASE-8763 era.
> || Num threads || branch-1.0 with fix ||
> || 1            | 2459                 |
> || 2            | 4976                 |
> || 4            | 8033                 |
> || 8            | 12292                |
> || 16           | 15234                |
> || 32           | 16601                |
> || 64           | 19994                |
> || 96           | 20052                |
> So what do you think about it? Please let me know if I'm missing anything.


