[ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830678#comment-13830678
 ] 

Feng Honghua commented on HBASE-8755:
-------------------------------------

bq.Do these no longer pass?
=> Yes. Under the new thread model there is no explicit method to trigger the 
sync, and a test can't tell whether there are outstanding deferred entries (the 
AsyncWriter/AsyncSyncer threads write and sync on a best-effort basis).

bq.We have hard-coded 5 asyncSyncers? Why 5?
=> Yes. I tried 2/3/5/10 and found 5 to be the best number (2 and 3 perform 
worse; 10 performs the same as 5 but introduces too many extra threads).

bq.If we fail to find a free syncer, i don't follow what is going on w/ 
choosing a random syncer and setting txid as in below
=> When we fail to find an idle syncer (i.e. every syncer is in the middle of a 
sync), choosing a random syncer and setting its txid falls back to the same 
behavior as before the extra asyncSyncer threads were introduced: when 
asyncWriter pushes new entries to hdfs before asyncSyncer has synced the 
previously pushed ones, asyncSyncer is notified of the newly pushed txid, but 
that txid is only synced in the next round, after asyncSyncer is done with the 
current one. (Notice we use txidToFlush to record the txid each sync is for; it 
can't change during a sync, while writtenTxid can.)
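
A minimal sketch of that selection rule, for illustration only. SyncerState 
and SyncerPicker are hypothetical names, not the classes in the patch; only 
the field names txidToFlush and writtenTxid come from the explanation above. 
It assumes each syncer snapshots txidToFlush when a sync round starts.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical stand-in for one asyncSyncer thread's state; not the HBase class. */
final class SyncerState {
    private long writtenTxid;   // newest txid handed over by the writer; may move mid-sync
    private long txidToFlush;   // txid the in-flight sync covers; fixed for that sync
    private boolean syncing;

    synchronized boolean isSyncing() { return syncing; }
    synchronized long writtenTxid() { return writtenTxid; }
    synchronized long txidToFlush() { return txidToFlush; }

    /** Writer side: record that entries up to txid have been pushed to hdfs. */
    synchronized void setWrittenTxid(long txid) {
        writtenTxid = Math.max(writtenTxid, txid);
        notifyAll();
    }

    /** Syncer side: snapshot the target txid at the start of a sync round. */
    synchronized long beginSync() {
        syncing = true;
        txidToFlush = writtenTxid;
        return txidToFlush;
    }

    /** Syncer side: mark the current sync round as finished. */
    synchronized void endSync() { syncing = false; }
}

final class SyncerPicker {
    /** Prefer an idle syncer; if all are busy, fall back to a random one. */
    static int pick(SyncerState[] syncers) {
        for (int i = 0; i < syncers.length; i++) {
            if (!syncers[i].isSyncing()) return i;
        }
        return ThreadLocalRandom.current().nextInt(syncers.length);
    }
}
```

If the picked syncer is busy, setWrittenTxid() only raises writtenTxid; the 
in-flight round keeps its txidToFlush, and the new txid is picked up by the 
next beginSync().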

To summarize: the sync operation is the most time-consuming phase. Under the 
old write model, every write handler issues a separate sync directly for itself 
(unless it returns early thanks to syncedTillHere). Under the new write model, 
the separate threads significantly reduce lock contention, but when there are 
few concurrent write threads, that benefit (fewer write threads, less benefit) 
can't offset the inefficiency of using a single asyncSyncer thread: each sync 
by the asyncSyncer thread covers only a portion of the writes, yet the write 
handlers whose entries are already in the buffer or pushed to hdfs also have to 
wait for it to complete, and can't proceed until the next sync phase is done.

By introducing extra asyncSyncer threads, the correctness of this model is the 
same as before: there is still a single asyncWriter thread, which pushes 
buffered entries to hdfs sequentially (txid increases sequentially), and when 
each asyncSyncer is done, it's guaranteed that all smaller txids have been 
pushed to hdfs and successfully synced.
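
One way to picture that guarantee with several syncers completing in any 
order is a watermark that only ever moves forward. The sketch below is an 
assumption for illustration (SyncWatermark and its methods are hypothetical 
names, and the patch may track syncedTillHere differently):

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical monotonic sync watermark shared by several syncer threads. */
final class SyncWatermark {
    private final AtomicLong syncedTillHere = new AtomicLong(0);

    /**
     * Called by a syncer when its sync for txid completes. Because the single
     * writer pushes entries in txid order, a finished sync for txid also covers
     * every smaller txid, so the watermark is simply the maximum seen so far.
     */
    long advanceTo(long txid) {
        return syncedTillHere.accumulateAndGet(txid, Math::max);
    }

    /** A write handler can return once its txid is at or below the watermark. */
    boolean isSynced(long txid) {
        return txid <= syncedTillHere.get();
    }
}
```

A syncer that finishes late with an older txid leaves the watermark 
unchanged, so handlers never see it move backwards.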

> A new write thread model for HLog to improve the overall HBase write 
> throughput
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-8755
>                 URL: https://issues.apache.org/jira/browse/HBASE-8755
>             Project: HBase
>          Issue Type: Improvement
>          Components: Performance, wal
>            Reporter: Feng Honghua
>            Assignee: stack
>            Priority: Critical
>         Attachments: 8755trunkV2.txt, HBASE-8755-0.94-V0.patch, 
> HBASE-8755-0.94-V1.patch, HBASE-8755-0.96-v0.patch, 
> HBASE-8755-trunk-V0.patch, HBASE-8755-trunk-V1.patch, 
> HBASE-8755-trunk-v4.patch
>
>
> In the current write model, each write handler thread (executing put()) 
> individually goes through a full 'append (hlog local buffer) => HLog writer 
> append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, 
> which incurs heavy contention on updateLock and flushLock.
> The only optimization, checking whether the current syncTillHere > txid in 
> the hope that another thread has already written/synced this txid to hdfs so 
> the write/sync can be omitted, actually helps much less than expected.
> Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
> proposed a new write thread model for writing hdfs sequence files, and the 
> prototype implementation shows a 4X improvement in throughput (from 17000 to 
> 70000+). 
> I applied this new write thread model to HLog, and the performance test in 
> our test cluster shows about 3X throughput improvement (from 12150 to 31520 
> for 1 RS, from 22000 to 70000 for 5 RS). The 1 RS write throughput (1K 
> row-size) even beats that of BigTable (Percolator, published in 2011, says 
> Bigtable's write throughput then was 31002). I can provide the detailed 
> performance test results if anyone is interested.
> The changes in the new write thread model are as follows:
>  1> All put handler threads append their edits to HLog's local pending 
> buffer; (this notifies the AsyncWriter thread that there are new edits in the 
> local buffer)
>  2> All put handler threads wait in the HLog.syncer() function for the 
> underlying threads to finish the sync that contains their txid;
>  3> A single AsyncWriter thread is responsible for retrieving all the 
> buffered edits in HLog's local pending buffer and writing them to hdfs 
> (hlog.writer.append); (it notifies the AsyncFlusher thread that there are new 
> writes to hdfs that need a sync)
>  4> A single AsyncFlusher thread is responsible for issuing a sync to hdfs 
> to persist the writes by AsyncWriter; (it notifies the AsyncNotifier thread 
> that the sync watermark has increased)
>  5> A single AsyncNotifier thread is responsible for notifying all pending 
> put handler threads which are waiting in the HLog.syncer() function
>  6> No LogSyncer thread any more (since the AsyncWriter/AsyncFlusher threads 
> always do the same job it did)
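
For readers who want to see the hand-offs in steps 1-5 concretely, here is a 
toy in-memory model. It is a sketch under loud assumptions, not the patch: 
WalPipelineSketch and all of its members are hypothetical, a plain list 
stands in for the HLog writer and hdfs, the "sync" is just a counter update, 
and a single lock replaces whatever finer-grained locking the real code uses.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the pipeline; in-memory lists stand in for the HLog writer and hdfs. */
final class WalPipelineSketch {
    private interface Stage { void runOnce() throws InterruptedException; }

    private final Object lock = new Object();
    private final List<String> pending = new ArrayList<>();  // local pending buffer
    private final List<String> written = new ArrayList<>();  // stand-in for edits pushed to hdfs
    private long appendedTxid, writtenTxid, flushedTxid, syncedTillHere;

    WalPipelineSketch() {
        start("AsyncWriter", this::writeOnce);
        start("AsyncFlusher", this::flushOnce);
        start("AsyncNotifier", this::notifyOnce);
    }

    private static void start(String name, Stage stage) {
        Thread t = new Thread(() -> {
            try { while (true) stage.runOnce(); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }, name);
        t.setDaemon(true);  // let the JVM exit without an explicit shutdown
        t.start();
    }

    /** Step 1: a put handler appends an edit and wakes the writer. */
    long append(String edit) {
        synchronized (lock) {
            pending.add(edit);
            lock.notifyAll();
            return ++appendedTxid;
        }
    }

    /** Step 2: a put handler waits until a sync covering its txid is done. */
    void sync(long txid) {
        synchronized (lock) {
            try {
                while (syncedTillHere < txid) lock.wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for sync", e);
            }
        }
    }

    /** Step 3: drain the pending buffer in txid order and wake the flusher. */
    private void writeOnce() throws InterruptedException {
        synchronized (lock) {
            while (pending.isEmpty()) lock.wait();
            written.addAll(pending);   // a real writer would do this I/O outside the lock
            pending.clear();
            writtenTxid = appendedTxid;
            lock.notifyAll();
        }
    }

    /** Step 4: "sync" everything written so far and wake the notifier. */
    private void flushOnce() throws InterruptedException {
        synchronized (lock) {
            while (flushedTxid >= writtenTxid) lock.wait();
            flushedTxid = writtenTxid; // stand-in for the hdfs sync call
            lock.notifyAll();
        }
    }

    /** Step 5: raise the watermark and wake the waiting put handlers. */
    private void notifyOnce() throws InterruptedException {
        synchronized (lock) {
            while (syncedTillHere >= flushedTxid) lock.wait();
            syncedTillHere = flushedTxid;
            lock.notifyAll();
        }
    }

    long syncedTillHere() { synchronized (lock) { return syncedTillHere; } }
    List<String> writtenEdits() { synchronized (lock) { return new ArrayList<>(written); } }
}
```

A put handler would call append() and then sync() with the returned txid. 
Several handlers that appended before the writer woke up are covered by a 
single write and a single sync, which is where the batching benefit described 
above comes from.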



--
This message was sent by Atlassian JIRA
(v6.1#6144)
