[ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843925#comment-13843925
 ] 

Feng Honghua commented on HBASE-8755:
-------------------------------------

[~v.himanshu]: 
bq.1) log rolling thread safety: Log rolling happens in parallel with 
flush/sync. Currently, in FSHLog, sync call grabs the updateLock to ensure it 
has a non-null writer (because of parallel log rolling). How does this patch 
address the non-null writer? Or is it not needed anymore? Also, if you go for 
the updatelock in sync, that might result in deadlock.
==> Good question :-). In rollWriter(), before switching this.writer to the 
newly created writer, *updateLock is held* and *cleanupCurrentWriter() is 
called*. Holding *updateLock* guarantees that no new edits enter 
pendingWrites, and calling *cleanupCurrentWriter()* guarantees that all edits 
currently in pendingWrites are written to hdfs and synced (sync() is called 
inside this method to provide that guarantee). So at the moment this.writer is 
switched to the new HLog writer, no new edits can enter and all existing edits 
have already been synced to hdfs; the AsyncWriter/AsyncSyncer threads have 
nothing to do and are idle, so there is no log rolling thread safety issue 
here.
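For clarity, a minimal, self-contained sketch of that ordering; the names 
(updateLock, pendingWrites, cleanupCurrentWriter) follow FSHLog, but this is 
illustrative only, not the actual patch code:

{code:java}
// Illustrative sketch of the rolling order described above, not the patch code.
import java.util.ArrayList;
import java.util.List;

class RollSketch {
  interface Writer { void append(String edit); void sync(); void close(); }

  private final Object updateLock = new Object();
  private final List<String> pendingWrites = new ArrayList<String>();
  private volatile Writer writer;

  RollSketch(Writer first) { this.writer = first; }

  void append(String edit) {
    synchronized (updateLock) {   // edits can only enter under updateLock
      pendingWrites.add(edit);
    }
  }

  void rollWriter(Writer next) {
    synchronized (updateLock) {   // 1) blocks any new edits from entering
      cleanupCurrentWriter();     // 2) drains and syncs everything pending,
      writer = next;              //    so the switch happens while the
    }                             //    Async* threads have nothing in flight
  }

  private void cleanupCurrentWriter() {
    for (String edit : pendingWrites) {
      writer.append(edit);
    }
    pendingWrites.clear();
    writer.sync();                // guarantee: all prior edits are synced
    writer.close();
  }
}
{code}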
bq.2) Error handling: It is not very clear how flush/sync failures are being 
handled... Let's say there are two handlers waiting for sync, t1 on txid 8 and 
t2 on txid 10. And, t1 wakes up on notification. Would t1 also get this 
exception? Wouldn't it be wrong, because txid 8 may have succeeded? Please 
correct me if I missed anything.
==> Your understanding here is correct, but the write semantics of HBase are 
'a successful write response means a successful write, while a failed write 
response can mean either a successful or a failed write', right? I already 
mentioned this in a comment above. This behavior can be improved by recording 
the txid range each sync covers; such a mapping can indicate exactly which 
pending txid range a failed sync applies to.
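As a hypothetical sketch of that mapping idea (not part of the patch), 
assuming the syncing thread records the highest txid covered by each 
successful sync:

{code:java}
// Hypothetical sketch, not in the patch: track the highest txid covered by the
// last successful sync so a waiting handler can tell whether a sync failure
// actually applies to its own txid (e.g. txid 8 vs txid 10 in the question).
import java.util.concurrent.atomic.AtomicLong;

class SyncRangeTracker {
  private final AtomicLong lastSuccessfulTxid = new AtomicLong(0);

  // Called by the syncing thread after a successful hdfs sync covering
  // everything up to and including upToTxid.
  void markSynced(long upToTxid) {
    long prev;
    do {
      prev = lastSuccessfulTxid.get();
    } while (prev < upToTxid && !lastSuccessfulTxid.compareAndSet(prev, upToTxid));
  }

  // Called by a handler woken up by a failed sync: txid 8 may already have
  // been covered by an earlier successful sync, while txid 10 was only in
  // the failed range and should see the exception.
  boolean failureAppliesTo(long txid) {
    return txid > lastSuccessfulTxid.get();
  }
}
{code}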
bq.3) I think the current HLogPE doesn't do justice to the real use 
case...Almost all HLog calls are appendNoSync, followed by a sync call. In the 
current HLogPE, we are calling append calls, which also does the sync
==> You're right, but looking at the write process as a whole, these two make 
little difference performance-wise: append() calls sync() internally, while in 
the real case of HRegion.java, appendNoSync() followed by sync() is used. 
Since sync() is now just waiting on a notification, the only difference 
between the two patterns is whether the handler starts waiting a bit sooner or 
later, with no impact on overall write performance (after putting its edits 
into pendingWrites, the write handler thread can only wait for the write/sync 
to hdfs to finish and can't help/influence write performance)... right?
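To make the comparison concrete, a small sketch of the two call sequences (the 
interface here is illustrative, not the real HLog API):

{code:java}
// Illustrative interface only, not the real HLog API.
import java.io.IOException;

interface Wal {
  long append(byte[] edit) throws IOException;        // queues the edit AND blocks until it is synced
  long appendNoSync(byte[] edit) throws IOException;  // only queues the edit into pendingWrites
  void sync(long txid) throws IOException;            // blocks until txid is synced
}

class CallPatterns {
  // pattern exercised by HLogPE
  static void viaAppend(Wal wal, byte[] edit) throws IOException {
    wal.append(edit);                    // handler starts waiting immediately
  }

  // pattern used by HRegion
  static void viaAppendNoSync(Wal wal, byte[] edit) throws IOException {
    long txid = wal.appendNoSync(edit);  // queue the edit
    wal.sync(txid);                      // handler starts waiting slightly later
  }
  // Either way the handler ends up blocked until the same sync completes,
  // which is why the two patterns measure essentially the same thing.
}
{code}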
bq.4) Perf numbers are super impressive. It would have been wonderful to have 
such numbers for lesser number of handler threads also (e.g., 5-10 threads). 
IMHO, that represents most common case scenario, but I could be wrong. I know 
this has been beaten to death in the discussions above, but just echoing my 
thoughts here
==> I think many people hold the same point of view as yours here, but I also 
have my own thoughts on this :-) Throughput is different from latency: latency 
measures how quickly the system performs a single user request, the quicker 
the better, while throughput measures how many requests the system can perform 
within a given time frame, the more the better. Both indicate a system's 
capability for serving user requests, just from different angles. For 
applications with low-to-medium write pressure, a small number of client write 
threads issuing requests is fine, but for applications with high write 
pressure, users/clients may feel bad if the system simply can't serve/reach 
their real-world throughput, *no matter* how many client threads are 
configured/added. With the improvement of this patch, we *can* at least 
satisfy such high throughput requirements by adding client threads. And 
without improving the latency of an individual write request (this patch does 
nothing for individual write latency), it's hard to improve throughput for 
*synchronous* client write threads (each can issue the next request only after 
the current one is done), which is why this patch has less effect with a 
single or few write threads. If HBase supported *asynchronous* client write 
threads, I think this patch could provide a big improvement even with few 
client write threads.
bq.5) I should also mention that while working on a different use case, I was 
trying to bring a layer of indirection b/w regionserver handlers and sync 
operation (Sync is the most costly affair in all HLog story). ...
==> Glad to see a similar approach :-). I wonder whether it shows the same 
improvement under heavy write pressure (such as 50/100/200 client write 
threads). Looking forward to seeing your final test results :-)

Finally, [~v.himanshu], thank you very much, you raised some *really good* 
questions. :-)

> A new write thread model for HLog to improve the overall HBase write 
> throughput
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-8755
>                 URL: https://issues.apache.org/jira/browse/HBASE-8755
>             Project: HBase
>          Issue Type: Improvement
>          Components: Performance, wal
>            Reporter: Feng Honghua
>            Assignee: stack
>            Priority: Critical
>         Attachments: 8755-syncer.patch, 8755trunkV2.txt, 
> HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, HBASE-8755-0.96-v0.patch, 
> HBASE-8755-trunk-V0.patch, HBASE-8755-trunk-V1.patch, 
> HBASE-8755-trunk-v4.patch, HBASE-8755-trunk-v6.patch, HBASE-8755-v5.patch
>
>
> In the current write model, each write handler thread (executing put()) will 
> individually go through a full 'append (hlog local buffer) => HLog writer 
> append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, 
> which incurs heavy contention on updateLock and flushLock.
> The only existing optimization, checking whether the current syncTillHere > 
> txid in the hope that another thread has already written/synced the edits up 
> to that txid to hdfs so the write/sync can be skipped, actually helps much 
> less than expected.
> Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
> proposed a new write thread model for writing hdfs sequence files, and the 
> prototype implementation shows a 4X throughput improvement (from 17000 to 
> 70000+). 
> I applied this new write thread model to HLog, and the performance test on 
> our test cluster shows about a 3X throughput improvement (from 12150 to 31520 
> for 1 RS, from 22000 to 70000 for 5 RS); the 1 RS write throughput (1K row 
> size) even beats that of BigTable (the Percolator paper published in 2011 
> says BigTable's write throughput at the time was 31002). I can provide the 
> detailed performance test results if anyone is interested.
> The changes for the new write thread model are as below (a sketch of the 
> resulting pipeline follows the list):
>  1> All put handler threads append their edits to HLog's local pending 
> buffer; (this notifies the AsyncWriter thread that there are new edits in the 
> local buffer)
>  2> All put handler threads wait in the HLog.syncer() function for the 
> underlying threads to finish the sync that covers their txid;
>  3> A single AsyncWriter thread is responsible for retrieving all the 
> buffered edits from HLog's local pending buffer and writing them to hdfs 
> (hlog.writer.append); (this notifies the AsyncFlusher thread that there are 
> new writes to hdfs that need a sync)
>  4> A single AsyncFlusher thread is responsible for issuing a sync to hdfs 
> to persist the writes made by AsyncWriter; (this notifies the AsyncNotifier 
> thread that the sync watermark has advanced)
>  5> A single AsyncNotifier thread is responsible for notifying all pending 
> put handler threads that are waiting in the HLog.syncer() function;
>  6> There is no LogSyncer thread any more (since the always-running 
> AsyncWriter/AsyncFlusher threads do the same job it did)
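A compressed, self-contained sketch of the pipeline described in steps 1> to 
5> above; class and field names echo the description, but this is illustrative 
only, not the patch itself:

{code:java}
// Illustrative sketch of the AsyncWriter / AsyncFlusher / AsyncNotifier
// pipeline; not the actual patch code.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class AsyncWalSketch {
  interface HdfsOut {
    void append(byte[] edit) throws IOException;
    void sync() throws IOException;
  }

  private final HdfsOut out;
  private final List<byte[]> pendingWrites = new ArrayList<byte[]>(); // step 1> local buffer
  private long bufferedTxid = 0;  // highest txid placed into pendingWrites
  private long writtenTxid  = 0;  // highest txid appended to hdfs (AsyncWriter)
  private long syncedTxid   = 0;  // highest txid persisted by sync (AsyncFlusher)
  private long notifiedTxid = 0;  // highest txid handlers were told about (AsyncNotifier)

  AsyncWalSketch(HdfsOut out) {
    this.out = out;
    start("AsyncWriter", new Runnable() { public void run() { writerLoop(); } });
    start("AsyncFlusher", new Runnable() { public void run() { flusherLoop(); } });
    start("AsyncNotifier", new Runnable() { public void run() { notifierLoop(); } });
  }

  private void start(String name, Runnable r) {
    Thread t = new Thread(r, name);
    t.setDaemon(true);
    t.start();
  }

  // steps 1> and 2>: handler threads buffer the edit, then wait for its txid
  synchronized long appendNoSync(byte[] edit) {
    pendingWrites.add(edit);
    notifyAll();                           // wake AsyncWriter
    return ++bufferedTxid;
  }

  synchronized void sync(long txid) throws InterruptedException {
    while (notifiedTxid < txid) wait();    // woken by AsyncNotifier
  }

  // step 3>: the single AsyncWriter drains the buffer and appends to hdfs
  private void writerLoop() {
    try {
      while (true) {
        List<byte[]> batch;
        long upTo;
        synchronized (this) {
          while (pendingWrites.isEmpty()) wait();
          batch = new ArrayList<byte[]>(pendingWrites);
          pendingWrites.clear();
          upTo = bufferedTxid;
        }
        for (byte[] edit : batch) out.append(edit);               // hdfs append, outside the lock
        synchronized (this) { writtenTxid = upTo; notifyAll(); }  // wake AsyncFlusher
      }
    } catch (Exception e) { throw new RuntimeException(e); }
  }

  // step 4>: the single AsyncFlusher persists what AsyncWriter has appended
  private void flusherLoop() {
    try {
      while (true) {
        long upTo;
        synchronized (this) {
          while (writtenTxid <= syncedTxid) wait();
          upTo = writtenTxid;
        }
        out.sync();                                              // hdfs sync, outside the lock
        synchronized (this) { syncedTxid = upTo; notifyAll(); }  // wake AsyncNotifier
      }
    } catch (Exception e) { throw new RuntimeException(e); }
  }

  // step 5>: the single AsyncNotifier wakes the handlers blocked in sync()
  private void notifierLoop() {
    try {
      while (true) {
        synchronized (this) {
          while (syncedTxid <= notifiedTxid) wait();
          notifiedTxid = syncedTxid;
          notifyAll();                                           // wake the put handlers
        }
      }
    } catch (Exception e) { throw new RuntimeException(e); }
  }
}
{code}

A handler thread then follows steps 1> and 2> with 
{{long txid = wal.appendNoSync(edit); wal.sync(txid);}}.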



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
