[
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843925#comment-13843925
]
Feng Honghua commented on HBASE-8755:
-------------------------------------
[~v.himanshu]:
bq.1) log rolling thread safety: Log rolling happens in parallel with
flush/sync. Currently, in FSHLog, sync call grabs the updateLock to ensure it
has a non-null writer (because of parallel log rolling). How does this patch
address the non-null writer? Or is it not needed anymore? Also, if you go for
the updatelock in sync, that might result in deadlock.
==> Good question :-). In rollWriter(), before this.writer is switched to the
newly created writer, *updateLock is held* and *cleanupCurrentWriter() is
called*. Holding updateLock guarantees that no new edits enter pendingWrites,
and cleanupCurrentWriter() guarantees that all edits currently in pendingWrites
are written to hdfs and sync-ed (sync() is called inside that method to provide
this guarantee). So at the moment this.writer is switched to the new HLog
writer, no new edits can enter, all existing edits have already been sync-ed to
hdfs, and the AsyncWriter/AsyncSyncer threads are idle with nothing left to do,
so there is no log rolling thread safety issue here.
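For illustration, here is a minimal, self-contained model of that ordering. The
names (updateLock, pendingWrites, cleanupCurrentWriter) follow FSHLog, but the
code itself is only a sketch of the idea, not the actual patch:
{code:java}
import java.util.ArrayList;
import java.util.List;

// Toy model of the rollWriter() ordering: hold updateLock, drain and "sync"
// pendingWrites, only then swap the writer. Hypothetical stand-in code.
class RollWriterModel {
  private final Object updateLock = new Object();
  private final List<String> pendingWrites = new ArrayList<>();
  private StringBuilder writer = new StringBuilder(); // stand-in for the HLog writer

  void append(String edit) {
    synchronized (updateLock) { // same lock the roll holds, so no edit slips in mid-roll
      pendingWrites.add(edit);
    }
  }

  // Stand-in for cleanupCurrentWriter(): write and sync everything buffered so far.
  private void cleanupCurrentWriter() {
    for (String edit : pendingWrites) {
      writer.append(edit);      // hlog.writer.append(...)
    }
    pendingWrites.clear();      // sync(): all buffered edits durable before the switch
  }

  void rollWriter() {
    StringBuilder nextWriter = new StringBuilder(); // new writer created up front
    synchronized (updateLock) {
      // While updateLock is held no new edits enter pendingWrites, and
      // cleanupCurrentWriter() has sync-ed everything already buffered, so the
      // AsyncWriter/AsyncSyncer threads are idle when the writer is swapped.
      cleanupCurrentWriter();
      writer = nextWriter;
    }
  }
}
{code}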
bq.2) Error handling: It is not very clear how flush/sync failures are being
handled?... Let's say there are two handlers waiting for sync, t1 on txid 8 and
t2 on txid 10. And, t1 wakes up on notification. Would t1 also get this
exception? Wouldn't it be wrong, because txid 8 may have succeeded? Please
correct me if I missed anything.
==> Your understanding here is correct, but HBase's write semantics are 'a
successful write response means a successful write, while a failed write
response can mean either a successful write or a failed write', right? I
already mentioned this in an earlier comment. This behavior can be improved by
adding a mapping from each sync batch to the txid range it covers; that mapping
makes it possible to tell exactly which pending txid range a failed sync
affects.
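To make the txid-range idea concrete, here is a hypothetical sketch (the class
and method names are made up for illustration) of bookkeeping that would let a
sync failure be attributed only to the txids in the failed batch:
{code:java}
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical bookkeeping: remember which txid range each hdfs sync covers,
// so a failed sync only fails the waiters whose txid is inside that range.
class SyncRangeTracker {
  // key = highest txid in a sync batch, value = lowest txid in that batch
  private final ConcurrentSkipListMap<Long, Long> batchRanges =
      new ConcurrentSkipListMap<>();

  // Called by the syncing thread before it issues the hdfs sync for [fromTxid, toTxid].
  void recordBatch(long fromTxid, long toTxid) {
    batchRanges.put(toTxid, fromTxid);
  }

  // On a sync failure for the batch ending at failedBatchTxid, only a waiter
  // whose txid falls inside that batch should see the exception; a waiter on an
  // earlier, already-synced txid (e.g. txid 8 when the failed batch is [9, 10])
  // should not.
  boolean isInFailedBatch(long waiterTxid, long failedBatchTxid) {
    Long from = batchRanges.get(failedBatchTxid);
    return from != null && waiterTxid >= from && waiterTxid <= failedBatchTxid;
  }
}
{code}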
bq.3) I think the current HLogPE doesn't do justice to the real use
case...Almost all HLog calls are appendNoSync, followed by a sync call. In the
current HLogPE, we are calling append calls, which also does the sync
==> You're right, but looking at the write process as a whole, the two have no
big performance difference: append() calls sync() internally, and in the real
case of HRegion.java, appendNoSync() followed by sync() is called. Since sync()
is now just a wait on a notification, the only difference between the two
patterns is whether the handler starts waiting sooner or later, with no impact
on overall write throughput (after putting its edits into pendingWrites, the
write handler thread can only wait for the write/sync to hdfs to finish and
cannot help or influence it)... right?
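For reference, the two call patterns being compared look roughly like this
(a hedged sketch with argument lists omitted; hlog stands for the region
server's HLog instance, and exact method signatures vary across HBase
versions):
{code:java}
// Pattern used by HLogPE today:
//   hlog.append(...);                  // append() calls sync() internally
//
// Pattern used by HRegion in the real write path:
//   long txid = hlog.appendNoSync(...);
//   // ... other per-put work ...
//   hlog.sync(txid);                   // under the new model this only waits on a
//                                      // notification until the pipeline passes txid
{code}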
bq.4) Perf numbers are super impressive. It would have been wonderful to have
such numbers for lesser number of handler threads also (e.g., 5-10 threads).
IMHO, that represents most common case scenario, but I could be wrong. I know
this has been beaten to death in the discussions above, but just echoing my
thoughts here
==> I think many people hold the same point of view as yours here, but I also
have my own thoughts on this :-) Throughput is different from latency: latency
measures how quickly a system performs a single user request, the quicker the
better, while throughput measures how many requests a system performs within a
given time frame, the more the better. Both indicate a system's capability for
serving user requests, just from different angles. For applications with
low-to-medium write stress, a small number of client write threads is fine,
but for applications with high write stress, users/clients will be unhappy if
the system simply cannot reach their real-world throughput, *no matter* how
many client threads are configured/added. With the improvement in this patch,
we at least *can* satisfy such high throughput requirements by adding client
threads. And without improving an individual write request's latency (this
patch does nothing for individual write latency), it is hard to improve
throughput for a *synchronous* client write thread, which can issue its next
request only after the current one is done; that is why this patch has less
effect with a single or few write threads. If HBase supported *asynchronous*
client write threads, I think this patch could provide a big improvement even
with few client write threads.
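As a rough back-of-the-envelope illustration (the numbers here are
hypothetical, not from the tests above): a synchronous client thread's
throughput is bounded by 1 / per-write latency, so at roughly 2 ms per write
one thread tops out near 500 writes/s and 10 threads near 5,000 writes/s no
matter how much capacity the server has, while 100 such threads can reach
around 50,000 writes/s; that is roughly why the gain from this patch only
shows up once there are many client write threads (or asynchronous clients).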
bq.5) I should also mention that while working on a different use case, I was
trying to bring a layer of indirection b/w regionserver handlers and sync
operation (Sync is the most costly affair in all HLog story). ...
==> Glad to see a similar approach :-). I wonder whether it shows the same
improvement under heavy write stress (such as 50/100/200 client write
threads). Looking forward to seeing your final test results :-)
Finally, [~v.himanshu], thank you very much; you raised some *really good*
questions. :-)
> A new write thread model for HLog to improve the overall HBase write
> throughput
> -------------------------------------------------------------------------------
>
> Key: HBASE-8755
> URL: https://issues.apache.org/jira/browse/HBASE-8755
> Project: HBase
> Issue Type: Improvement
> Components: Performance, wal
> Reporter: Feng Honghua
> Assignee: stack
> Priority: Critical
> Attachments: 8755-syncer.patch, 8755trunkV2.txt,
> HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, HBASE-8755-0.96-v0.patch,
> HBASE-8755-trunk-V0.patch, HBASE-8755-trunk-V1.patch,
> HBASE-8755-trunk-v4.patch, HBASE-8755-trunk-v6.patch, HBASE-8755-v5.patch
>
>
> In the current write model, each write handler thread (executing put()) will
> individually go through a full 'append (hlog local buffer) => HLog writer
> append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write,
> which incurs heavy contention on updateLock and flushLock.
> The only existing optimization, checking whether the current syncTillHere >
> txid in the hope that another thread has already written/synced the edits up
> to our txid so that the write/sync can be skipped, actually helps much less
> than expected.
> Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi
> proposed a new write thread model for writing hdfs sequence files, and the
> prototype implementation shows a 4X throughput improvement (from 17000 to
> 70000+).
> I applied this new write thread model to HLog, and the performance test in
> our test cluster shows about a 3X throughput improvement (from 12150 to
> 31520 for 1 RS, from 22000 to 70000 for 5 RS); the 1 RS write throughput
> (1K row-size) even beats that of BigTable (the Percolator paper published in
> 2011 says Bigtable's write throughput then was 31002). I can provide the
> detailed performance test results if anyone is interested.
> The change for the new write thread model is as below (a minimal sketch of
> this pipeline follows the list):
> 1> All put handler threads append their edits to HLog's local pending
> buffer; (each notifies the AsyncWriter thread that there are new edits in
> the local buffer)
> 2> All put handler threads wait in HLog.syncer() for the underlying threads
> to finish the sync that covers their txid;
> 3> A single AsyncWriter thread is responsible for retrieving all the
> buffered edits from HLog's local pending buffer and writing them to hdfs
> (hlog.writer.append); (it notifies the AsyncFlusher thread that there are
> new writes to hdfs that need a sync)
> 4> A single AsyncFlusher thread is responsible for issuing a sync to hdfs to
> persist the writes made by AsyncWriter; (it notifies the AsyncNotifier
> thread that the sync watermark has increased)
> 5> A single AsyncNotifier thread is responsible for notifying all pending
> put handler threads that are waiting in HLog.syncer();
> 6> No LogSyncer thread any more (the AsyncWriter/AsyncFlusher threads
> already do the job it did)
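For illustration, below is a minimal, self-contained model of the hand-off
described in steps 1> through 6> (hypothetical names, a single monitor, and
the AsyncNotifier step folded into the flusher's notifyAll for brevity; this
is a sketch of the idea, not the patch code):
{code:java}
import java.util.ArrayList;
import java.util.List;

class WalPipelineModel {
  private final List<String> pendingWrites = new ArrayList<>(); // step 1: handlers append here
  private long lastTxidAppended = 0; // assigned when an edit enters pendingWrites
  private long lastTxidWritten = 0;  // advanced by AsyncWriter after the hdfs write
  private long lastTxidSynced = 0;   // advanced by AsyncFlusher after the hdfs sync

  // Step 1: a put handler appends an edit and wakes the AsyncWriter.
  synchronized long append(String edit) {
    pendingWrites.add(edit);
    notifyAll();
    return ++lastTxidAppended;
  }

  // Step 2: the put handler waits until the sync watermark has passed its txid.
  synchronized void syncer(long txid) throws InterruptedException {
    while (lastTxidSynced < txid) {
      wait();
    }
  }

  // Step 3: AsyncWriter drains the buffer, "writes" it to hdfs, then wakes the flusher.
  synchronized void asyncWriterLoopBody() throws InterruptedException {
    while (pendingWrites.isEmpty()) {
      wait();
    }
    lastTxidWritten = lastTxidAppended;
    pendingWrites.clear(); // stand-in for hlog.writer.append(...)
    notifyAll();
  }

  // Steps 4/5: AsyncFlusher issues one sync covering everything written so far and
  // wakes the waiting handlers (the AsyncNotifier role is folded in here).
  synchronized void asyncFlusherLoopBody() throws InterruptedException {
    while (lastTxidSynced >= lastTxidWritten) {
      wait();
    }
    lastTxidSynced = lastTxidWritten; // stand-in for hlog.writer.sync()
    notifyAll();
  }
}
{code}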