[
https://issues.apache.org/jira/browse/HBASE-21738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746156#comment-16746156
]
Zheng Hu commented on HBASE-21738:
----------------------------------
[~stack], the root cause is here:
{code}
HRegion#internalPrepareFlushCache
|--> StoreFlusherImpl#prepare()
|--> DefaultMemStore#snapshot()
|--> new MemStoreSnapshot(...)
|--> snapshot.getCellsCount()
|--> Segment#getCellsCount
|--> CellSet#size
|--> ConcurrentSkipListMap#size()
{code}
ConcurrentSkipListMap#size() is quite a time-consuming operation, because the map is designed for better concurrency and does not maintain an element count:
{code}
/**
* Returns the number of key-value mappings in this map. If this map
* contains more than {@code Integer.MAX_VALUE} elements, it
* returns {@code Integer.MAX_VALUE}.
*
* <p>Beware that, unlike in most collections, this method is
* <em>NOT</em> a constant-time operation. Because of the
* asynchronous nature of these maps, determining the current
* number of elements requires traversing them all to count them.
* Additionally, it is possible for the size to change during
* execution of this method, in which case the returned result
* will be inaccurate. Thus, this method is typically not very
* useful in concurrent applications.
*
* @return the number of elements in this map
*/
public int size() {
    long count = 0;
    for (Node<K,V> n = findFirst(); n != null; n = n.next) {
        if (n.getValidValue() != null)
            ++count;
    }
    return (count >= Integer.MAX_VALUE) ? Integer.MAX_VALUE : (int) count;
}
{code}
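To make the linear cost concrete, here is a small standalone sketch (not HBase code; the entry count and key type are arbitrary) that fills a ConcurrentSkipListMap and times a single size() call:

```java
import java.util.concurrent.ConcurrentSkipListMap;

public class CslmSizeDemo {
    // Fill a skip-list map with n entries and return what size() reports,
    // printing how long the size() call itself took.
    static int timeSize(int n) {
        ConcurrentSkipListMap<Integer, Integer> map = new ConcurrentSkipListMap<>();
        for (int i = 0; i < n; i++) {
            map.put(i, i);
        }
        long start = System.nanoTime();
        int size = map.size(); // traverses every node -- O(n), not O(1)
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.println("size() over " + n + " entries took " + micros + " us");
        return size;
    }

    public static void main(String[] args) {
        timeSize(1_000_000);
    }
}
```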
So I think we should remove all the CSLM#size() calls from our memstore.
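One possible direction (a sketch of the idea only, not the actual HBase patch; the class and method names are hypothetical) is to maintain the cell count in a separate counter that is updated on every add/remove, so the snapshot path reads it in O(1) instead of calling CSLM#size():

```java
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: keep the cell count in a LongAdder updated on
// add/remove so the flush path reads it in O(1) instead of triggering
// an O(n) skip-list traversal via ConcurrentSkipListMap#size().
public class CountedCellSet {
    private final ConcurrentSkipListMap<String, String> cells = new ConcurrentSkipListMap<>();
    private final LongAdder count = new LongAdder();

    public void add(String key, String value) {
        // Only bump the counter when the key is actually new;
        // an overwrite leaves the cell count unchanged.
        if (cells.put(key, value) == null) {
            count.increment();
        }
    }

    public String remove(String key) {
        String old = cells.remove(key);
        if (old != null) {
            count.decrement();
        }
        return old;
    }

    // O(1) read; may be momentarily stale under concurrent writers,
    // which is acceptable for a snapshot's cell count.
    public long getCellsCount() {
        return count.sum();
    }
}
```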
> Latency spike happen when memstore flushing in 100% put case
> ------------------------------------------------------------
>
> Key: HBASE-21738
> URL: https://issues.apache.org/jira/browse/HBASE-21738
> Project: HBase
> Issue Type: Sub-task
> Components: Performance
> Reporter: Zheng Hu
> Assignee: Zheng Hu
> Priority: Critical
> Attachments: add-some-log.patch, image-2019-01-18-14-03-28-662.png,
> log.txt
>
>
> I ran some performance tests for the 100% put case on branch-2.
> There are many latency peaks in the p999 latency curve, and the peak
> times almost always coincide with region flushes.
> See the [hbase20-ssd-put-10000000000-rows-latencys-and-qps
> |https://issues.apache.org/jira/secure/attachment/12955341/12955341_image-2019-01-18-14-03-28-662.png]
> And I used the
> [add-some-log.patch|https://issues.apache.org/jira/secure/attachment/12955342/add-some-log.patch]
> to log the time spent holding update.writeLock() while taking a
> memstore snapshot. Testing again, I found those logs in
> [log.txt|https://issues.apache.org/jira/secure/attachment/12955343/log.txt]
> It seems most of the time was consumed taking the memstore snapshot. Let
> me dig into this.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)