[ https://issues.apache.org/jira/browse/HBASE-21738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746156#comment-16746156 ]

Zheng Hu commented on HBASE-21738:
----------------------------------

[~stack], the root cause is here:

{code}
HRegion#internalPrepareFlushCache
  |--> StoreFlusherImpl#prepare()
     |--> DefaultMemStore#snapshot()
        |--> new MemStoreSnapshot(...)
           |--> snapshot.getCellsCount()
              |--> Segment#getCellsCount
                 |--> CellSet#size
                   |--> ConcurrentSkipListMap#size()
{code}
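
In other words (a simplified sketch with a hypothetical class name, not the actual HBase source), the snapshot constructor eagerly counts cells while the flush path still holds update.writeLock():

{code}
// Simplified sketch (hypothetical class, not the real HBase source):
// the snapshot eagerly asks the segment for its cell count, which
// bottoms out in ConcurrentSkipListMap#size().
import java.util.concurrent.ConcurrentSkipListMap;

class MemStoreSnapshotSketch<K, V> {
  private final int cellsCount;

  MemStoreSnapshotSketch(ConcurrentSkipListMap<K, V> cellSet) {
    // Full O(n) traversal of the skip list, executed while the region
    // still holds the update write lock, so writes stall meanwhile.
    this.cellsCount = cellSet.size();
  }

  int getCellsCount() {
    return cellsCount;
  }
}
{code}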

ConcurrentSkipListMap#size() is quite a time-consuming operation, because the map
is designed for concurrency rather than for constant-time size queries:
{code}
    /**
     * Returns the number of key-value mappings in this map.  If this map
     * contains more than {@code Integer.MAX_VALUE} elements, it
     * returns {@code Integer.MAX_VALUE}.
     *
     * <p>Beware that, unlike in most collections, this method is
     * <em>NOT</em> a constant-time operation. Because of the
     * asynchronous nature of these maps, determining the current
     * number of elements requires traversing them all to count them.
     * Additionally, it is possible for the size to change during
     * execution of this method, in which case the returned result
     * will be inaccurate. Thus, this method is typically not very
     * useful in concurrent applications.
     *
     * @return the number of elements in this map
     */
    public int size() {
        long count = 0;
        for (Node<K,V> n = findFirst(); n != null; n = n.next) {
            if (n.getValidValue() != null)
                ++count;
        }
        return (count >= Integer.MAX_VALUE) ? Integer.MAX_VALUE : (int) count;
    }
{code}
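
For example, a standalone micro-benchmark (my own code, not from HBase) makes the linear cost easy to see:

{code}
// Standalone micro-benchmark (not HBase code): size() has to walk
// every node of the skip list, so its cost grows with the entry count.
import java.util.concurrent.ConcurrentSkipListMap;

public class CslmSizeCost {
  public static void main(String[] args) {
    ConcurrentSkipListMap<Integer, Integer> map = new ConcurrentSkipListMap<>();
    for (int i = 0; i < 1_000_000; i++) {
      map.put(i, i);
    }
    long start = System.nanoTime();
    int size = map.size(); // O(n) traversal of all 1M nodes
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println("size=" + size + ", took " + elapsedMs + " ms");
  }
}
{code}

The cost grows linearly with the number of entries, so a memstore holding millions of cells pays a noticeable pause for a single size() call, and on the flush path that whole traversal happens inside update.writeLock().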

So I think we should remove all the CSLM#size() calls from our memstore.
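
A minimal sketch of that direction, using a hypothetical CountingCellSet wrapper (an illustration only, not the actual patch): maintain an explicit counter next to the skip list and serve the count in O(1):

{code}
// Hypothetical sketch (not the actual patch): track the cell count
// explicitly so that size() never falls through to the O(n)
// ConcurrentSkipListMap#size() traversal.
import java.util.Arrays;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.LongAdder;

class CountingCellSet {
  private final ConcurrentSkipListMap<byte[], byte[]> delegate =
      new ConcurrentSkipListMap<>(Arrays::compare);
  private final LongAdder numCells = new LongAdder();

  void add(byte[] key, byte[] value) {
    // Count only inserts that created a new mapping, so the counter
    // stays in sync with the number of entries in the map.
    if (delegate.put(key, value) == null) {
      numCells.increment();
    }
  }

  int size() {
    // O(1): read the tracked counter instead of walking the skip list.
    return numCells.intValue();
  }
}
{code}

Under concurrent adds the counter can be momentarily out of sync with the map, but that is acceptable for flush accounting, and the javadoc above shows CSLM#size() has the same weakness anyway.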

> Latency spike happen when memstore flushing in 100% put case
> ------------------------------------------------------------
>
>                 Key: HBASE-21738
>                 URL: https://issues.apache.org/jira/browse/HBASE-21738
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Performance
>            Reporter: Zheng Hu
>            Assignee: Zheng Hu
>            Priority: Critical
>         Attachments: add-some-log.patch, image-2019-01-18-14-03-28-662.png, 
> log.txt
>
>
> I ran some performance tests for a 100% put workload on branch-2 before. 
> There are many latency peaks in the p999 latency curve, and the peak 
> times almost always coincide with region flushes. 
> See [hbase20-ssd-put-10000000000-rows-latencys-and-qps 
> |https://issues.apache.org/jira/secure/attachment/12955341/12955341_image-2019-01-18-14-03-28-662.png]
> I used 
> [add-some-log.patch|https://issues.apache.org/jira/secure/attachment/12955342/add-some-log.patch]
>  to log the time spent holding update.writeLock() while taking a 
> memstore snapshot. Testing again, I found the relevant entries in 
> [log.txt|https://issues.apache.org/jira/secure/attachment/12955343/log.txt]. 
> It seems most of the time is spent taking the memstore snapshot. Let me 
> dig into this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
