[ 
https://issues.apache.org/jira/browse/HBASE-19290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16263720#comment-16263720
 ] 

binlijin commented on HBASE-19290:
----------------------------------

[~tedyu]
bq. Assuming patch v3 is very close to the version you run in the 2000+ node 
production cluster, can you post some performance numbers (in terms of 
reduction in zookeeper requests) so that we can know its effectiveness ?

We did not record performance numbers.
But without the patch we can see that the HMaster receives the zookeeper events 
very slowly, as the timestamps in the logs below show.

The HMaster puts up the split task:

*2017-07-11 20:22:57,608* DEBUG [main-EventThread] 
coordination.SplitLogManagerCoordination: put up splitlog task at znode 
/hbase/splitWAL/WALs%2Fhadoop0448.et2.tbsite.net%2C16020%2C1495647366007-splitting%2Fhadoop0448.et2.tbsite.net%252C16020%252C1495647366007.regiongroup-2.1499768090548

A RegionServer grabs the task and completes it:

*2017-07-11 20:23:33,689* INFO  [SplitLogWorker-hadoop1435:16020] 
coordination.ZkSplitLogWorkerCoordination: worker 
hadoop1435.et2.tbsite.net,16020,1495647366458 acquired task 
/hbase/splitWAL/WALs%2Fhadoop0448.et2.tbsite.net%2C16020%2C1495647366007-splitting%2Fhadoop0448.et2.tbsite.net%252C16020%252C1495647366007.regiongroup-2.1499768090548

*2017-07-11 20:25:47,131* INFO  [RS_LOG_REPLAY_OPS-hadoop1435:16020-1] 
coordination.ZkSplitLogWorkerCoordination: successfully transitioned task 
/hbase/splitWAL/WALs%2Fhadoop0448.et2.tbsite.net%2C16020%2C1495647366007-splitting%2Fhadoop0448.et2.tbsite.net%252C16020%252C1495647366007.regiongroup-2.1499768090548
 to final state DONE hadoop1435.et2.tbsite.net,16020,1495647366458

The HMaster gets the task-done event and deletes the task:

*2017-07-11 20:49:52,879* INFO  [main-EventThread] 
coordination.SplitLogManagerCoordination: task 
/hbase/splitWAL/WALs%2Fhadoop0448.et2.tbsite.net%2C16020%2C1495647366007-splitting%2Fhadoop0448.et2.tbsite.net%252C16020%252C1495647366007.regiongroup-2.1499768090548
 entered state: DONE hadoop1435.et2.tbsite.net,16020,1495647366458

*2017-07-11 20:49:52,881* INFO  [main-EventThread] 
coordination.SplitLogManagerCoordination: Done splitting 
/hbase/splitWAL/WALs%2Fhadoop0448.et2.tbsite.net%2C16020%2C1495647366007-splitting%2Fhadoop0448.et2.tbsite.net%252C16020%252C1495647366007.regiongroup-2.1499768090548

*2017-07-11 21:19:52,280* DEBUG [main-EventThread] 
coordination.ZKSplitLogManagerCoordination$DeleteAsyncCallback: deleted 
/hbase/splitWAL/WALs%2Fhadoop0448.et2.tbsite.net%2C16020%2C1495647366007-splitting%2Fhadoop0448.et2.tbsite.net%252C16020%252C1495647366007.regiongroup-2.1499768090548
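
To make the delays concrete, the gaps between the logged events can be computed from the timestamps above. This is only a small sketch: the event labels and the `gaps` helper are illustrative names, and only the timestamps come from the logs. It shows that the worker finished the split a little over two minutes after acquiring the task, while the master saw the DONE state about 24 minutes after that.

```python
from datetime import datetime

# Log4j-style timestamp: date, time, then milliseconds after a comma.
FMT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps copied from the log lines above; the labels are illustrative.
events = [
    ("master put up task", "2017-07-11 20:22:57,608"),
    ("worker acquired task", "2017-07-11 20:23:33,689"),
    ("worker set task DONE", "2017-07-11 20:25:47,131"),
    ("master saw task DONE", "2017-07-11 20:49:52,879"),
    ("master saw znode deleted", "2017-07-11 21:19:52,280"),
]


def gaps(events):
    """Return (from_label, to_label, timedelta) for each consecutive pair."""
    parsed = [(name, datetime.strptime(ts, FMT)) for name, ts in events]
    return [(a[0], b[0], b[1] - a[1]) for a, b in zip(parsed, parsed[1:])]


for start, end, delta in gaps(events):
    print(f"{start} -> {end}: {delta}")
```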


> Reduce zk request when doing split log
> --------------------------------------
>
>                 Key: HBASE-19290
>                 URL: https://issues.apache.org/jira/browse/HBASE-19290
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: binlijin
>            Assignee: binlijin
>         Attachments: HBASE-19290.master.001.patch, 
> HBASE-19290.master.002.patch, HBASE-19290.master.003.patch, 
> HBASE-19290.master.004.patch
>
>
> We observed that once a cluster has 1000+ nodes, and hundreds of nodes abort 
> and start splitting logs, the split is very slow. We found that the 
> regionservers and the master wait on zookeeper responses, so we need to reduce 
> zookeeper requests and pressure for big clusters.
> (1) Reduce requests to rsZNode. Every call to calculateAvailableSplitters 
> gets rsZNode's children from zookeeper, which is heavy when the cluster is huge. 
> This patch reduces these requests. 
> (2) When a regionserver already has the maximum number of split tasks running, 
> it may still try to grab tasks and issue zookeeper requests. It should sleep 
> and wait until it can grab tasks again.  
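
The two ideas in the description can be illustrated with a rough, generic sketch. It is not the code from the attached patches and does not use any HBase or ZooKeeper API. `CachedServerCount`, `try_grab`, the time-based refresh, and the one-second backoff are all hypothetical choices made here for illustration; the patch may decide when to re-read rsZNode differently.

```python
import time


class CachedServerCount:
    """Hypothetical helper: remember the number of live regionservers and
    only re-read it (e.g. the children of rsZNode) after refresh_seconds."""

    def __init__(self, fetch, refresh_seconds, clock=time.monotonic):
        self._fetch = fetch              # callable doing the expensive read
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        stale = (self._fetched_at is None
                 or now - self._fetched_at >= self._refresh_seconds)
        if stale:
            self._value = self._fetch()  # the only place a request is issued
            self._fetched_at = now
        return self._value


def try_grab(running_tasks, max_tasks, grab, sleep, backoff_seconds=1.0):
    """Hypothetical worker step: at capacity, sleep instead of issuing
    another request; otherwise call grab(). Returns True if grab() ran."""
    if running_tasks >= max_tasks:
        sleep(backoff_seconds)
        return False
    grab()
    return True
```

With this shape, a worker that is already at its task limit issues no requests at all while it waits, and repeated capacity calculations within the refresh window reuse the cached count instead of listing the znode's children each time.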



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
