liyuzhou created ROCKETMQ-384:
---------------------------------

             Summary: broker flow control is abnormal when the machine's 
physical time drifts
                 Key: ROCKETMQ-384
                 URL: https://issues.apache.org/jira/browse/ROCKETMQ-384
             Project: Apache RocketMQ
          Issue Type: Improvement
          Components: rocketmq-broker
    Affects Versions: 4.2.0
         Environment: operating system: CentOS6.0
hardware: 8C8G VM
version: rocketmq-4.2.0
broker config: default 2m-2s-async config
producer qps: 4000
per-message size: 10 bytes
            Reporter: liyuzhou
            Assignee: yukon


    When I ran a performance test against the broker, I saw a large number of 
exceptions like this:
 send error com.alibaba.rocketmq.client.exception.MQBrokerException: CODE: 2 
DESC: [REJECTREQUEST]system busy, start flow control for a while
    After reading the related source code, I believe this exception occurs 
when the broker thinks the OS is busy, in which case it rejects incoming 
requests. However, the server's monitoring shows that JVM GC is normal 
(typically 10ms, at most 50ms, about once every 10s), and the server's CPU 
and disk IO are also healthy. The server's NTP offset, though, occasionally 
drifts by more than 2s, so I think the physical clock drift caused the flow 
control.
    Related code:
{code:java}
// CommitLog.java
public PutMessageResult putMessage(final MessageExtBrokerInner msg) {
    ... ...
    // now() is System.currentTimeMillis()
    long beginLockTimestamp = this.defaultMessageStore.getSystemClock().now();
    this.beginTimeInLock = beginLockTimestamp;
    ... ...
}

// DefaultMessageStore.java
public boolean isOSPageCacheBusy() {
    long begin = this.getCommitLog().getBeginTimeInLock();
    long diff = this.systemClock.now() - begin;

    return diff < 10000000
      && diff > this.messageStoreConfig.getOsPageCacheBusyTimeOutMills();
}
{code}

    Suppose the first request enters CommitLog's putMessage method when the 
physical clock reads 0s, and the server's physical clock then jumps forward 
to 2s. When the second request calls isOSPageCacheBusy to check whether the 
system is busy, diff is about 2s even though almost no real time has passed. 
If that exceeds osPageCacheBusyTimeOutMills, the broker rejects the second 
request purely because of the clock drift.
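
    The scenario above can be reproduced in isolation with a small sketch. 
This is not broker code: ClockDriftDemo and isBusy are hypothetical names, 
and the 1000ms threshold is an assumed stand-in for 
getOsPageCacheBusyTimeOutMills(). Only the comparison itself is copied from 
isOSPageCacheBusy.

```java
public class ClockDriftDemo {
    // Assumed threshold; stands in for
    // messageStoreConfig.getOsPageCacheBusyTimeOutMills().
    static final long BUSY_TIMEOUT_MILLIS = 1000;

    // Same comparison as isOSPageCacheBusy() in the quoted code,
    // with both timestamps passed in so the clock can be simulated.
    static boolean isBusy(long nowMillis, long beginTimeInLock) {
        long diff = nowMillis - beginTimeInLock;
        return diff < 10000000 && diff > BUSY_TIMEOUT_MILLIS;
    }

    public static void main(String[] args) {
        long begin = 0; // wall clock when putMessage records beginTimeInLock

        // 5ms of real time pass, no clock adjustment: not busy.
        System.out.println(isBusy(begin + 5, begin));        // false

        // Same 5ms of real time, but the wall clock was stepped forward 2s.
        System.out.println(isBusy(begin + 5 + 2000, begin)); // true
    }
}
```

    The second call reports "busy" although the lock was held for only 5ms, 
which matches the rejections seen in the test.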
    Should we replace System.currentTimeMillis() with System.nanoTime() to 
reduce these spurious request rejections?
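
    A minimal sketch of what the monotonic measurement could look like, 
assuming only the elapsed-time check needs to change. MonotonicElapsed, 
monotonicMillis and elapsedMillis are hypothetical names, not existing 
RocketMQ APIs. System.nanoTime() values are only meaningful as differences 
within one JVM, so any code that uses beginTimeInLock as a wall-clock 
timestamp would still need System.currentTimeMillis().

```java
import java.util.concurrent.TimeUnit;

public class MonotonicElapsed {
    // Hypothetical helper: a millisecond reading derived from
    // System.nanoTime(). It is not related to wall-clock time and is
    // not affected by NTP stepping the system clock.
    static long monotonicMillis() {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
    }

    // Elapsed milliseconds since a value previously returned by
    // monotonicMillis() in the same JVM.
    static long elapsedMillis(long beginMonotonicMillis) {
        return monotonicMillis() - beginMonotonicMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        long begin = monotonicMillis(); // would replace getSystemClock().now()
        Thread.sleep(50);               // simulated time spent holding the lock
        System.out.println("elapsed ms: " + elapsedMillis(begin));
    }
}
```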



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
