[
https://issues.apache.org/jira/browse/ROCKETMQ-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139963#comment-16139963
]
ASF GitHub Bot commented on ROCKETMQ-272:
-----------------------------------------
GitHub user evthoriz opened a pull request:
https://github.com/apache/incubator-rocketmq/pull/153
[ROCKETMQ-272] Fix sync slave timeout when using SYNC_MASTER
Jira: https://issues.apache.org/jira/browse/ROCKETMQ-272
The timeout logic doesn't work correctly.
A thread waiting in GroupTransferService may be woken up frequently by
ReadSocketService in HAConnection.
As a result, the transfer logic may return early and wake up the thread waiting
in the HA handling, which makes the timeout value in HA handling useless.
This patch repairs the timeout logic for syncing, and also introduces an
option `syncSlaveTimeout` in `MessageStoreConfig` to distinguish it from the disk
flush timeout option.
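To illustrate the kind of problem described above, here is a minimal sketch of a wait that keeps its full timeout even when the waiting thread is notified early. It is not the code from this patch: the class `DeadlineWait` and its methods `waitFor`, `complete` and `earlyWakeup` are hypothetical names used only for this example. The idea is to compute a deadline once and, after every wakeup, re-check the condition and wait again for the remaining time.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not the RocketMQ implementation.
final class DeadlineWait {
    private final Object lock = new Object();
    private boolean done = false;

    /** Marks the work as finished and wakes up all waiters. */
    void complete() {
        synchronized (lock) {
            done = true;
            lock.notifyAll();
        }
    }

    /** Wakes up waiters without finishing, imitating an unrelated notification. */
    void earlyWakeup() {
        synchronized (lock) {
            lock.notifyAll();
        }
    }

    /**
     * Waits until complete() is called or timeoutMillis has elapsed.
     * Early wakeups do not shorten the wait, because the deadline is fixed
     * up front and the condition is re-checked in a loop.
     */
    boolean waitFor(long timeoutMillis) throws InterruptedException {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        synchronized (lock) {
            while (!done) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    return false; // timed out
                }
                // Object.wait(0) waits forever, so round up to at least 1 ms.
                long remainingMillis = Math.max(1L, TimeUnit.NANOSECONDS.toMillis(remainingNanos));
                lock.wait(remainingMillis);
            }
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        DeadlineWait request = new DeadlineWait();
        // A noisy thread that keeps notifying without ever completing the request.
        Thread noisy = new Thread(() -> {
            for (int i = 0; i < 20; i++) {
                request.earlyWakeup();
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        noisy.start();
        long start = System.nanoTime();
        boolean ok = request.waitFor(300);
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        noisy.join();
        System.out.println("completed=" + ok + ", waited at least 300 ms: " + (elapsedMs >= 300));
    }
}
```

A single `lock.wait(timeoutMillis)` call without the loop would return on the first notification, which matches the symptom reported in the issue: a timeout result well before the configured timeout has passed.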
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/evthoriz/incubator-rocketmq debug-ha
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-rocketmq/pull/153.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #153
----
commit 6f2501a24a701368b6213fd5acb3355ebdaafeb6
Author: evthoriz <[email protected]>
Date: 2017-08-24T11:50:20Z
[ROCKETMQ-272] Fix sync slave timeout when using SYNC_MASTER
----
> The config `syncFlushTimeout` doesn't work for SYNC_MASTER
> ----------------------------------------------------------
>
> Key: ROCKETMQ-272
> URL: https://issues.apache.org/jira/browse/ROCKETMQ-272
> Project: Apache RocketMQ
> Issue Type: Bug
> Components: rocketmq-broker
> Affects Versions: 4.1.0-incubating
> Reporter: Yu Kaiyuan
> Assignee: yukon
>
> It's quite frequent to get the result `sendStatus=FLUSH_SLAVE_TIMEOUT` when
> sending big messages (>500k) in a SYNC_MASTER/SLAVE scenario.
> As far as I can tell, the timeout value currently used by the sync process is
> the config `syncFlushTimeout`, and its default value is 5000 milliseconds.
> But the producer gets the result `FLUSH_SLAVE_TIMEOUT` in less than
> 1 second.
> So why does the config not work as expected?
> Relevant code:
> {code:java}
> // CommitLog.java
> public void handleHA(AppendMessageResult result, PutMessageResult putMessageResult, MessageExt messageExt) {
>     if (BrokerRole.SYNC_MASTER == this.defaultMessageStore.getMessageStoreConfig().getBrokerRole()) {
>         HAService service = this.defaultMessageStore.getHaService();
>         if (messageExt.isWaitStoreMsgOK()) {
>             // Determine whether to wait
>             if (service.isSlaveOK(result.getWroteOffset() + result.getWroteBytes())) {
>                 GroupCommitRequest request = new GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes());
>                 service.putRequest(request);
>                 service.getWaitNotifyObject().wakeupAll();
>                 boolean flushOK =
>                     request.waitForFlush(this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout());
>                 if (!flushOK) {
>                     log.error("do sync transfer other node, wait return, but failed, topic: " + messageExt.getTopic() + " tags: "
>                         + messageExt.getTags() + " client address: " + messageExt.getBornHostNameString());
>                     putMessageResult.setPutMessageStatus(PutMessageStatus.FLUSH_SLAVE_TIMEOUT);
>                 }
>             }
>             // Slave problem
>             else {
>                 // Tell the producer, slave not available
>                 putMessageResult.setPutMessageStatus(PutMessageStatus.SLAVE_NOT_AVAILABLE);
>             }
>         }
>     }
> }
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)