[
https://issues.apache.org/jira/browse/ARTEMIS-3045?focusedWorklogId=548981&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-548981
]
ASF GitHub Bot logged work on ARTEMIS-3045:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 06/Feb/21 05:46
Start Date: 06/Feb/21 05:46
Worklog Time Spent: 10m
Work Description: franz1981 edited a comment on pull request #3392:
URL: https://github.com/apache/activemq-artemis/pull/3392#issuecomment-774405162
@clebertsuconic @michaelandrepearce @jbertram @brusdev @gtully
This change seems to perform best, but I cannot say I am satisfied, because
I see a problem both in this and in the original implementation, i.e. we are not
propagating back-pressure to the replicated journal and beyond.
This reminds me of
https://cassandra.apache.org/blog/2020/09/03/improving-resiliency.html for those
who are interested.
TLDR:
- in the master implementation we can make the broker go OOM by adding
too many Runnables to the replication stream, because it can block awaiting
Netty writability for 30 seconds and stop consuming the tasks
- in the first implementation of this PR, the JCTools queue of outstanding
packet requests can grow unbounded for the same reason, making the
broker go OOM again
- in the latest implementation of this PR, the Netty internal outbound
(off-heap) buffer can grow unbounded
Although the first 2 solutions seem better at first look, because they wait
for enough room in the Netty buffer, we can still get OOM under the same
circumstances, i.e. while awaiting the backup to catch up.
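The common failure mode in all three cases is an unbounded buffer of pending replication work. A minimal, hypothetical sketch of the alternative (names are illustrative, not the actual Artemis API): a bounded queue whose rejected offer tells the caller to apply back-pressure instead of letting the heap grow.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch, not the Artemis implementation: a bounded queue of
// pending replication tasks. When the backup stalls and the queue fills up,
// offer() returns false and the caller can propagate back-pressure
// (pause reads, withhold credits) instead of going OOM.
public final class BoundedReplicationQueue {
    private final BlockingQueue<Runnable> pending;

    public BoundedReplicationQueue(int capacity) {
        this.pending = new ArrayBlockingQueue<>(capacity);
    }

    /** Returns false when the queue is full: the signal to back-pressure. */
    public boolean tryEnqueue(Runnable task) {
        return pending.offer(task);
    }

    /** Drained by the replication stream as the backup catches up. */
    public Runnable poll() {
        return pending.poll();
    }

    public int size() {
        return pending.size();
    }
}
```

The point is not the queue itself but that the `false` return value gives the broker a place to react before memory is exhausted, which none of the three variants above currently has.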
We should:
- track the amount of pending work for monitoring
- try to propagate back-pressure all the way to the clients
But given that Artemis clients are of very different types, we should probably
(similar to Cassandra) setAutoRead false on client connections, although that
means we rely on the clients to save themselves from OOM.
Or, depending on the client type, we can stop sending credits back to clients
to slow them down.
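The credit-based option could be as simple as a high-water mark on the replication backlog: credits are only granted to producers while the backlog is below it. A hypothetical sketch (all names invented for illustration):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical credit gate, not the Artemis flow-control code: the broker
// checks mayGrantCredits() before sending producer credits back, so a
// stalled backup slows producers down instead of growing memory unbounded.
public final class CreditGate {
    private final int highWaterMark;
    private final AtomicInteger backlog = new AtomicInteger();

    public CreditGate(int highWaterMark) {
        this.highWaterMark = highWaterMark;
    }

    /** Called when a packet is queued for replication. */
    public void onReplicationQueued() {
        backlog.incrementAndGet();
    }

    /** Called when the backup acknowledges a replicated packet. */
    public void onBackupAcked() {
        backlog.decrementAndGet();
    }

    /** Broker checks this before issuing new producer credits. */
    public boolean mayGrantCredits() {
        return backlog.get() < highWaterMark;
    }
}
```

The same counter could feed the monitoring metric mentioned above, since it is exactly the "amount of pending work".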
At worst we could stop accepting new client connections too (but that is too
drastic, because maybe they won't make the broker replicate anything).
I cannot say what the best option is here, nor whether we already use some form
of end-to-end protection that I am not seeing, but it doesn't seem to be the
case, given that many parallel clients can still cause overload well before
receiving notification of a durable local write + backup acknowledgment.
Any thoughts?
IMO solving this correctly can bring a huge performance increase along with
improved stability.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 548981)
Time Spent: 4h 40m (was: 4.5h)
> ReplicationManager can batch sent replicated packets
> ----------------------------------------------------
>
> Key: ARTEMIS-3045
> URL: https://issues.apache.org/jira/browse/ARTEMIS-3045
> Project: ActiveMQ Artemis
> Issue Type: Improvement
> Reporter: Francesco Nigro
> Assignee: Francesco Nigro
> Priority: Major
> Time Spent: 4h 40m
> Remaining Estimate: 0h
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)