[ https://issues.apache.org/jira/browse/ARTEMIS-2941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17222125#comment-17222125 ]
ASF subversion and git services commented on ARTEMIS-2941:
----------------------------------------------------------
Commit 647151b0aff8f1245735bfbc6e8d22d1cdee0afb in activemq-artemis's branch
refs/heads/master from gtully
[ https://gitbox.apache.org/repos/asf?p=activemq-artemis.git;h=647151b ]
ARTEMIS-2941 - renew tasks are nearly always a little late, make this test more
tolerant of that
> Improve JDBC HA connection resiliency
> -------------------------------------
>
> Key: ARTEMIS-2941
> URL: https://issues.apache.org/jira/browse/ARTEMIS-2941
> Project: ActiveMQ Artemis
> Issue Type: Improvement
> Components: Broker
> Affects Versions: 2.15.0
> Reporter: Francesco Nigro
> Assignee: Francesco Nigro
> Priority: Major
> Time Spent: 2h
> Remaining Estimate: 0h
>
> This aims to replace the restart enhancement feature of
> https://issues.apache.org/jira/browse/ARTEMIS-2918, because the latter is
> too dangerous: allowing a server in production to restart while keeping the
> Java process around exposes it to numerous potential leaks.
> Currently, JDBC HA uses an expiration time on locks that marks the time until
> which a server instance is allowed to keep a specific role, depending on the
> lock it owns (live or backup).
> Right now, the first failed attempt to renew this expiration time forces a
> broker to shut down immediately, while it could be more "relaxed" and just
> keep retrying until the very end, i.e. until the expiration time is about to
> elapse.
>
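
The retry-until-expiration behaviour described above could look roughly like
the following. This is only a minimal sketch, not the Artemis implementation:
LeaseRenewalSketch, Clock, renewUntilExpiration and all parameter names are
hypothetical, and the actual renewal call against the DBMS is represented by
an injected BooleanSupplier.

```java
import java.util.function.BooleanSupplier;

final class LeaseRenewalSketch {

    /** Hypothetical clock abstraction, so the loop can be exercised without real waiting. */
    interface Clock {
        long nowMillis();
        void sleepMillis(long millis);
    }

    /**
     * Keeps retrying the renewal until it succeeds or the local expiration
     * deadline is reached, instead of giving up on the first failure.
     *
     * @return true if the lease was renewed in time, false if the deadline
     *         passed and the caller has to give up its role
     */
    static boolean renewUntilExpiration(BooleanSupplier tryRenew,
                                        long expirationDeadlineMillis,
                                        long retryPeriodMillis,
                                        Clock clock) {
        while (clock.nowMillis() < expirationDeadlineMillis) {
            boolean renewed;
            try {
                renewed = tryRenew.getAsBoolean();
            } catch (RuntimeException e) {
                // Assumption: a lost connection surfaces as an unchecked exception.
                renewed = false;
            }
            if (renewed) {
                return true;
            }
            long remaining = expirationDeadlineMillis - clock.nowMillis();
            if (remaining <= 0) {
                break;
            }
            // Never sleep past the deadline.
            clock.sleepMillis(Math.min(retryPeriodMillis, remaining));
        }
        return false;
    }
}
```

A caller would shut the broker down only when the method returns false, which
is the point where the locally known expiration time has actually run out.
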
> The only concern with this feature is the relation between the broker
> wall-clock time and the DBMS time: the latter is used to set the expiration
> time, and the two should agree within certain margins.
> For this last part, I'm aware that classic ActiveMQ lease locks use a
> configuration parameter to set the magnitude of the allowed difference (and
> to compute a base offset too).
>
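
To make the clock concern concrete, here is a small sketch of how a broker
might estimate its offset from the DBMS clock and check it against an allowed
margin. It is an illustration under assumptions, not the classic ActiveMQ or
Artemis code: ClockSkewSketch and its method and parameter names are
hypothetical, and the DBMS timestamp is assumed to have been read between two
local clock samples.

```java
final class ClockSkewSketch {

    /**
     * Estimates (DBMS time - broker time), assuming the DBMS timestamp was
     * taken roughly halfway between the two local samples.
     */
    static long estimateOffsetMillis(long localBeforeMillis,
                                     long dbNowMillis,
                                     long localAfterMillis) {
        long localMidpoint = localBeforeMillis + (localAfterMillis - localBeforeMillis) / 2;
        return dbNowMillis - localMidpoint;
    }

    /** True if the estimated offset stays within the configured margin. */
    static boolean withinAllowedDifference(long offsetMillis, long maxAllowedDifferenceMillis) {
        return Math.abs(offsetMillis) <= maxAllowedDifferenceMillis;
    }

    /** Translates an expiration time expressed in DBMS time into broker local time. */
    static long toLocalDeadlineMillis(long dbExpirationMillis, long offsetMillis) {
        return dbExpirationMillis - offsetMillis;
    }
}
```

With a deadline converted this way, the broker can compare the expiration
time against its own wall clock while retrying, as long as the offset stays
within the allowed margin.
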
> Right now this feature seems less risky and more appealing than
> https://issues.apache.org/jira/browse/ARTEMIS-2918, given that it narrows the
> scope to the very core issue, i.e. more resilient behaviour on lost JDBC
> connectivity.
>
> To understand the implications of this change, consider a shared store HA
> pair configured with a 60-second expiration time:
> # DBMS goes down
> # an in-flight persistent operation on the live data store causes the live
> broker to kill itself immediately, because no reliable storage is connected
> # backup is unable to renew its backup lease lock
> # DBMS comes back up in time, before the backup lock's local expiration time
> has elapsed
> # backup is able to renew its backup lease lock, retrieve the very last
> state of the former live and, if no script is configured to restart the
> live, fail over and take its role
> # backup is now live and able to serve clients
>
>
> There are 2 legitimate questions regarding potential improvements on this:
> # why can't the live keep retrying I/O (on the journal, paging or large
> messages) until its local expiration time ends?
> # why doesn't the live just return an I/O error to the clients?
>
> The former is complex: the main problem I see is resource utilization.
> Keeping an accumulating backlog of pending requests, blocked awaiting the
> last one for an arbitrarily long time, will probably cause the broker memory
> to blow up, not to mention that clients will time out too.
> The latter seems more appealing, because it would allow clients to fail fast,
> but it would affect the current semantics of the broker storage operations,
> and I need more investigation to understand how to implement it.
>
>