Repository: incubator-geode
Updated Branches:
  refs/heads/develop 3bdd10497 -> 3822c9053
GEODE-2047 Document change to enable-network-partition-detection

- This is a subtask of GEODE-762.
- The default value of property enable-network-partition-detection changed from false to true, enabling partition detection by default, so all documentation that discusses partition detection also needs to change.
- Fixed a minor typo or two encountered in the files that were being updated.


Project: http://git-wip-us.apache.org/repos/asf/incubator-geode/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-geode/commit/8f14a744
Tree: http://git-wip-us.apache.org/repos/asf/incubator-geode/tree/8f14a744
Diff: http://git-wip-us.apache.org/repos/asf/incubator-geode/diff/8f14a744

Branch: refs/heads/develop
Commit: 8f14a744c6bc51c422e4f292dc67219f740dc7ba
Parents: 820f33e
Author: Karen Miller <[email protected]>
Authored: Mon Oct 31 16:45:29 2016 -0700
Committer: Karen Miller <[email protected]>
Committed: Tue Nov 1 13:52:22 2016 -0700

----------------------------------------------------------------------
 .../handling_network_partitioning.html.md.erb   | 28 +++++++++++---------
 ...rk_partitioning_management_works.html.md.erb |  7 +++--
 ...ring_conflicting_data_exceptions.html.md.erb |  4 +--
 .../recovering_from_network_outages.html.md.erb | 11 ++------
 .../system_failure_and_recovery.html.md.erb     |  6 ++---
 .../topics/gemfire_properties.html.md.erb       |  4 +--
 6 files changed, 27 insertions(+), 33 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-geode/blob/8f14a744/geode-docs/managing/network_partitioning/handling_network_partitioning.html.md.erb
----------------------------------------------------------------------
diff --git a/geode-docs/managing/network_partitioning/handling_network_partitioning.html.md.erb b/geode-docs/managing/network_partitioning/handling_network_partitioning.html.md.erb
index 61a2576..a227597 100644
--- a/geode-docs/managing/network_partitioning/handling_network_partitioning.html.md.erb
+++ b/geode-docs/managing/network_partitioning/handling_network_partitioning.html.md.erb
@@ -19,23 +19,24 @@ See the License for the specific language governing permissions and
 limitations under the License.
 -->
-This section lists the configuration steps for network partition detection.
+This section lists configuration considerations relating to network partition detection.
 <a id="handling_network_partitioning__section_EAF1957B6446491A938DEFB06481740F"></a>
 The system uses a combination of member coordinators and system members, designated as lead members, to detect and resolve network partitioning problems.
-1. Network partition detection works in all environments. Using multiple locators mitigates the effect of network partitioning. See [Configuring Peer-to-Peer Discovery](../../topologies_and_comm/p2p_configuration/setting_up_a_p2p_system.html).
-2. Enable partition detection consistently in all system members by setting this in their `gemfire.properties` file:
+- Network partition detection works in all environments. Using multiple locators mitigates the effect of network partitioning. See [Configuring Peer-to-Peer Discovery](../../topologies_and_comm/p2p_configuration/setting_up_a_p2p_system.html).
+
+- Network partition detection is enabled by default. The default setting in the `gemfire.properties` file is
 ``` pre
 enable-network-partition-detection=true
 ```
- Enable network partition detection in all locators and in any other process that should be sensitive to network partitioning. Processes that do not have network partition detection enabled are not eligible to be the lead member, so their failure will not trigger declaration of a network partition.
+ Processes that do not have network partition detection enabled are not eligible to be the lead member, so their failure will not trigger declaration of a network partition.
- All system members should have the same setting for `enable-network-partition-detection`. If they don’t, the system throws a `GemFireConfigException` upon startup.
+ All system members should have the same setting for `enable-network-partition-detection`. If they do not, the system throws a `GemFireConfigException` upon startup.
-3. You must set `enable-network-partition-detection` to true if you are using persistent partitioned regions. You **must** set `enable-network-partition-detection` to true if you are using persistent regions (partitioned or replicated). If you create a persistent region and `enable-network-partition-detection` to set to false, you will receive the following warning message:
+- The property `enable-network-partition-detection` must be true if you are using either partitioned or persistent regions. If you create a persistent region and `enable-network-partition-detection` is set to false, you will receive the following warning message:
 ``` pre
 Creating persistent region {0}, but enable-network-partition-detection is set to false.
@@ -43,9 +44,9 @@ The system uses a combination of member coordinators and system members, designa
 event of a network split."
 ```
-4. Configure regions you want to protect from network partitioning with `DISTRIBUTED_ACK` or `GLOBAL` `scope`. Do not use `DISTRIBUTED_NO_ACK` `scope`. The region configurations provided in the region shortcut settings use `DISTRIBUTED_ACK` scope. This setting prevents operations from performed throughout the distributed system before a network partition is detected.
+- Configure regions you want to protect from network partitioning with a scope setting of `DISTRIBUTED_ACK` or `GLOBAL`. Do not use `DISTRIBUTED_NO_ACK` scope. This prevents operations from being performed throughout the distributed system before a network partition is detected.
 **Note:**
- GemFire issues an alert if it detects distributed-no-ack regions when network partition detection is enabled:
+ GemFire issues an alert if it detects `DISTRIBUTED_NO_ACK` regions when network partition detection is enabled:
 ``` pre
 Region {0} is being created with scope {1} but enable-network-partition-detection is enabled in the distributed system.
@@ -53,11 +54,12 @@ The system uses a combination of member coordinators and system members, designa
 ```
-5. These other configuration parameters affect or interact with network partitioning detection. Check whether they are appropriate for your installation and modify as needed.
- - If you have network partition detection enabled, the threshold percentage value for allowed membership weight loss is automatically configured to 51. You cannot modify this value. (**Note:** The weight loss calculation uses standard rounding. Therefore, a value of 50.51 is rounded to 51 and will cause a network partition.)
- - Failure detection is initiated if a member's `gemfire.properties` `ack-wait-threshold` (default is 15 seconds) and `ack-severe-alert-threshold` (15 seconds) elapses before receiving a response to a message. If you modify the `ack-wait-threshold` configuration value, you should modify `ack-severe-alert-threshold` to match the other configuration value.
- - If the system has clients connecting to it, the clients' `cache.xml` `<cache> <pool> read-timeout` should be set to at least three times the `member-timeout` setting in the server's `gemfire.properties`. The default `<cache> <pool> read-timeout` setting is 10000 milliseconds.
+- These other configuration parameters affect or interact with network partitioning detection. Check whether they are appropriate for your installation and modify as needed.
+ - If you have network partition detection enabled, the threshold percentage value for allowed membership weight loss is automatically configured to 51. You cannot modify this value. **Note:** The weight loss calculation uses round to nearest. Therefore, a value of 50.51 is rounded to 51 and will cause a network partition.
+ - Failure detection is initiated if a member's `ack-wait-threshold` (default is 15 seconds) and `ack-severe-alert-threshold` (15 seconds) properties elapse before receiving a response to a message. If you modify the `ack-wait-threshold` configuration value, you should modify `ack-severe-alert-threshold` to match the other configuration value.
+ - If the system has clients connecting to it, the clients' `cache.xml` pool `read-timeout` should be set to at least three times the `member-timeout` setting in the server's `gemfire.properties` file. The default pool `read-timeout` setting is 10000 milliseconds.
 - You can adjust the default weights of members by specifying the system property `gemfire.member-weight` upon startup. For example, if you have some VMs that host a needed service, you could assign them a higher weight upon startup.
- - By default, members that are forced out of the distributed system by a network partition event will automatically restart and attempt to reconnect. Data members will attempt to reinitialize the cache. See [Handling Forced Cache Disconnection Using Autoreconnect](../autoreconnect/member-reconnect.html).
+
+- By default, members that are forced out of the distributed system by a network partition event will automatically restart and attempt to reconnect. Data members will attempt to reinitialize the cache. See [Handling Forced Cache Disconnection Using Autoreconnect](../autoreconnect/member-reconnect.html).


http://git-wip-us.apache.org/repos/asf/incubator-geode/blob/8f14a744/geode-docs/managing/network_partitioning/how_network_partitioning_management_works.html.md.erb
----------------------------------------------------------------------
diff --git a/geode-docs/managing/network_partitioning/how_network_partitioning_management_works.html.md.erb b/geode-docs/managing/network_partitioning/how_network_partitioning_management_works.html.md.erb
index e971634..93a14ac 100644
--- a/geode-docs/managing/network_partitioning/how_network_partitioning_management_works.html.md.erb
+++ b/geode-docs/managing/network_partitioning/how_network_partitioning_management_works.html.md.erb
@@ -24,10 +24,9 @@ Geode handles network outages by using a weighting system to determine whether t
 <a id="how_network_partitioning_management_works__section_548146BB8C24412CB7B43E6640272882"></a>
 Individual members are each assigned a weight, and the quorum is determined by comparing the total weight of currently responsive members to the previous total weight of responsive members.
-Your distributed system can split into separate running systems when members lose the ability to see each other. The typical cause of this problem is a failure in the network. When a partitioned system is detected, Apache Geode only one side of the system keeps running and the other side automatically shuts down.
+Your distributed system can split into separate running systems when members lose the ability to see each other. The typical cause of this problem is a failure in the network. When a partitioned system is detected, only one side of the system keeps running and the other side automatically shuts down.
-**Note:**
-The network partitioning detection feature is only enabled when `enable-network-partition-detection` is set to true in `gemfire.properties`. By default, this property is set to false. See [Configure Apache Geode to Handle Network Partitioning](handling_network_partitioning.html#handling_network_partitioning) for details. Quorum weight calculations are always performed and logged regardless of this configuration setting.
+The network partitioning detection feature is enabled by default with a true value for the `enable-network-partition-detection` property. See [Configure Apache Geode to Handle Network Partitioning](handling_network_partitioning.html#handling_network_partitioning) for details. Quorum weight calculations are always performed and logged regardless of this configuration setting.
 The overall process for detecting a network partition is as follows:
@@ -52,7 +51,7 @@ The overall process for detecting a network partition is as follows:
 - A new coordinator may have a stale view of membership if it did not see the last membership view sent by the previous (failed) coordinator. If new members were added during that failure, then the new members may be ignored when the first new view is sent out.
 - If members were removed during the fail over to the new coordinator, then the new coordinator will have to determine these losses during the view preparation step.
-6. With `enable-network-partition-detection` set to true, any member that detects that the total membership weight has dropped below 51% within a single membership view change (loss of quorum) declares a network partition event. The coordinator sends a network-partitioned-detected UDP message to all members (even to the non-responsive ones) and then closes the distributed system with a `ForcedDisconnectException`. If a member fails to receive the message before the coordinator closes the system, the member is responsible for detecting the event on its own.
+6. With a default value of `enable-network-partition-detection`, any member that detects that the total membership weight has dropped below 51% within a single membership view change (loss of quorum) declares a network partition event. The coordinator sends a network-partitioned-detected UDP message to all members (even to the non-responsive ones) and then closes the distributed system with a `ForcedDisconnectException`. If a member fails to receive the message before the coordinator closes the system, the member is responsible for detecting the event on its own.
 The presumption is that when a network partition is declared, the members that comprise a quorum will continue operations. The surviving members elect a new coordinator, designate a lead member, and so on.
http://git-wip-us.apache.org/repos/asf/incubator-geode/blob/8f14a744/geode-docs/managing/troubleshooting/recovering_conflicting_data_exceptions.html.md.erb
----------------------------------------------------------------------
diff --git a/geode-docs/managing/troubleshooting/recovering_conflicting_data_exceptions.html.md.erb b/geode-docs/managing/troubleshooting/recovering_conflicting_data_exceptions.html.md.erb
index 38375ae..4eade62 100644
--- a/geode-docs/managing/troubleshooting/recovering_conflicting_data_exceptions.html.md.erb
+++ b/geode-docs/managing/troubleshooting/recovering_conflicting_data_exceptions.html.md.erb
@@ -46,7 +46,7 @@ In this case the fix is simply to move aside or delete the persistent files for
 ## A Network Failure Occurs and Network Partitioning Detection is Disabled
-When `enable-network-partition-detection` is set to true, Geode will detect a network partition and shut down unreachable members to prevent a network partition ("split brain") from occurring. No conflicts should occur when the system is healed.
+When `enable-network-partition-detection` is set to the default value of true, Geode will detect a network partition and shut down unreachable members to prevent a network partition ("split brain") from occurring. No conflicts should occur when the system is healed.
 However if `enable-network-partition-detection` is false, Geode will not detect the network partition. Instead, each side of the network partition will end up recording that the other side of the partition has stale data. When the partition is healed and persistent members are restarted, the members will report a conflict because both sides of the partition think the other members are stale.
@@ -54,7 +54,7 @@ In some cases it may be possible to choose between sides of the network partitio
 ## Salvaging Data
-If you receive a ConflictingPersistentDataException, you will not be able to start all of your members and have them join the same distributed system. You have some members with conflicting data.
+If you receive a `ConflictingPersistentDataException`, you will not be able to start all of your members and have them join the same distributed system. You have some members with conflicting data.
 First, see if there is part of the system that you can recover. For example if you just added some new members to the system, try to start up without including those members.


http://git-wip-us.apache.org/repos/asf/incubator-geode/blob/8f14a744/geode-docs/managing/troubleshooting/recovering_from_network_outages.html.md.erb
----------------------------------------------------------------------
diff --git a/geode-docs/managing/troubleshooting/recovering_from_network_outages.html.md.erb b/geode-docs/managing/troubleshooting/recovering_from_network_outages.html.md.erb
index 8c23bea..f798b2b 100644
--- a/geode-docs/managing/troubleshooting/recovering_from_network_outages.html.md.erb
+++ b/geode-docs/managing/troubleshooting/recovering_from_network_outages.html.md.erb
@@ -23,16 +23,9 @@ The safest response to a network outage is to restart all the processes and brin
 However, if you know the architecture of your system well, and you are sure you won’t be resurrecting old data, you can do a selective restart. At the very least, you must restart all the members on one side of the network failure, because a network outage causes separate distributed systems that can’t rejoin automatically.
-- [What Happens During a Network Outage](recovering_from_network_outages.html#rec_network_crash__section_900657018DC048EE9BE6A8064FAE48FD)
-- [Recovery Procedure](recovering_from_network_outages.html#rec_network_crash__section_F9A0C31AE25C4E7185DF3B1A8486BDFA)
-- [Effect of Network Failure on Partitioned Regions](recovering_from_network_outages.html#rec_network_crash__section_9914A63673E64EA1ADB6B6767879F0FF)
-- [Effect of Network Failure on Distributed Regions](recovering_from_network_outages.html#rec_network_crash__section_7AD5624F3CD748C0BC163562B26B2DCE)
-- [Effect of Network Failure on Persistent Regions](#rec_network_crash__section_arm_pnr_3q)
-- [Effect of Network Failure on Client/Server Installations](recovering_from_network_outages.html#rec_network_crash__section_18AEEB6CC8004C3388CCB01F988B0422)
-
 ## <a id="rec_network_crash__section_900657018DC048EE9BE6A8064FAE48FD" class="no-quick-link"></a>What Happens During a Network Outage
-When the network connecting members of a distributed system goes down, system members treat this like a machine crash. Members on each side of the network failure respond by removing the members on the other side from the membership list. If network partitioning detection is enabled, the partition that contains sufficient quorum (> 51% based on member weight) will continue to operate, while the other partition with insufficient quorum will shut down. See [Network Partitioning](../network_partitioning/chapter_overview.html#network_partitioning) for a detailed explanation on how this detection system operates.
+When the network connecting members of a distributed system goes down, system members treat this like a machine crash. Members on each side of the network failure respond by removing the members on the other side from the membership list. If network partitioning detection is enabled (the default), the partition that contains sufficient quorum (> 51% based on member weight) will continue to operate, while the other partition with insufficient quorum will shut down. See [Network Partitioning](../network_partitioning/chapter_overview.html#network_partitioning) for a detailed explanation on how this detection system operates.
 In addition, members that have been disconnected either via network partition or due to unresponsiveness will automatically try to reconnect to the distributed system unless configured otherwise. See [Handling Forced Cache Disconnection Using Autoreconnect](../autoreconnect/member-reconnect.html).
@@ -62,7 +55,7 @@ When the network recovers, the members may be able to see each other again, but
 A network failure when using persistent regions can cause conflicts in your persisted data. When you recover your system, you will likely encounter `ConflictingPersistentDataException`s when members start up.
-For this reason, you must configure `enable-network-partition-detection` to `true` if you are using persistent regions.
+For this reason, `enable-network-partition-detection` must be set to true if you are using persistent regions.
 For information on how to recover from `ConflictingPersistentDataException` errors should they occur, see [Recovering from ConfictingPersistentDataExceptions](recovering_conflicting_data_exceptions.html#topic_ghw_z2m_jq).
http://git-wip-us.apache.org/repos/asf/incubator-geode/blob/8f14a744/geode-docs/managing/troubleshooting/system_failure_and_recovery.html.md.erb
----------------------------------------------------------------------
diff --git a/geode-docs/managing/troubleshooting/system_failure_and_recovery.html.md.erb b/geode-docs/managing/troubleshooting/system_failure_and_recovery.html.md.erb
index d94ea60..cce80d0 100644
--- a/geode-docs/managing/troubleshooting/system_failure_and_recovery.html.md.erb
+++ b/geode-docs/managing/troubleshooting/system_failure_and_recovery.html.md.erb
@@ -181,7 +181,7 @@ There are no processes eligible to be group membership coordinator
 Description:
-Network partition detection is enabled (enable-network-partition-detection is set to true), and there are locator problems.
+Network partition detection is enabled, and there are locator problems.
 Response:
@@ -197,7 +197,7 @@ There are no processes eligible to be group membership coordinator
 Description:
-Network partition detection is enabled (enable-network-partition-detection is set to true), and there are locator problems.
+Network partition detection is enabled, and there are locator problems.
 Response:
@@ -212,7 +212,7 @@ Unable to contact any locators and network partition detection is enabled
 Description:
-Network partition detection is enabled (enable-network-partition-detection is set to true), and there are locator problems.
+Network partition detection is enabled, and there are locator problems.
 Response:


http://git-wip-us.apache.org/repos/asf/incubator-geode/blob/8f14a744/geode-docs/reference/topics/gemfire_properties.html.md.erb
----------------------------------------------------------------------
diff --git a/geode-docs/reference/topics/gemfire_properties.html.md.erb b/geode-docs/reference/topics/gemfire_properties.html.md.erb
index 9882568..ae0f198 100644
--- a/geode-docs/reference/topics/gemfire_properties.html.md.erb
+++ b/geode-docs/reference/topics/gemfire_properties.html.md.erb
@@ -160,8 +160,8 @@ See <a href="../../managing/autoreconnect/member-reconnect.html">Handling Forced
 </tr>
 <tr class="odd">
 <td>enable-network-partition-detection</td>
-<td>Boolean instructing the system to detect and handle splits in the distributed system, typically caused by a partitioning of the network (split brain) where the distributed system is running. We recommend setting this property to <code class="ph codeph">true</code>. You must set this property to the same value across all your distributed system members. In addition, you must set this property to <code class="ph codeph">true</code> if you are using persistent regions and configure your regions to use DISTRIBUTED_ACK or GLOBAL scope to avoid potential data conflicts.</td>
-<td>false</td>
+<td>Boolean instructing the system to detect and handle splits in the distributed system, typically caused by a partitioning of the network (split brain) where the distributed system is running. You must set this property to the same value across all your distributed system members. In addition, this property must be set to <code class="ph codeph">true</code> if you are using persistent regions and configure your regions to use DISTRIBUTED_ACK or GLOBAL scope to avoid potential data conflicts.</td>
+<td>true</td>
 </tr>
 <tr class="even">
 <td>enable-cluster-configuration</td>
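The pages touched by this commit revolve around a handful of interacting settings. As an illustration only, and not part of the commit, here is a minimal sketch of a `gemfire.properties` fragment that follows the updated guidance; the 5000 millisecond `member-timeout` is an assumed example value rather than something taken from the diff above.

``` pre
# Hypothetical gemfire.properties sketch (illustration only, not part of this commit).
# All members of the distributed system must use the same value here;
# after GEODE-2047 the default is already true.
enable-network-partition-detection=true
# Both thresholds are in seconds and should be modified together (each defaults to 15).
ack-wait-threshold=15
ack-severe-alert-threshold=15
# Assumed example value, in milliseconds.
member-timeout=5000
```

With that assumed `member-timeout` of 5000 milliseconds, the three-times guideline in handling_network_partitioning.html.md.erb would put a client pool `read-timeout` at 15000 milliseconds or higher.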

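The same handling_network_partitioning.html.md.erb page mentions raising the weight of members that host a needed service by setting the `gemfire.member-weight` system property at startup. A hypothetical way to pass it through gfsh's `--J` option, with the member name and the weight of 20 invented for the example, is:

``` pre
gfsh>start server --name=serviceHost1 --J=-Dgemfire.member-weight=20
```

Because member weights feed the quorum calculation described in how_network_partitioning_management_works.html.md.erb, a partition that contains the higher-weight members is more likely to retain the required share of the total weight and keep running after a split.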