[ https://issues.apache.org/jira/browse/IGNITE-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303655#comment-16303655 ]
ASF GitHub Bot commented on IGNITE-5580:
----------------------------------------
GitHub user SomeFire opened a pull request:
https://github.com/apache/ignite/pull/3289
IGNITE-5580
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/SomeFire/ignite ignite-5580
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/ignite/pull/3289.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3289
----
commit bb9f8490140de45cbad7989ee6b2c7d3644ea85f
Author: Dmitrii Ryabov <somefireone@...>
Date: 2017-12-19T17:13:48Z
IGNITE-5580: Basic information about fail for TcpDiscoveryNodeFailedMessage.
commit 973039c82ff94a36e6ce5a5125b1709eb5803fdf
Author: Dmitrii Ryabov <somefireone@...>
Date: 2017-12-21T11:17:19Z
IGNITE-5580: Discovery history.
commit 93509d7f01d0e87b106b5973d7045c357517fcb4
Author: Dmitrii Ryabov <somefireone@...>
Date: 2017-12-26T08:42:47Z
IGNITE-5580: Test added.
----
> Improve node failure cause information
> --------------------------------------
>
> Key: IGNITE-5580
> URL: https://issues.apache.org/jira/browse/IGNITE-5580
> Project: Ignite
> Issue Type: Improvement
> Components: general
> Affects Versions: 1.7
> Reporter: Alexey Goncharuk
> Assignee: Ryabov Dmitrii
> Labels: observability
>
> When a node fails, we do not print any information about the root cause
> of the failure. This makes the failure extremely hard to investigate:
> I need to find the node preceding the failed node and check the
> logs on that node.
> I suggest that we add extensive information about the reason for the node
> failure and the sequence of events that led to it, e.g.:
> [time] [NODE] Sending a message to next node - failed _because_ - write
> timeout, read timeout, ...?
> [time] [NODE] Connection check - failed - why? Connection refused, handshake
> timed out, ...?
> ...
> [time] [NODE] Decided to drop the node because of the sequence above
> We may not need to print this information always, but we do need
> it when the troubleshooting logger is enabled.
> Also, DiscoverySpi should collect a set of the latest important events and dump
> these events in case of local node segmentation. This will allow users to
> match the events in the cluster with the events on the local node and get to the
> bottom of the failure.
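For illustration only, the "set of latest important events" requested in the description could be kept in a small bounded history that is dumped on segmentation. The sketch below is not taken from the pull request: the class DiscoveryEventHistory and its methods record, snapshot and dump are hypothetical names, and the entry layout simply mirrors the "[time] [NODE] action - outcome" lines from the description.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper (not part of Ignite): keeps the last N discovery events. */
final class DiscoveryEventHistory {
    private final int capacity;
    private final Deque<String> events = new ArrayDeque<>();

    DiscoveryEventHistory(int capacity) {
        if (capacity <= 0)
            throw new IllegalArgumentException("capacity must be positive");
        this.capacity = capacity;
    }

    /** Records one step, e.g. action "Connection check", outcome "failed: connection refused". */
    synchronized void record(String nodeId, String action, String outcome) {
        if (events.size() == capacity)
            events.removeFirst(); // Evict the oldest entry to stay within the bound.

        events.addLast("[" + System.currentTimeMillis() + "] [" + nodeId + "] "
            + action + " - " + outcome);
    }

    /** Returns a copy of the retained events, oldest first. */
    synchronized List<String> snapshot() {
        return new ArrayList<>(events);
    }

    /** Formats the retained events, one per line, for writing to the log. */
    synchronized String dump() {
        return String.join(System.lineSeparator(), events);
    }
}
```

Under these assumptions, discovery code would call record(...) at each step (message send, connection check, decision to drop the node) and write dump() to the log when the local node is segmented. The fixed capacity keeps memory use constant no matter how long the node runs.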
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)