[jira] [Commented] (IGNITE-5580) Improve node failure cause information

Andrey Gura (JIRA) Mon, 12 Mar 2018 11:14:26 -0700

    [ https://issues.apache.org/jira/browse/IGNITE-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395634#comment-16395634 ]


Andrey Gura commented on IGNITE-5580:
-------------------------------------

[~SomeFire], [~agoncharuk],

I think the scope of this issue is too broad and it should be split into 
separate issues in order to make scope management easier.

Anyway, I have some comments about these changes:

* The event gathering approach is too specific. It can be generalized because, 
I believe, it can be used by other components.
* The {{latestEventsString()}} method on {{DiscoverySpi}} looks strange to me. 
Moreover, this method does too many things: it both retrieves the events and 
builds the string.
* {{ConcurrentLinkedQueue}} isn't a good choice for event storage. Its 
{{size()}} method has O(n) complexity, and contention is possible when events 
are added or removed. IMO, a circular buffer is a more suitable data structure 
in this case.
* Event gathering is always switched on. I think we should provide a way to 
switch this feature on and off at runtime.
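
To make the circular buffer and runtime switch points more concrete, here is a 
minimal sketch of what I have in mind. It is not code from the patch or from 
Ignite: {{EventRingBuffer}}, {{setEnabled()}} and {{snapshot()}} are 
hypothetical names, and the sketch assumes a best-effort snapshot is acceptable 
for diagnostics (a reader racing with writers may observe a slot that was just 
overwritten).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

/** Hypothetical fixed-capacity buffer that keeps only the most recent events. */
final class EventRingBuffer<E> {
    /** Fixed-size slot array; old events are overwritten in place. */
    private final AtomicReferenceArray<E> slots;

    /** Total number of events ever added; the next slot is writeIdx % capacity. */
    private final AtomicLong writeIdx = new AtomicLong();

    /** Runtime switch; volatile so that a change is visible to all threads. */
    private volatile boolean enabled = true;

    EventRingBuffer(int capacity) {
        if (capacity <= 0)
            throw new IllegalArgumentException("capacity must be positive");

        slots = new AtomicReferenceArray<>(capacity);
    }

    /** Switches event gathering on or off at runtime. */
    void setEnabled(boolean enabled) {
        this.enabled = enabled;
    }

    /** Stores the event, overwriting the oldest one when the buffer is full. */
    void add(E evt) {
        if (!enabled)
            return;

        long idx = writeIdx.getAndIncrement();

        slots.set((int)(idx % slots.length()), evt);
    }

    /** Returns a best-effort copy of the retained events, oldest first. */
    List<E> snapshot() {
        int cap = slots.length();
        long end = writeIdx.get();
        long start = Math.max(0, end - cap);

        List<E> res = new ArrayList<>(cap);

        for (long i = start; i < end; i++) {
            E evt = slots.get((int)(i % cap));

            // A slot may still be empty if a writer reserved it but has not stored yet.
            if (evt != null)
                res.add(evt);
        }

        return res;
    }
}
```

Adding an event is one atomic increment plus one array store, and no {{size()}} 
traversal is needed because the capacity is fixed. Turning the snapshot into a 
string is deliberately left to the caller, so retrieval and formatting stay 
separate.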

We can start the discussion here or on the dev list.

[~agoncharuk] I think that the existing IEPs don't fit this problem and a new 
IEP should be initiated.

> Improve node failure cause information
> --------------------------------------
>
>                 Key: IGNITE-5580
>                 URL: https://issues.apache.org/jira/browse/IGNITE-5580
>             Project: Ignite
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 1.7
>            Reporter: Alexey Goncharuk
>            Assignee: Ryabov Dmitrii
>            Priority: Major
>              Labels: observability
>             Fix For: 2.5
>
>
> When a node fails, we do not print out any information about the root cause 
> of this failure. This makes it extremely hard to investigate the failure 
> causes - I need to find a previous node for the failed node and check the 
> logs on the previous node.
> I suggest that we add extensive information about the reason of the node 
> failure and the sequence of events that led to this, e.g.:
> [time] [NODE] Sending a message to next node - failed _because_ - write 
> timeout, read timeout, ...?
> [time] [NODE] Connection check - failed - why? Connection refused, handshake 
> timed out, ...?
> ...
> [time] [NODE] Decided to drop the node because of the sequence above
> Maybe we do not need to print out this information always, but we do need 
> this when troubleshooting logger is enabled.
> Also, DiscoverySpi should collect a set of latest important events and dump 
> these events in case of local node segmentation. This will allow users to 
> match the events in the cluster and events on local node and get to the 
> bottom of the failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
