[ 
https://issues.apache.org/jira/browse/IGNITE-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872091#comment-15872091
 ] 

Ivan Veselovsky commented on IGNITE-4720:
-----------------------------------------

Node logs show that some nodes failed and were excluded from the topology 
during the test. This happened at nearly the same time as the observed test 
failures:
{code} 
....
[16:19:00,784][WARN ][tcp-comm-worker-#1%null%][TcpCommunicationSpi] Connect 
timed out (consider increasing 'failureDetectionTimeout' configuration 
property) [addr=/127.0.0.1:47103, failureDetectionTimeout=10000]
[16:19:00,788][WARN ][tcp-comm-worker-#1%null%][TcpCommunicationSpi] 
Connect timed out (consider increasing 'failureDetectionTimeout' configuration 
property) [addr=/172.25.2.17:47103, failureDetectionTimeout=10000]
[16:19:00,788][WARN ][tcp-comm-worker-#1%null%][TcpCommunicationSpi] Failed 
to connect to a remote node (make sure that destination node is alive and 
operating system firewall is disabled on local and remote hosts) 
[addrs=[/127.0.0.1:47103, /172.25.2.17:47103]]
[16:19:00,789][WARN ][tcp-comm-worker-#1%null%][TcpCommunicationSpi] 
TcpCommunicationSpi failed to establish connection to node, node will be 
dropped from cluster [rmtNode=TcpDiscoveryNode 
[id=ca2c554d-6d48-4a61-abe5-d1d188cc3f53, addrs=[127.0.0.1, 172.25.2.17], 
sockAddrs=[/172.25.2.17:47503, /127.0.0.1:47503], discPort=47503, order=4, 
intOrder=4, lastExchangeTime=1487337395028, loc=false, 
ver=1.8.3#20170217-sha1:92493562, isClient=false], err=class 
o.a.i.IgniteCheckedException: Failed to connect to node (is node still alive?). 
Make sure that each ComputeTask and cache Transaction has a timeout set in 
order to prevent parties from waiting forever in case of network issues 
[nodeId=ca2c554d-6d48-4a61-abe5-d1d188cc3f53, addrs=[/127.0.0.1:47103, 
/172.25.2.17:47103]], connectErrs=[class o.a.i.IgniteCheckedException: Failed 
to connect to address: /127.0.0.1:47103, class 
o.a.i.IgniteCheckedException: Failed to connect to address: /172.25.2.17:47103]]
[16:19:00,795][WARN ][disco-event-worker-#29%null%][GridDiscoveryManager] 
Node FAILED: TcpDiscoveryNode [id=ca2c554d-6d48-4a61-abe5-d1d188cc3f53, 
addrs=[127.0.0.1, 172.25.2.17], sockAddrs=[/172.25.2.17:47503, 
/127.0.0.1:47503], discPort=47503, order=4, intOrder=4, 
lastExchangeTime=1487337395028, loc=false, ver=1.8.3#20170217-sha1:92493562, 
isClient=false]
[16:19:00,796][INFO ][disco-event-worker-#29%null%][GridDiscoveryManager] 
Topology snapshot [ver=5, servers=3, clients=0, CPUs=4, heap=4.4GB]
....
{code}
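
To line up node failures with the times of the test failures, something like 
the following could be used to pull the "Node FAILED" records out of a node 
log. This is only a rough sketch: the function name failed_nodes and the 
regular expressions are my own assumptions based on the record layout seen in 
the log above, and the sample text is an abridged excerpt of that log, not a 
complete record.

```python
import re
from datetime import datetime

# Start of a log record as it appears in the excerpt above, e.g.
# [16:19:00,795][WARN ][disco-event-worker-#29%null%][GridDiscoveryManager] ...
# (pattern inferred from that excerpt only; other log layouts may differ)
RECORD_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2},\d{3})\]"
    r"\[(?P<level>[A-Z]+)\s*\]"
    r"\[(?P<thread>[^\]]+)\]"
    r"\[(?P<logger>[^\]]+)\]\s*(?P<msg>.*)"
)
NODE_ID_RE = re.compile(r"id=([0-9a-f-]{36})")


def failed_nodes(log_text):
    """Return (time, node_id) pairs for every 'Node FAILED' record."""
    # Long records may be wrapped over several lines, so glue continuation
    # lines back onto the record they belong to. Joining with a space assumes
    # the wrapping happened at whitespace, not in the middle of a token.
    records = []
    for line in log_text.splitlines():
        line = line.strip()
        if RECORD_RE.match(line):
            records.append(line)
        elif records and line:
            records[-1] += " " + line

    result = []
    for rec in records:
        m = RECORD_RE.match(rec)
        if "Node FAILED" in m.group("msg"):
            node = NODE_ID_RE.search(m.group("msg"))
            when = datetime.strptime(m.group("time"), "%H:%M:%S,%f").time()
            result.append((when, node.group(1) if node else None))
    return result


# Abridged excerpt of the log above (some fields dropped for brevity).
sample = """\
[16:19:00,795][WARN ][disco-event-worker-#29%null%][GridDiscoveryManager]
Node FAILED: TcpDiscoveryNode [id=ca2c554d-6d48-4a61-abe5-d1d188cc3f53,
addrs=[127.0.0.1, 172.25.2.17], order=4]
[16:19:00,796][INFO ][disco-event-worker-#29%null%][GridDiscoveryManager]
Topology snapshot [ver=5, servers=3, clients=0, CPUs=4, heap=4.4GB]
"""

for when, node_id in failed_nodes(sample):
    print(when, node_id)
```

The resulting timestamps can then be compared by hand with the times at which 
the job reported wrong results.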

From the logs and configs it appears that igfs:// is used as the default file 
system, so we may need to run the same tests with another file system, e.g. 
file://, to rule out IGFS as the cause.
Similar timeout-related failures were observed in Ignite clusters under 
high load when running Map-Reduce jobs. Special configuration tuning (increased 
timeouts, etc.) was used to overcome the problem.

> Sporadically fails for Hadoop
> -----------------------------
>
>                 Key: IGNITE-4720
>                 URL: https://issues.apache.org/jira/browse/IGNITE-4720
>             Project: Ignite
>          Issue Type: Bug
>          Components: hadoop
>    Affects Versions: 1.8
>            Reporter: Irina Zaporozhtseva
>            Assignee: Ivan Veselovsky
>             Fix For: 1.9
>
>
> hadoop example aggregatewordcount under apache ignite hadoop edition grid 
> with 4 nodes for hadoop-2_6_4 and hadoop-2_7_2:
> aggregatewordcount returns 999712 instead of 1000000



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
