[
https://issues.apache.org/jira/browse/IGNITE-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872091#comment-15872091
]
Ivan Veselovsky commented on IGNITE-4720:
-----------------------------------------
Node logs show that some nodes failed and were excluded from the topology
during the test. This happened at nearly the same time as the test failures
were observed:
{code}
....
[16:19:00,784][WARN ][tcp-comm-worker-#1%null%][TcpCommunicationSpi] Connect
timed out (consider increasing 'failureDetectionTimeout' configuration
property) [addr=/127.0.0.1:47103, failureDetectionTimeout=10000]
[16:19:00,788][WARN ][tcp-comm-worker-#1%null%][TcpCommunicationSpi]
Connect timed out (consider increasing 'failureDetectionTimeout' configuration
property) [addr=/172.25.2.17:47103, failureDetectionTimeout=10000]
[16:19:00,788][WARN ][tcp-comm-worker-#1%null%][TcpCommunicationSpi] Failed
to connect to a remote node (make sure that destination node is alive and
operating system firewall is disabled on local and remote hosts)
[addrs=[/127.0.0.1:47103, /172.25.2.17:47103]]
[16:19:00,789][WARN ][tcp-comm-worker-#1%null%][TcpCommunicationSpi]
TcpCommunicationSpi failed to establish connection to node, node will be
dropped from cluster [rmtNode=TcpDiscoveryNode
[id=ca2c554d-6d48-4a61-abe5-d1d188cc3f53, addrs=[127.0.0.1, 172.25.2.17],
sockAddrs=[/172.25.2.17:47503, /127.0.0.1:47503], discPort=47503, order=4,
intOrder=4, lastExchangeTime=1487337395028, loc=false,
ver=1.8.3#20170217-sha1:92493562, isClient=false], err=class
o.a.i.IgniteCheckedException: Failed to connect to node (is node still alive?).
Make sure that each ComputeTask and cache Transaction has a timeout set in
order to prevent parties from waiting forever in case of network issues
[nodeId=ca2c554d-6d48-4a61-abe5-d1d188cc3f53, addrs=[/127.0.0.1:47103,
/172.25.2.17:47103]], connectErrs=[class o.a.i.IgniteCheckedException: Failed
to connect to address: /127.0.0.1:47103, class
o.a.i.IgniteCheckedException: Failed to connect to address: /172.25.2.17:47103]]
[16:19:00,795][WARN ][disco-event-worker-#29%null%][GridDiscoveryManager]
Node FAILED: TcpDiscoveryNode [id=ca2c554d-6d48-4a61-abe5-d1d188cc3f53,
addrs=[127.0.0.1, 172.25.2.17], sockAddrs=[/172.25.2.17:47503,
/127.0.0.1:47503], discPort=47503, order=4, intOrder=4,
lastExchangeTime=1487337395028, loc=false, ver=1.8.3#20170217-sha1:92493562,
isClient=false]
[16:19:00,796][INFO ][disco-event-worker-#29%null%][GridDiscoveryManager]
Topology snapshot [ver=5, servers=3, clients=0, CPUs=4, heap=4.4GB]
....
{code}
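To check whether node failures line up in time with the test failures, the
node logs can be scanned for "Node FAILED" entries. The following is only a
rough sketch in Python, not part of Ignite: the function name, the regular
expressions and the abbreviated sample text are assumptions based on the log
format in the excerpt above, and it will miss entries where the node id itself
is broken across lines.

```python
import re

# Prefix of one log entry in the format seen in the excerpt, e.g.
# "[16:19:00,795][WARN ][disco-event-worker-#29%null%][GridDiscoveryManager]".
# Assumption: every entry starts with time, level, thread and category.
ENTRY_RE = re.compile(
    r"\[(\d{2}:\d{2}:\d{2},\d{3})\]\[(\w+)\s*\]\[[^\]]*\]\[(\w+)\]"
)
# Node id reported by the "Node FAILED" discovery event.
NODE_ID_RE = re.compile(r"Node FAILED: TcpDiscoveryNode \[id=([0-9a-f-]+)")


def find_failed_nodes(log_text):
    """Return (timestamp, node_id) pairs for each 'Node FAILED' entry."""
    # Undo hard line wrapping so an entry spread over several lines
    # can be matched as a single string.
    flat = " ".join(log_text.split())
    matches = list(ENTRY_RE.finditer(flat))
    failed = []
    for i, m in enumerate(matches):
        # The entry body runs up to the next entry prefix (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(flat)
        hit = NODE_ID_RE.search(flat[m.end():end])
        if hit:
            failed.append((m.group(1), hit.group(1)))
    return failed


# Abbreviated sample taken from the excerpt above (some fields omitted).
SAMPLE = """\
[16:19:00,795][WARN ][disco-event-worker-#29%null%][GridDiscoveryManager]
Node FAILED: TcpDiscoveryNode [id=ca2c554d-6d48-4a61-abe5-d1d188cc3f53,
addrs=[127.0.0.1, 172.25.2.17], discPort=47503, order=4]
[16:19:00,796][INFO ][disco-event-worker-#29%null%][GridDiscoveryManager]
Topology snapshot [ver=5, servers=3, clients=0, CPUs=4, heap=4.4GB]
"""

print(find_failed_nodes(SAMPLE))
# prints [('16:19:00,795', 'ca2c554d-6d48-4a61-abe5-d1d188cc3f53')]
```

The returned timestamps can then be compared by hand with the times at which
the job produced wrong results.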
From the logs and configs it appears that igfs:// is used as the default file
system, so we may need to run the same tests with another file system,
e.g. file://, to rule out IGFS as the cause.
Similar timeout-related failures have been observed in Ignite clusters under
high load when running Map-Reduce jobs. Special configuration tuning (increased
timeouts, etc.) was used to overcome the problem.
> Sporadically fails for Hadoop
> -----------------------------
>
> Key: IGNITE-4720
> URL: https://issues.apache.org/jira/browse/IGNITE-4720
> Project: Ignite
> Issue Type: Bug
> Components: hadoop
> Affects Versions: 1.8
> Reporter: Irina Zaporozhtseva
> Assignee: Ivan Veselovsky
> Fix For: 1.9
>
>
> hadoop example aggregatewordcount under apache ignite hadoop edition grid
> with 4 nodes for hadoop-2_6_4 and hadoop-2_7_2:
> aggregatewordcount returns 999712 instead of 1000000