[ 
https://issues.apache.org/jira/browse/HADOOP-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653976#action_12653976
 ] 

Brian Bockelman commented on HADOOP-4775:
-----------------------------------------

Hey Pete,

I'll have our sysadmins try out the 4616 and 4635 patches.

There were no messages in syslog, meaning it probably didn't segfault (is this 
correct?)

Here's what the failure looks like:
http://jobrobot.web.cern.ch/JobRobot/errors_081205.html#T2_US_Nebraska
http://jobrobot.web.cern.ch/JobRobot/errors_081204.html#T2_US_Nebraska

I have a hard time believing that a memory leak alone could disconnect the 
FUSE endpoint... 1/3 of the workers have 4GB of RAM, 1/3 have 8GB, and 1/3 
have 16GB.  It would take quite a large leak to cause the problems on the 
16GB nodes.  Plus, I didn't see the OOM killer killing anything in dmesg.

I set up a debug FUSE instance on a node and hit it with a similar workflow.  
No problems at all; it may be that, in debug mode, FUSE doesn't allow multiple 
threads?

My suspicion is that either FUSE-DFS or libhdfs has a problem with error 
recovery that causes an infinite loop (like we've seen in other places).  The 
interesting thing in the "ps" output I showed above is that the fuse_dfs 
process was using 30% CPU *when nothing was using FUSE* and the node wasn't 
swapping.

Nagios now restarts FUSE-DFS whenever the problem occurs, so I don't get much 
of a chance to debug.  Still, about 7% of our jobs die because FUSE conks out 
mid-job.
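
For anyone who wants to detect the same condition, here is a rough sketch of 
the kind of check that can drive such a restart.  This is not our actual 
Nagios plugin; the function names are made up for illustration.  It relies 
only on the fact that "Transport endpoint not connected" is the error text 
for ENOTCONN, which a stat() of a dead FUSE mount point returns.

```python
import errno
import os


def fuse_mount_state(path, stat=os.stat):
    """Classify a mount point as 'ok', 'disconnected', or 'error'.

    `stat` is injectable only so the check can be exercised without a
    real broken mount; by default it is os.stat.
    """
    try:
        stat(path)
    except OSError as exc:
        # ENOTCONN is what the kernel reports once the FUSE daemon
        # behind the mount has gone away.
        if exc.errno == errno.ENOTCONN:
            return "disconnected"
        return "error"
    return "ok"


def nagios_exit_code(state):
    """Map a state to the usual Nagios plugin codes (0 = OK, 2 = CRITICAL)."""
    return 0 if state == "ok" else 2
```

A wrapper script would call fuse_mount_state() on the mount point and exit 
with nagios_exit_code(); the restart itself would be a separate event 
handler.  Note that this only catches the disconnected endpoint, not a 
fuse_dfs process that is spinning but still mounted.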

> FUSE crashes reliably on 0.19.0
> -------------------------------
>
>                 Key: HADOOP-4775
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4775
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/fuse-dfs
>            Reporter: Brian Bockelman
>            Priority: Critical
>
> Every morning I come in and find many nodes which have developed the dreaded 
> "Transport endpoint not connected" error overnight.  This has only started 
> after the 0.19.0 upgrade.

