[ https://issues.apache.org/jira/browse/HADOOP-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654768#action_12654768 ]

Brian Bockelman commented on HADOOP-4775:
-----------------------------------------

I finally found a stuck process that had a usable stack trace; it is below.

It appears that the JVM is deadlocking when hdfsPread invokes jni_NewByteArray.
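
For reference, the shape of the code this trace lands in is roughly the following. This is a simplified, hypothetical sketch of the JNI pattern only (the helper name pread_via_jni and its error handling are mine, not the actual hdfs.c:708 source); it just illustrates why a fresh Java byte[] is allocated on every pread and where that allocation can end up parked behind a GC:

#include <jni.h>

/* Hypothetical sketch of the per-call allocation pattern; not the real
 * libhdfs code. */
static jint pread_via_jni(JNIEnv *env, jobject stream,
                          jlong position, char *buffer, jint length)
{
    /* Every positioned read allocates a fresh Java byte[] of the requested
     * length (the jni_NewByteArray frame in the trace below).  Under heap
     * pressure this blocks inside the JVM (ParallelScavengeHeap::mem_allocate
     * -> VMThread::execute -> Monitor::wait) until the VM thread finishes
     * its GC operation. */
    jbyteArray jbuf = (*env)->NewByteArray(env, length);
    if (jbuf == NULL)
        return -1;                       /* OutOfMemoryError is pending */

    /* FSDataInputStream.read(long position, byte[] buf, int off, int len) */
    jclass cls = (*env)->GetObjectClass(env, stream);
    jmethodID mid = (*env)->GetMethodID(env, cls, "read", "(J[BII)I");
    if (mid == NULL) {
        (*env)->DeleteLocalRef(env, cls);
        (*env)->DeleteLocalRef(env, jbuf);
        return -1;
    }
    jint nread = (*env)->CallIntMethod(env, stream, mid,
                                       position, jbuf, (jint)0, length);

    if (nread > 0)                       /* copy the Java bytes back into C */
        (*env)->GetByteArrayRegion(env, jbuf, 0, nread, (jbyte *)buffer);

    (*env)->DeleteLocalRef(env, cls);
    (*env)->DeleteLocalRef(env, jbuf);
    return nread;
}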

So,
1) Can I somehow adjust the "DFSClient buffer per file" as Pete suggests above? I wasn't able to find a config option for this.
2) I'll be upgrading the JVM to 1.6.0-11 from 1.6.0-7.
3) I'll allocate more memory for the fuse_dfs client.
4) Is there anything else that can be done from the Hadoop side to avoid deadlocking on this call? (A sketch of one idea follows this list.)
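
To make 4) concrete, one idea from the fuse-dfs side would be to cap the size dfs_read forwards to hdfsPread, so a single FUSE request never asks the embedded JVM for more than a bounded byte[]. This is only a minimal sketch of that idea (dfs_read_capped and the 128 KB cap are names/numbers I made up, and it reduces per-call allocation pressure rather than removing the GC wait):

#include <stddef.h>
#include <sys/types.h>
#include "hdfs.h"               /* libhdfs public API: hdfsFS, hdfsFile, hdfsPread() */

#define MAX_PREAD_CHUNK (128 * 1024)    /* assumed cap; tune as needed */

static ssize_t dfs_read_capped(hdfsFS fs, hdfsFile file,
                               char *buf, size_t size, off_t offset)
{
    size_t done = 0;
    while (done < size) {
        size_t want = size - done;
        if (want > MAX_PREAD_CHUNK)
            want = MAX_PREAD_CHUNK;
        /* Each hdfsPread call makes libhdfs allocate a Java byte[] of
         * `want` bytes, so the cap bounds how much heap one FUSE read
         * can demand from the JVM at a time. */
        tSize n = hdfsPread(fs, file, (tOffset)(offset + done),
                            buf + done, (tSize)want);
        if (n < 0)
            return -1;          /* propagate the libhdfs error */
        if (n == 0)
            break;              /* end of file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}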

(gdb) bt
#0  0x0000003587b08b3a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
#1  0x0000002a95c6ba5f in Monitor::wait () from /mnt/nfs04/opt-2/osg-wn-source/jdk1.5/jre/lib/amd64/server/libjvm.so
#2  0x0000002a95ded329 in VMThread::execute () from /mnt/nfs04/opt-2/osg-wn-source/jdk1.5/jre/lib/amd64/server/libjvm.so
#3  0x0000002a95c95938 in ParallelScavengeHeap::mem_allocate () from /mnt/nfs04/opt-2/osg-wn-source/jdk1.5/jre/lib/amd64/server/libjvm.so
#4  0x0000002a9591cad7 in CollectedHeap::common_mem_allocate_noinit () from /mnt/nfs04/opt-2/osg-wn-source/jdk1.5/jre/lib/amd64/server/libjvm.so
#5  0x0000002a95db9977 in typeArrayKlass::allocate () from /mnt/nfs04/opt-2/osg-wn-source/jdk1.5/jre/lib/amd64/server/libjvm.so
#6  0x0000002a95ae7af8 in jni_NewByteArray () from /mnt/nfs04/opt-2/osg-wn-source/jdk1.5/jre/lib/amd64/server/libjvm.so
#7  0x0000002a9555a891 in hdfsPread (fs=Variable "fs" is not available.) at hdfs.c:708
#8  0x00000000004027f2 in dfs_read ()
#9  0x0000002a9566b7f2 in fuse_lib_read (req=0x70f480, ino=Variable "ino" is not available.) at fuse.c:1961
#10 0x0000002a9566ff93 in do_read (req=0x11a6b, nodeid=0, inarg=Variable "inarg" is not available.) at fuse_lowlevel.c:623
#11 0x0000002a9566edb0 in fuse_do_work (data=Variable "data" is not available.) at fuse_loop_mt.c:100
#12 0x0000003587b06137 in start_thread () from /lib64/tls/libpthread.so.0
#13 0x00000035874c9883 in clone () from /lib64/tls/libc.so.6


> FUSE crashes reliably on 0.19.0
> -------------------------------
>
>                 Key: HADOOP-4775
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4775
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/fuse-dfs
>            Reporter: Brian Bockelman
>            Priority: Critical
>         Attachments: fuse_lotsofmem_bt.txt, fuse_lotsofmem_pmap.txt
>
>
> Every morning I come in and find many nodes that have developed the dreaded
> "Transport endpoint not connected" error overnight.  This started only after
> the 0.19.0 upgrade.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
