[ https://issues.apache.org/jira/browse/HADOOP-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654768#action_12654768 ]
Brian Bockelman commented on HADOOP-4775:
-----------------------------------------

I finally found a process that was stuck which had a usable stack trace, below. It appears that the JVM is deadlocking when hdfsPread invokes jni_NewByteArray.

So,
1) Can I somehow adjust the "DFSClient buffer per file" as Pete suggests above? I wasn't able to find a config option for this.
2) I'll be upgrading the JVM to 1.6.0-11 from 1.6.0-7.
3) I'll allocate more memory for the fuse_dfs client.
4) Is there anything else which can be done from the Hadoop side to avoid deadlocking on this call?

(gdb) bt
#0  0x0000003587b08b3a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
#1  0x0000002a95c6ba5f in Monitor::wait () from /mnt/nfs04/opt-2/osg-wn-source/jdk1.5/jre/lib/amd64/server/libjvm.so
#2  0x0000002a95ded329 in VMThread::execute () from /mnt/nfs04/opt-2/osg-wn-source/jdk1.5/jre/lib/amd64/server/libjvm.so
#3  0x0000002a95c95938 in ParallelScavengeHeap::mem_allocate () from /mnt/nfs04/opt-2/osg-wn-source/jdk1.5/jre/lib/amd64/server/libjvm.so
#4  0x0000002a9591cad7 in CollectedHeap::common_mem_allocate_noinit () from /mnt/nfs04/opt-2/osg-wn-source/jdk1.5/jre/lib/amd64/server/libjvm.so
#5  0x0000002a95db9977 in typeArrayKlass::allocate () from /mnt/nfs04/opt-2/osg-wn-source/jdk1.5/jre/lib/amd64/server/libjvm.so
#6  0x0000002a95ae7af8 in jni_NewByteArray () from /mnt/nfs04/opt-2/osg-wn-source/jdk1.5/jre/lib/amd64/server/libjvm.so
#7  0x0000002a9555a891 in hdfsPread (fs=Variable "fs" is not available.
) at hdfs.c:708
#8  0x00000000004027f2 in dfs_read ()
#9  0x0000002a9566b7f2 in fuse_lib_read (req=0x70f480, ino=Variable "ino" is not available.
) at fuse.c:1961
#10 0x0000002a9566ff93 in do_read (req=0x11a6b, nodeid=0, inarg=Variable "inarg" is not available.
) at fuse_lowlevel.c:623
#11 0x0000002a9566edb0 in fuse_do_work (data=Variable "data" is not available.
) at fuse_loop_mt.c:100
#12 0x0000003587b06137 in start_thread () from /lib64/tls/libpthread.so.0
#13 0x00000035874c9883 in clone () from /lib64/tls/libc.so.6

> FUSE crashes reliably on 0.19.0
> -------------------------------
>
>                 Key: HADOOP-4775
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4775
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/fuse-dfs
>            Reporter: Brian Bockelman
>            Priority: Critical
>         Attachments: fuse_lotsofmem_bt.txt, fuse_lotsofmem_pmap.txt
>
>
> Every morning I come in and find many nodes which have developed the dreaded
> "Transport endpoint not connected" error overnight. This has only started
> after the 0.19.0 upgrade.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.