I'm seeing a glusterfs client die oddly: it segfaults (signal 11) during a 
recursive listing of the mount.

--Setup--
Client: 
Fedora 12 2.6.32.16-141.fc12.x86_64 
# rpm -qa |egrep 'fuse|glust'
fuse-2.8.4-1.fc12.x86_64
glusterfs-client-3.0.5-1.fc11.x86_64
fuse-libs-2.8.4-1.fc12.x86_64
glusterfs-common-3.0.5-1.fc11.x86_64


Servers - 6 nodes with a 3 x distribute:
Fedora 12 2.6.32.9-70.fc12.x86_64
[[email protected] ~]# rpm -qa | grep glust
glusterfs-common-3.0.5-1.fc11.x86_64
glusterfs-server-3.0.5-1.fc11.x86_64


Process:
1. Client copies a large number of files to the gluster mount
2. Client tries to do a recursive list of all files copied (ls -R)
3. Recursive list comes across a file where the checksum does not match for 
some reason (see the following log snippet)
4. Client dies horribly, and the mount point becomes invalid with the 
following error:
gluster-mount/file: Transport endpoint is not connected

I've tried to keep the snippets below as brief as possible.  If you think the 
volume definition files would help, let me know and I'll be happy to post those 
here as well.

Any help or suggestions are most welcome. 

Thanks!

---

This is the corresponding snippet from 'tail -f gluster-mount.log':

> [2010-07-21 16:34:48] N [client-protocol.c:6288:client_setvolume_cbk] 
> pdbindex2-1: Connected to 192.168.201.88:6996, attached to remote volume 
> 'brick'.

> [2010-07-21 16:35:33] E [afr.c:107:afr_set_split_brain] mirror-0: invalid 
> argument: inode
> [2010-07-21 16:35:33] E [afr-self-heal-algorithm.c:768:sh_diff_checksum_cbk] 
> mirror-0: checksum on /index.201007211105.deploy/file failed on subvolume 
> indexcopy-0 (File descriptor in bad state)
> [2010-07-21 16:35:33] E [afr-self-heal-algorithm.c:768:sh_diff_checksum_cbk] 
> mirror-0: checksum on /index.201007211105.deploy/file failed on subvolume 
> indexcopy-1 (File descriptor in bad state)
> pending frames:
> frame : type(1) op(LOOKUP)
> frame : type(1) op(LOOKUP)
> frame : type(1) op(LOOKUP)
> 
> patchset: v3.0.5
> signal received: 11
> time of crash: 2010-07-21 16:35:33
> configuration details:
> argp 1
> backtrace 1
> dlfcn 1
> fdatasync 1
> libpthread 1
> llistxattr 1
> setfsid 1
> spinlock 1
> epoll.h 1
> xattr.h 1
> st_atim.tv_nsec 1
> package-string: glusterfs 3.0.5
> /lib64/libc.so.6(+0x32740)[0x7fa9c949b740]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(+0x4b2ea)[0x7fa9c85ff2ea]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(+0x4b557)[0x7fa9c85ff557]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(+0x4be10)[0x7fa9c85ffe10]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(afr_sh_algo_diff+0x196)[0x7fa9c85fffc2]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(afr_sh_data_sync_prepare+0x256)[0x7fa9c85e9a91]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(afr_sh_data_fix+0x5db)[0x7fa9c85ea078]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(afr_sh_data_fstat_cbk+0x167)[0x7fa9c85ea34e]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/distribute.so(dht_attr_cbk+0x238)[0x7fa9c8820e08]
> /usr/lib64/glusterfs/3.0.5/xlator/protocol/client.so(client_fstat_cbk+0x178)[0x7fa9c8a59868]
> /usr/lib64/glusterfs/3.0.5/xlator/protocol/client.so(protocol_client_interpret+0x1df)[0x7fa9c8a60274]
> /usr/lib64/glusterfs/3.0.5/xlator/protocol/client.so(protocol_client_pollin+0xc6)[0x7fa9c8a60ff5]
> /usr/lib64/glusterfs/3.0.5/xlator/protocol/client.so(notify+0x158)[0x7fa9c8a6154d]
> /usr/lib64/libglusterfs.so.0(xlator_notify+0xd8)[0x7fa9c9c1b639]
> /usr/lib64/glusterfs/3.0.5/transport/socket.so(socket_event_poll_in+0x46)[0x7fa9c6f59249]
> /usr/lib64/glusterfs/3.0.5/transport/socket.so(socket_event_handler+0xc4)[0x7fa9c6f5957c]
> /usr/lib64/libglusterfs.so.0(+0x3eefc)[0x7fa9c9c40efc]
> /usr/lib64/libglusterfs.so.0(+0x3f0ee)[0x7fa9c9c410ee]
> /usr/lib64/libglusterfs.so.0(event_dispatch+0x74)[0x7fa9c9c4140d]
> /usr/sbin/glusterfs(main+0xf53)[0x406187]
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa9c9487b1d]
> /usr/sbin/glusterfs[0x402679]
> ---------

If we look at the file on each server, it exists only on search11 and 
search14, and the checksums of those two copies match:
> [16:40] ~> for i in `seq 10 15`; do echo -n "search$i: "; ssh search$i md5sum 
> /data/export/index.201007211105.deploy/file; done
> search10: md5sum: /data/export/index.201007211105.deploy/file: No such file 
> or directory
> search11: 8605b1467bece54ed7ccd13e086ee299  
> /data/export/index.201007211105.deploy/file
> search12: md5sum: /data/export/index.201007211105.deploy/file: No such file 
> or directory
> search13: md5sum: /data/export/index.201007211105.deploy/file: No such file 
> or directory
> search14: 8605b1467bece54ed7ccd13e086ee299  
> /data/export/index.201007211105.deploy/file
> search15: md5sum: /data/export/index.201007211105.deploy/file: No such file 
> or directory

If we look at the extended attributes, however, we notice that 
'trusted.posix.gen' differs between the two copies:
> for i in `seq 10 15`; do echo -n "search$i: "; ssh pdbsearch$i getfattr -d -m 
> - /data/export/index.201007211105.deploy/file; done
> search10: getfattr: /data/export/index.201007211105.deploy/file: No such file 
> or directory
> search11: getfattr: Removing leading '/' from absolute path names
> # file: data/export/index.201007211105.deploy/file
> security.selinux="unconfined_u:object_r:default_t:s0
> trusted.afr.indexcopy-0=0sAAAAAQAAAAAAAAAA
> trusted.afr.indexcopy-1=0sAAAAAQAAAAAAAAAA
> trusted.posix.gen=0sTEFukQAAAEY=
> 
> search12: getfattr: /data/export/index.201007211105.deploy/file: No such file 
> or directory
> search13: getfattr: /data/export/index.201007211105.deploy/file: No such file 
> or directory
> search14: getfattr: Removing leading '/' from absolute path names
> # file: data/export/index.201007211105.deploy/file
> security.selinux="unconfined_u:object_r:default_t:s0
> trusted.afr.indexcopy-0=0sAAAAAQAAAAAAAAAA
> trusted.afr.indexcopy-1=0sAAAAAQAAAAAAAAAA
> trusted.posix.gen=0sTEaPaAAAAAI=
> 
> search15: getfattr: /data/export/index.201007211105.deploy/file: No such file 
> or directory
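
To make the attribute values above easier to compare, here is a minimal Python 
sketch that decodes them. getfattr prints binary values in base64 with a '0s' 
prefix (or hex with '0x'). The helper names decode_xattr and as_be_u32 are 
made up for this example. Reading each value as a run of big-endian 32-bit 
words is an assumption: it matches my understanding of the trusted.afr.* 
changelog as three pending counters (data, metadata, entry), but I don't know 
the internal layout of trusted.posix.gen, so treat that split as a guess.

```python
import base64
import struct


def decode_xattr(value: str) -> bytes:
    """Turn a getfattr-style value into raw bytes.

    '0s' marks base64 and '0x' marks hex; anything else is plain text.
    """
    if value.startswith("0s"):
        return base64.b64decode(value[2:])
    if value.startswith("0x"):
        return bytes.fromhex(value[2:])
    return value.encode()


def as_be_u32(raw: bytes) -> tuple:
    """Read raw as consecutive big-endian unsigned 32-bit words.

    Assumption: the value length is a multiple of 4 bytes.
    """
    return struct.unpack(">%dI" % (len(raw) // 4), raw)


# Values copied from the getfattr output in this post.
values = {
    "search11 trusted.posix.gen": "0sTEFukQAAAEY=",
    "search14 trusted.posix.gen": "0sTEaPaAAAAAI=",
    "trusted.afr.indexcopy-0 (both copies)": "0sAAAAAQAAAAAAAAAA",
}

for name, encoded in values.items():
    raw = decode_xattr(encoded)
    print(name, raw.hex(), as_be_u32(raw))
```

If the three-counter reading of trusted.afr.* is right, both copies carry a 
non-zero first (data) counter for both indexcopy-0 and indexcopy-1, which 
would be consistent with self-heal being triggered on this file.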

