Hello,

Some weeks ago I sent a report saying that glusterfs 3.x reboots our system when we test HA by deactivating a network interface (ifconfig eth0 down). You could not reproduce it on your systems.
The reboot of our system is caused by hung_task_panic and hung_task_timeout_secs: when a task stays blocked for 120 s, the Linux kernel panics. So set hung_task_panic to 0, or hung_task_timeout_secs > 600, to leave some time.

1 - Two servers / clients in replicate mode
2 - The first server, 10.98.98.1, is the configuration server
3 - Run gluster on both servers as:
    /usr/local/sbin/glusterfsd --log-level=DEBUG --log-file=/tmpsafe/server.log -N -f /etc/glusterfs/glusterfs-server.vol
    /usr/local/sbin/glusterfs --log-level=DEBUG --log-file=/tmpsafe/client.log -N -s 10.98.98.1 /mnt/vdisk/
4 - Now on 10.98.98.1, do an ifconfig eth0 down.
5 - On 10.98.98.10, after a short timeout, ls /mnt/vdisk comes back (using 10.98.98.10 as server)
6 - On 10.98.98.1, ls /mnt/vdisk hangs forever
7 - On 10.98.98.1, kill the glusterfs client and rerun glusterfs; then ls /mnt/vdisk works again (using 10.98.98.1 as server)

During step 6, nothing is logged on the server or the client on 10.98.98.1 (logs below).

Regards,
Nicolas Prochazka.

-----------------------------------------------
#This file is auto generated, not edit ( Nicolas Prochazka Sep 2009)
# ------------- Create Brick blade definition
volume 10.98.98.1
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.98.98.1
  option transport.socket.nodelay on
  option remote-subvolume brick
end-volume
volume 10.98.98.10
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.98.98.10
  option transport.socket.nodelay on
  option remote-subvolume brick
end-volume
# ------------- Create Brick Replicate definition
# ------------- Create Distribute definition
volume last
  type cluster/distribute
  subvolumes 10.98.98.1 10.98.98.10
end-volume
volume iothreads
  type performance/io-threads
  option thread-count 8
  subvolumes last
end-volume
volume io-cache
  type performance/io-cache
  option cache-size 2GB # default is 32MB
  option cache-timeout 5 # default is 1
  subvolumes iothreads
end-volume
volume writebehind
  type performance/write-behind
  option cache-size 4MB
  subvolumes io-cache
end-volume

DEV-10.98.98.1:~# cat /etc/glusterfs/glusterfs-server.vol
volume brickless
  type storage/posix
  option directory /mnt/disks/export
end-volume
volume brickthread
  type features/locks
  subvolumes brickless
end-volume
volume brickcache
  type performance/io-cache
  option cache-size 2GB # default is 32MB
  option cache-timeout 2 # default is 1
  subvolumes brickthread
end-volume
volume brick
  type performance/io-threads
  option thread-count 8
  subvolumes brickcache
end-volume
volume server
  type protocol/server
  subvolumes brick
  option client-volume-filename /etc/glusterfs/Gglusterfs-client.vol
  option transport-type tcp
  option transport.socket.nodelay on
  option verify-volfile-checksum no
  option auth.addr.brick.allow 10.98.98.*
end-volume

Log of the client on 10.98.98.10; everything seems to be OK:
[2010-03-29 12:48:04] E [client-protocol.c:415:client_ping_timer_expired] 10.98.98.1: Server 10.98.98.1:6996 has not responded in the last 42 seconds, disconnecting.
[2010-03-29 12:48:04] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.1: forced unwinding frame type(1) op(STATFS)
[2010-03-29 12:48:04] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.1: forced unwinding frame type(2) op(PING)
[2010-03-29 12:48:04] D [client-protocol.c:537:client_ping_cbk] 10.98.98.1: timer must have expired
[2010-03-29 12:48:04] N [client-protocol.c:6994:notify] 10.98.98.1: disconnected
[2010-03-29 12:48:06] E [socket.c:762:socket_connect_finish] 10.98.98.1: connection to 10.98.98.1:6996 failed (No route to host)
[2010-03-29 12:48:09] E [socket.c:762:socket_connect_finish] 10.98.98.1: connection to 10.98.98.1:6996 failed (No route to host)

Log on 10.98.98.1:
[2010-03-29 16:30:17] D [dht-diskusage.c:71:dht_du_info_cbk] last: on subvolume '10.98.98.1': avail_percent is: 99.00 and avail_space is: 15069396992
[2010-03-29 16:30:17] N [client-protocol.c:6246:client_setvolume_cbk] 10.98.98.1: Connected to 10.98.98.1:6996, attached to remote volume 'brick'.
[2010-03-29 16:30:17] N [client-protocol.c:6246:client_setvolume_cbk] 10.98.98.10: Connected to 10.98.98.10:6996, attached to remote volume 'brick'.
[2010-03-29 16:30:17] N [client-protocol.c:6246:client_setvolume_cbk] 10.98.98.10: Connected to 10.98.98.10:6996, attached to remote volume 'brick'.
[2010-03-29 16:30:17] D [dht-diskusage.c:71:dht_du_info_cbk] last: on subvolume '10.98.98.1': avail_percent is: 99.00 and avail_space is: 15069396992
[2010-03-29 16:30:17] D [dht-diskusage.c:71:dht_du_info_cbk] last: on subvolume '10.98.98.10': avail_percent is: 99.00 and avail_space is: 88316628992
[2010-03-29 16:30:17] D [dht-diskusage.c:71:dht_du_info_cbk] last: on subvolume '10.98.98.10': avail_percent is: 99.00 and avail_space is: 88316628992
[2010-03-29 16:30:21] D [dht-layout.c:576:dht_layout_normalize] last: found anomalies in /iso. holes=1 overlaps=0
[2010-03-29 16:30:21] D [dht-common.c:164:dht_lookup_dir_cbk] last: fixing assignment on /iso
[2010-03-29 16:30:21] D [dht-layout.c:576:dht_layout_normalize] last: found anomalies in /ha. holes=1 overlaps=0
[2010-03-29 16:30:21] D [dht-common.c:164:dht_lookup_dir_cbk] last: fixing assignment on /ha
[2010-03-29 16:30:21] D [dht-layout.c:576:dht_layout_normalize] last: found anomalies in /monitoring. holes=1 overlaps=0
[2010-03-29 16:30:21] D [dht-common.c:164:dht_lookup_dir_cbk] last: fixing assignment on /monitoring

nothing during hang

restart
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(LOOKUP)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(LOOKUP)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(LOOKUP)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(LOOKUP)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(LOOKUP)
[2010-03-29 16:58:26] D [socket.c:1326:socket_submit] 10.98.98.10: not connected (priv->connected = 255)
[2010-03-29 16:58:26] D [dht-common.c:1590:dht_fd_cbk] last: subvolume 10.98.98.10 returned -1 (Transport endpoint is not connected)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] D [dht-common.c:1590:dht_fd_cbk] last: subvolume 10.98.98.10 returned -1 (Transport endpoint is not connected)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(2) op(PING)
[2010-03-29 16:58:26] D [client-protocol.c:537:client_ping_cbk] 10.98.98.10: timer must have expired
[2010-03-29 16:58:29] E [socket.c:762:socket_connect_finish] 10.98.98.10: connection to 10.98.98.10:6996 failed (No route to host)
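
To make the burst of "forced unwinding frame" entries above easier to compare between the two nodes, here is a small Python sketch that counts them per subvolume and operation. It is only an illustration: the regular expression is derived from the log lines pasted in this message, the function name summarize_unwinds is hypothetical, and other glusterfs versions may format these lines differently.

```python
import re
import sys
from collections import Counter

# Pattern inferred from the log lines in this report, e.g.
# [2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10: forced unwinding frame type(1) op(STATFS)
# Assumption: other glusterfs releases may use a different layout.
UNWIND_RE = re.compile(
    r"^\[[^\]]+\] E \[saved-frames\.c:\d+:saved_frames_unwind\] "
    r"(?P<subvol>[^\s:]+): forced unwinding frame "
    r"type\((?P<type>\d+)\) op\((?P<op>\w+)\)"
)


def summarize_unwinds(lines):
    """Count forced-unwind entries, keyed by (subvolume, operation)."""
    counts = Counter()
    for line in lines:
        match = UNWIND_RE.match(line.strip())
        if match:
            counts[(match.group("subvol"), match.group("op"))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize_unwinds.py /tmpsafe/client.log
    with open(sys.argv[1]) as log_file:
        for (subvol, op), n in sorted(summarize_unwinds(log_file).items()):
            print(f"{subvol} {op} {n}")
```

Run against /tmpsafe/client.log on each node, this prints one line per subvolume and operation, which shows at a glance which calls were still pending when the client was restarted.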
_______________________________________________
Gluster-devel mailing list
[email protected]
http://lists.nongnu.org/mailman/listinfo/gluster-devel
