Andrew, I just want to throw something out there, since I think you're being
rather harsh toward the gluster community. I've found that the developers are
very dedicated and very motivated to solve problems and debug the product.
However, if you're not paying them for support, your right to complain about
response times, or to expect YOUR specific issues to be prioritized, diminishes
severely. If you felt you weren't getting proper responses from gluster-users,
it's likely that none of us had an easy answer for you, or that no one was
really interested in solving that problem because they didn't see a parallel
to their own problems (such is the nature of open-source and community support
groups).
If you need to use something in a mission-critical production environment,
for heaven's sake PAY FOR SUPPORT! You can't operate mission-critical
applications on an "ask anyone around and hope they can help me in time" basis.
The developers are very responsive on the list, but they also have other work
to do, they get to take time off, and the one(s) who can solve your specific
issue may not be checking the list every minute of every day. (Personally, I'm
glad, because they should be working on the product, not sitting around waiting
for people to submit problems to them.)
Again, pay for support and you'll get dedicated resources to solve your
problem. Don't pay, and we as a community will be happy to help when and if
we can.
my .02,
Keith
At 06:27 AM 3/18/2009, Andrew McGill wrote:
Replying to myself, since nobody else did: I'm
outta here for now. Whether by network errors
or software errors, or plain old stupidity, I
can no longer maintain a stable glusterfs
mount. Due to the batch nature of the
application I'm running (rdiff-backup), this is
a fatal failure. I've now decommissioned my
failing glusterfs installation. It's a pity
though - although it was horribly slow because
of inappropriate hardware, it did work (until it
flaked out fatally). For decommissioning, it
turns out that having your data stored without
metadata is a good design choice. Currently
glusterfs needs some work in terms of recovering
from network and server errors. (I think that
the developers should be made to run it over
10Mbps ethernet hubs for a month.) Currently
glusterfs is a high capacity (given appropriate
hardware), but not a high availability solution.
On Saturday 14 March 2009 19:12:03 Andrew McGill wrote:
> Hello,
>
> I upgraded from glusterfs-1.3.12.tar.gz to glusterfs-2.0.0rc4.tar.gz
> because I could not complete a rdiff-backup without inexplicable errors.
> My efforts have been rewarded with a crash, for which some logs are
> displayed below.
>
> The backend is unify, with multiple afr subvolumes of two ext3 volumes each.
>
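(For context: a unify-over-AFR layout of this kind is declared in a volume
spec file in the 1.3/2.0-era configuration format. The sketch below is only a
rough illustration of such a layout; the volume names, hosts, and scheduler
are invented placeholders, not the actual configuration from this report.)

  # Hypothetical client volfile: two protocol/client bricks mirrored by
  # AFR, with the AFR pairs aggregated by unify. Names are placeholders.
  volume brick1
    type protocol/client
    option transport-type tcp          # tcp/client in the 1.3.x syntax
    option remote-host server1
    option remote-subvolume posix1
  end-volume

  volume brick2
    type protocol/client
    option transport-type tcp
    option remote-host server2
    option remote-subvolume posix2
  end-volume

  volume afr1
    type cluster/afr
    subvolumes brick1 brick2
  end-volume

  # ...further afr pairs (afr2, afr3, ...) defined the same way...

  volume ns
    type protocol/client
    option transport-type tcp
    option remote-host server1
    option remote-subvolume posix-ns   # namespace brick required by unify
  end-volume

  volume unify0
    type cluster/unify
    option scheduler rr
    option namespace ns
    subvolumes afr1 afr2
  end-volume

Each AFR pair mirrors two ext3 backends and unify presents the pairs as one
namespace, so a full brick or a dropped connection in any one pair can surface
as errors on the unified mount, which matches the pattern in the logs below.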
> Here is how the client side of glusterfs died:
>
> 2009-03-14 13:32:15 W [afr-self-heal-data.c:798:afr_sh_data_fix] afr4: Picking favorite child u100-rs1 as authentic source to resolve conflicting data of /backup5/robbie.foo.co.za/rdiff-backup-data/mirror_metadata.2009-03-14T08:00:17+02:00.snapshot.gz
> 2009-03-14 13:32:15 W [afr-self-heal-data.c:646:afr_sh_data_open_cbk] afr4: sourcing file /backup5/robbie.foo.co.za/rdiff-backup-data/mirror_metadata.2009-03-14T08:00:17+02:00.snapshot.gz from u100-rs1 to other sinks
> 2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] u50-dcc1: readv failed (Bad address)
> 2009-03-14 13:32:22 E [socket.c:634:__socket_proto_state_machine] u50-dcc1: read (Bad address) in state 3 (192.168.227.65:6996)
> 2009-03-14 13:32:22 E [saved-frames.c:169:saved_frames_unwind] u50-dcc1: forced unwinding frame type(1) op(READ)
> 2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] u50-dr1: readv failed (Bad address)
> 2009-03-14 13:32:22 E [socket.c:634:__socket_proto_state_machine] u50-dr1: read (Bad address) in state 3 (192.168.227.31:6996)
> 2009-03-14 13:32:22 E [saved-frames.c:169:saved_frames_unwind] u50-dr1: forced unwinding frame type(1) op(READ)
> 2009-03-14 13:32:22 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse: 5998294: READ => -1 (Transport endpoint is not connected)
> 2009-03-14 13:33:03 E [socket.c:102:__socket_rwv] u50-dr2: readv failed (Bad address)
> 2009-03-14 13:33:03 E [socket.c:634:__socket_proto_state_machine] u50-dr2: read (Bad address) in state 3 (192.168.227.32:6996)
> 2009-03-14 13:33:03 E [saved-frames.c:169:saved_frames_unwind] u50-dr2: forced unwinding frame type(1) op(READ)
> 2009-03-14 13:33:03 E [socket.c:102:__socket_rwv] u50-rs3: readv failed (Bad address)
> 2009-03-14 13:33:03 E [socket.c:634:__socket_proto_state_machine] u50-rs3: read (Bad address) in state 3 (192.168.227.59:6996)
> 2009-03-14 13:33:03 E [saved-frames.c:169:saved_frames_unwind] u50-rs3: forced unwinding frame type(1) op(READ)
> 2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse: 6006118: READ => -1 (Transport endpoint is not connected)
> 2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse: 6006119: READ => -1 (Transport endpoint is not connected)
> 2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse: 6006120: READ => -1 (Transport endpoint is not connected)
> 2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse: 6006121: READ => -1 (Transport endpoint is not connected)
> pending frames:
> patchset: cb602a1d7d41587c24379cb2636961ab91446f86 +
> signal received: 6
> configuration details:
> argp 1
> backtrace 1
> dlfcn 1
> fdatasync 1
> libpthread 1
> llistxattr 1
> setfsid 1
> spinlock 1
> epoll.h 1
> xattr.h 1
> st_atim.tv_nsec 1
> package-string: glusterfs 2.0.0rc4
> [0x381420]
> /lib/libc.so.6(abort+0x101)[0xb86451]
> /usr/lib/glusterfs/2.0.0rc4/xlator/mount/fuse.so[0x54b9a8]
> /lib/libpthread.so.0[0xd302db]
> /lib/libc.so.6(clone+0x5e)[0xc2912e]
> ---------
>
> On the server side, the following messages don't enlighten me, but do
> remind me that there was another client running version 1.13 still
> connecting. It looks like the server just noticed that the client died.
>
> 2009-03-14 13:30:03 E [socket.c:583:__socket_proto_state_machine] server: socket header validate failed (192.168.227.167:1023). possible mismatch of transport-type between server and client volumes, or version mismatch
> 2009-03-14 13:30:03 N [server-protocol.c:8048:notify] server: 192.168.227.167:1023 disconnected
> 2009-03-14 13:31:45 E [socket.c:463:__socket_proto_validate_header] server: socket header signature does not match :O (42.6c.6f)
> 2009-03-14 13:31:45 E [socket.c:583:__socket_proto_state_machine] server: socket header validate failed (192.168.227.167:1023). possible mismatch of transport-type between server and client volumes, or version mismatch
> 2009-03-14 13:31:45 N [server-protocol.c:8048:notify] server: 192.168.227.167:1023 disconnected
> 2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] server: readv failed (Connection reset by peer)
> 2009-03-14 13:32:22 E [socket.c:561:__socket_proto_state_machine] server: read (Connection reset by peer) in state 1 (192.168.227.5:1020)
> 2009-03-14 13:32:22 N [server-protocol.c:8048:notify] server: 192.168.227.5:1020 disconnected
> 2009-03-14 13:32:22 N [server-protocol.c:7295:mop_setvolume] server: accepted client from 192.168.227.5:1020
> 2009-03-14 13:35:48 N [server-protocol.c:8048:notify] server: 192.168.227.5:1017 disconnected
> 2009-03-14 13:35:48 N [server-protocol.c:8048:notify] server: 192.168.227.5:1020 disconnected
> 2009-03-14 13:35:48 N [server-helpers.c:515:server_connection_destroy] server: destroyed connection of backup5.foo.com-23205-2009/03/14-07:10:52:777008-u50-dcc1
>
> On another server brick, the 25Gb volume u50-dr1-raw was full (it should
> have been 50Gb like its peer). As I recall, the free space on the second
> volume of an AFR pair does not get checked (a bug, IMHO).
>
> It said this, which could have led to the client-side failure a few minutes
> later (the clocks are in sync):
>
> 2009-03-14 13:30:23 W [posix.c:773:posix_mkdir] u50-dr1-raw: mkdir of /backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1: No space left on device
> 2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server: 1109657: INODELK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1) on u50-dr1 returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server: 1109658: INODELK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1) on u50-dr1 returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server: 3184942: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1) on u50-dr1 for key <nul> returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server: 3184943: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1) on u50-dr1 for key <nul> returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server: 3184947: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1) on u50-dr1 for key hl returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server: 3184949: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1) on u50-dr1 for key hl returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server: 1109660: INODELK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1/hl) on u50-dr1 returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server: 1109661: INODELK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1/hl) on u50-dr1 returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server: 3184952: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1/hl) on u50-dr1 for key <nul> returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server: 3184953: ENTRYLK (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1/hl) on u50-dr1 for key <nul> returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:2774:server_stub_resume] server: 1109663: XATTROP (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1) on u50-dr1 returning error: -1 (2)
> 2009-03-14 13:30:23 E [server-protocol.c:2868:server_stub_resume] server: 1109665: RMDIR (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1) on u50-dr1 returning error: -1 (2)
> 2009-03-14 13:31:45 E [socket.c:463:__socket_proto_validate_header] server: socket header signature does not match :O (42.6c.6f)
> 2009-03-14 13:31:45 E [socket.c:583:__socket_proto_state_machine] server: socket header validate failed (192.168.227.167:1022). possible mismatch of transport-type between server and client volumes, or version mismatch
> 2009-03-14 13:31:45 N [server-protocol.c:8048:notify] server: 192.168.227.167:1022 disconnected
> 2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] server: readv failed (Connection reset by peer)
> 2009-03-14 13:32:22 E [socket.c:561:__socket_proto_state_machine] server: read (Connection reset by peer) in state 1 (192.168.227.5:1016)
> 2009-03-14 13:32:22 N [server-protocol.c:8048:notify] server: 192.168.227.5:1016 disconnected
> 2009-03-14 13:32:23 N [server-protocol.c:7295:mop_setvolume] server: accepted client from 192.168.227.5:1016
>
> I may have to move the backup in question off glusterfs (if I can just find
> the space somewhere), since it has taken 4 days to realise that the backing
> up is not just slow, but faulty. (Of course, if I can't fix it, I'll win a
> trip to the data center to install a new machine to replace the system.)
_______________________________________________
Gluster-users mailing list
[email protected]
http://zresearch.com/cgi-bin/mailman/listinfo/gluster-users