[Gluster-users] Gluster 2.6 and infiniband
Hello, i have a problem with gluster 3.2.6 and infiniband. With gluster 3.3 it's working OK, but with 3.2.6 i have the following problem: when i'm trying to mount an rdma volume using the command

mount -t glusterfs 192.168.100.1:/atlas1.rdma mount

i get:

[2012-06-07 04:30:18.894337] I [glusterfsd.c:1493:main] 0-/usr/local/sbin/glusterfs: Started running /usr/local/sbin/glusterfs version 3.2.6
[2012-06-07 04:30:18.907499] E [glusterfsd-mgmt.c:628:mgmt_getspec_cbk] 0-glusterfs: failed to get the 'volume file' from server
[2012-06-07 04:30:18.907592] E [glusterfsd-mgmt.c:695:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:/atlas1.rdma)
[2012-06-07 04:30:18.907995] W [glusterfsd.c:727:cleanup_and_exit] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0xc9) [0x7f784e2c8bc9] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5) [0x7f784e2c8975] (-->/usr/local/sbin/glusterfs(mgmt_getspec_cbk+0x28b) [0x40861b]))) 0-: received signum (0), shutting down
[2012-06-07 04:30:18.908049] I [fuse-bridge.c:3727:fini] 0-fuse: Unmounting 'mount'.

The same command without .rdma works OK.

thanks
Matus
Re: [Gluster-users] Gluster 2.6 and infiniband
On 06/07/2012 02:04 PM, bxma...@gmail.com wrote:
 Hello, i have a problem with gluster 3.2.6 and infiniband. [...] when i'm trying to mount rdma volume using command mount -t glusterfs 192.168.100.1:/atlas1.rdma mount i get: [...] failed to fetch volume file (key:/atlas1.rdma) [...] Same command without .rdma works ok.

Is the volume's transport type only 'rdma', or 'tcp,rdma'? If it is only 'rdma', then appending .rdma to the volume name is not required. Appending .rdma is only required when both transport types are enabled on a volume (ie, 'tcp,rdma'), so that from the client you can decide which transport you want to mount: the plain volume name points to the 'tcp' transport type, and appending .rdma points to the rdma transport type.

Hope that is clear now.

Regards,
Amar
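[Editor's sketch] A rough illustration of the two cases Amar describes; the brick path and mount points below are made up for the example:

# volume created with both transports -- the .rdma suffix selects the transport at mount time
gluster volume create atlas1 transport tcp,rdma 192.168.100.1:/export/atlas1
gluster volume start atlas1
mount -t glusterfs 192.168.100.1:/atlas1 /mnt/atlas1-tcp        # tcp transport (default name)
mount -t glusterfs 192.168.100.1:/atlas1.rdma /mnt/atlas1-rdma  # rdma transport

# volume created with transport rdma only -- mount by the plain volume name, no .rdma suffix
gluster volume create atlas1 transport rdma 192.168.100.1:/export/atlas1
mount -t glusterfs 192.168.100.1:/atlas1 /mnt/atlas1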
Re: [Gluster-users] Gluster 2.6 and infiniband
Hello,

at first it was tcp, then tcp,rdma. You are right that without the tcp definition .rdma is not working.

But now i have another problem. I'm trying tcp / rdma, i'm even trying tcp/rdma using a normal network card (not the infiniband IP but a normal 1gbit network card) and i still get the same speed: upload about 30mb/s and download about 200mb/s ... so i'm not sure if rdma is even working. Native infiniband gives me 3500mb/s with benchmark tests (ib_rdma_bw).

thanks
Matus

2012/6/7 Amar Tumballi ama...@redhat.com:
 Is the volume's transport type only 'rdma', or 'tcp,rdma'? [...] the default volume name points to the 'tcp' transport type, and appending .rdma points to the rdma transport type. [...]
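[Editor's sketch] One way to check whether the rdma transport is actually in use; paths follow the 3.2-style /etc/glusterd layout mentioned later in this thread, and the volume name and IB device name (mlx4_0) are assumptions:

# the client volfile fetched for the .rdma name should use the rdma transport
grep transport-type /etc/glusterd/vols/atlas1/atlas1.rdma-fuse.vol

# while copying data, the IB port counters should be moving if RDMA is carrying the traffic
cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_data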
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
Hi,

Sorry, this reply won't be of any help to your problem, but I am too curious to understand how it can be even slower when mounting using the Gluster client, which I would expect to always be quicker than NFS or anything else. If you find the reason, post it back to the list and share it with us please. I think this directory index issue has been reported already for systems with many files.

Regards,
Fernando

From: gluster-users-boun...@gluster.org [mailto:gluster-users-boun...@gluster.org] On Behalf Of olav johansen
Sent: 07 June 2012 03:32
To: gluster-users@gluster.org
Subject: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)

Hi,

I'm using Gluster 3.3.0-1.el6.x86_64, on two storage nodes, replicated mode (fs1, fs2).

Node specs: CentOS 6.2, Intel Quad Core 2.8Ghz, 4Gb ram, 3ware raid, 2x500GB sata 7200rpm (RAID1 for os), 6x1TB sata 7200rpm (RAID10 for /data), 1Gbit network.

I've mounted the data partition to web1, a Dual Quad 2.8Ghz, 8Gb ram, using glusterfs (also tried NFS - Gluster mount).

We have 50Gb of files, ~800'000 files in 3 levels of directories (max 2000 directories in one folder).

My main problem is speed of directory indexes: ls -alR on the gluster mount takes 23 minutes every time. There doesn't seem to be any directory listing information cache; with regular NFS (not gluster) between web1-fs1, this takes 6m13s the first time, and 5m13s thereafter.

The Gluster mount is 4+ times slower for directory indexing performance vs pure NFS to a single server; is this as expected? I understand there are a lot more calls involved checking both nodes, but I'm just looking for a reality check regarding this.

Any suggestions of how I can speed this up?

Thanks,
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
On Thu, Jun 07, 2012 at 10:10:03AM +0000, Fernando Frediani (Qube) wrote:
 Sorry this reply won't be of any help to your problem, but I am too curious to understand how it can be even slower if mounting using Gluster client which I would expect always be quicker than NFS or anything else.

(1) Try it with ls -aR or find . instead of ls -alR

(2) Try it on a gluster non-replicated volume (for fair comparison with direct NFS access)

With a replicated volume, many accesses involve sending queries to *both* servers to check they are in sync - even read accesses. This in turn can cause disk seeks on both machines, so the latency you'll get is the larger of the two. If you are doing lots of accesses sequentially then the latencies will all add up.

A stat() is one of those accesses which touches both machines, and ls -l forces a stat() of each file found. In fact, a quick test suggests ls -l does stat, lstat, getxattr and lgetxattr:

$ strace ls -laR . >/dev/null 2>ert; cut -f1 -d'(' ert | sort | uniq -c
     13 access
      1 arch_prctl
      5 brk
    395 close
      4 connect
      1 execve
      1 exit_group
      2 fcntl
    391 fstat
      3 futex
    702 getdents
      1 getrlimit
   1719 getxattr
      3 ioctl
   1721 lgetxattr
      9 lseek
   1721 lstat
     58 mmap
     24 mprotect
     12 munmap
    424 open
     19 read
      2 readlink
      2 rt_sigaction
      1 rt_sigprocmask
      1 set_robust_list
      1 set_tid_address
      4 socket
   1719 stat
      1 statfs
     29 write

Looking at the detail in the strace output, I see these are actually

lstat(target-file, ...)
lgetxattr(target-file, security.selinux, ...)
getxattr(target-file, system.posix_acl_access, ...)
stat(/etc/localtime, ...)

Compare without -l:

$ strace ls -aR . >/dev/null 2>ert; cut -f1 -d'(' ert | sort | uniq -c
      9 access
      1 arch_prctl
      4 brk
    377 close
      1 execve
      1 exit_group
      1 fcntl
    376 fstat
      3 futex
    702 getdents
      1 getrlimit
      3 ioctl
     39 mmap
     16 mprotect
      4 munmap
    388 open
     11 read
      2 rt_sigaction
      1 rt_sigprocmask
      1 set_robust_list
      1 set_tid_address
      1 stat
      1 statfs
      9 write

Regards,
Brian.
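[Editor's sketch] A rough illustration of suggestion (2) above -- a single-brick, non-replicated volume to time against plain NFS; the brick path and mount point are made up:

gluster volume create plaintest fs1:/data/plainbrick
gluster volume start plaintest
mount -t glusterfs fs1:/plaintest /mnt/plaintest
time ls -alR /mnt/plaintest > /dev/null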
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
Hello there.

That's really interesting, because we are thinking about using GlusterFS too, with a similar setup/scenario.

I read about a really strange setup with a GlusterFS native client mount on the web servers and an NFS mount on top of that, so you get GlusterFS failover + NFS caching. Can't find the link right now.

- Original Message -
From: olav johansen luxis2...@gmail.com
To: gluster-users@gluster.org
Sent: Thursday, June 7, 2012 8:02:14 AM
Subject: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)

Hi, I'm using Gluster 3.3.0-1.el6.x86_64, on two storage nodes, replicated mode (fs1, fs2). [...] My main problem is speed of directory indexes: ls -alR on the gluster mount takes 23 minutes every time. [...] Any suggestions of how I can speed this up? Thanks,
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
Here's the link: http://community.gluster.org/a/nfs-performance-with-fuse-client-redundancy/

Sent again with a reply to all.

Gerald

- Original Message -
From: Christian Meisinger em_got...@gmx.net
To: olav johansen luxis2...@gmail.com
Cc: gluster-users@gluster.org
Sent: Thursday, June 7, 2012 7:00:14 AM
Subject: Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)

Hello there. That's really interesting, because we are thinking about using GlusterFS too, with a similar setup/scenario. I read about a really strange setup with a GlusterFS native client mount on the web servers and an NFS mount on top of that, so you get GlusterFS failover + NFS caching. Can't find the link right now. [...]
Re: [Gluster-users] Gluster 2.6 and infiniband
To make a long story short, I made rdma client connect files and mounted with them directly:

#/etc/glusterd/vols/pirdist/pirdist.rdma-fuse.vol /pirdist glusterfs transport=rdma 0 0
#/etc/glusterd/vols/pirstripe/pirstripe.rdma-fuse.vol /pirstripe glusterfs transport=rdma 0 0

The transport=rdma does nothing here, since it reads the parameters from the .vol files. However, you'll see that they're now commented out, since RDMA has been very unstable for us: servers lose their connections to each other, which somehow causes the gbe clients to lose their connections as well. IP over IB however is working great, albeit at the expense of some performance vs RDMA, but it's still much better than gbe.

On Thu, Jun 7, 2012 at 4:25 AM, bxma...@gmail.com bxma...@gmail.com wrote:
 Hello, at first it was tcp then tcp,rdma. You are right that without tcp definition .rdma is not working. But now i have another problem. [...] i'm not sure if rdma is even working. Native infiniband is giving me 3500mb/s speed with benchmark tests (ib_rdma_bw). [...]
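[Editor's sketch] For context, the protocol/client section of such an .rdma-fuse.vol file typically looks something like the following; the volume, server and brick names here are illustrative only:

volume pirdist-client-0
    type protocol/client
    option remote-host server1
    option remote-subvolume /export/pirdist
    option transport-type rdma
end-volume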
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
Brian,

Small correction to "sending queries to *both* servers to check they are in sync - even read accesses": read fops like stat/getxattr etc. are sent to only one brick.

Pranith.

- Original Message -
From: Brian Candler b.cand...@pobox.com
To: Fernando Frediani (Qube) fernando.fredi...@qubenet.net
Cc: olav johansen luxis2...@gmail.com, gluster-users@gluster.org gluster-users@gluster.org
Sent: Thursday, June 7, 2012 4:24:37 PM
Subject: Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)

(1) Try it with ls -aR or find . instead of ls -alR (2) Try it on a gluster non-replicated volume (for fair comparison with direct NFS access) With a replicated volume, many accesses involve sending queries to *both* servers to check they are in sync - even read accesses. [...] A stat() is one of those accesses which touches both machines, and ls -l forces a stat() of each file found. [...] Regards, Brian.
[Gluster-users] Export Gluster backed volume with standard NFS
Hey everyone,

I currently have an NFS server that I need to make highly available. I was thinking I would use Gluster, but since there's no way to match Gluster's built-in NFS server to my current NFS exports file, I can't use the Gluster NFS server. So I was thinking I could have two bricks running replicated Gluster volumes, have them mount the Gluster volume from themselves with the FUSE client (which would give them automatic failover), and then use standard NFS to re-export the mounts I need.

Is anyone already doing that, or anything like it? Any problems or performance issues?

Thanks

Scot Kreienkamp
skre...@la-z-boy.com
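[Editor's sketch] A rough outline of that layout on one of the two nodes; the volume name, export path and client network are assumptions, and the kernel NFS server generally needs an explicit fsid= when exporting a FUSE mount:

# /etc/fstab: mount the replicated volume locally via the FUSE client
localhost:/datavol  /export/datavol  glusterfs  defaults,nobootwait  0 0

# /etc/exports: re-export the FUSE mount with the standard kernel NFS server
/export/datavol  192.168.0.0/24(rw,sync,no_subtree_check,fsid=14)

# reload the export table
exportfs -ra

Note that Gluster's built-in NFS server and the kernel NFS server both register with the portmapper, so on these nodes the built-in one would likely have to be turned off first (e.g. gluster volume set datavol nfs.disable on) before starting the kernel nfsd.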
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
On Thu, Jun 07, 2012 at 08:34:56AM -0400, Pranith Kumar Karampuri wrote:
 Brian, Small correction: 'sending queries to *both* servers to check they are in sync - even read accesses.' Read fops like stat/getxattr etc are sent to only one brick.

Is that new behaviour for 3.3? My understanding was that stat() was a healing operation.

http://gluster.org/community/documentation/index.php/Gluster_3.2:_Triggering_Self-Heal_on_Replicate

If this is no longer true, then I'd like to understand what happens after a node has been down and comes up again. I understand there's a self-healing daemon in 3.3, but what if you try to access a file which has not yet been healed?

I'm interested in understanding this, especially the split-brain scenarios (better to understand them *before* you're stuck in a problem :-)

BTW I'm in the process of building a 2-node 3.3 test cluster right now.

Cheers,
Brian.
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
hi Brian,

The 'stat' command comes in to the gluster mount as the fop (file-operation) 'lookup', and that is what triggers self-heal, so the behavior is still the same. I was referring to the fop 'stat', which will be performed on only one of the bricks. Unfortunately most of the commands and fops have the same name.

Following are some examples of read-fops:
 .access
 .stat
 .fstat
 .readlink
 .getxattr
 .fgetxattr
 .readv

Pranith.

- Original Message -
From: Brian Candler b.cand...@pobox.com
To: Pranith Kumar Karampuri pkara...@redhat.com
Cc: olav johansen luxis2...@gmail.com, gluster-users@gluster.org, Fernando Frediani (Qube) fernando.fredi...@qubenet.net
Sent: Thursday, June 7, 2012 7:06:26 PM
Subject: Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)

Is that new behaviour for 3.3? My understanding was that stat() was a healing operation. [...] I understand there's a self-healing daemon in 3.3, but what if you try to access a file which has not yet been healed? [...] Cheers, Brian.
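[Editor's sketch] For what it's worth, the usual ways of triggering heals both boil down to generating lookups; the volume name and mount point below are examples only:

# 3.2-style: walk the mount so every entry gets a lookup (per the wiki page linked above)
find /mnt/datavol -noleaf -print0 | xargs --null stat > /dev/null

# 3.3: ask the self-heal daemon explicitly, then check what still needs healing
gluster volume heal datavol
gluster volume heal datavol info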
[Gluster-users] Gluster 3.3.0 and VMware ESXi 5
Hi everybody,

we are testing Gluster 3.3 as an alternative to our current Nexenta based storage. With the introduction of granular locking, gluster seems like a viable alternative for VM storage. Regrettably we cannot get it to work even for the most rudimentary tests.

We have a two brick setup with two ESXi 5 servers. We created both distributed and replicated volumes. We can mount the volumes via NFS on the ESXi servers without any issues, but that is as far as we can go. When we try to migrate a VM to the gluster backed datastore there is no activity on the bricks and eventually the operation times out on the ESXi side. The nfs.log shows messages like these (distributed volume):

[2012-06-07 00:00:16.992649] E [nfs3.c:3551:nfs3_rmdir_resume] 0-nfs-nfsv3: Unable to resolve FH: (192.168.11.11:646) vmvol : 7d25cb9a-b9c8-440d-bbd8-973694ccad17
[2012-06-07 00:00:17.027559] W [nfs3.c:3525:nfs3svc_rmdir_cbk] 0-nfs: 3bb48d69: /TEST = -1 (Directory not empty)
[2012-06-07 00:00:17.066276] W [nfs3.c:3525:nfs3svc_rmdir_cbk] 0-nfs: 3bb48d90: /TEST = -1 (Directory not empty)
[2012-06-07 00:00:17.097118] E [nfs3.c:3551:nfs3_rmdir_resume] 0-nfs-nfsv3: Unable to resolve FH: (192.168.11.11:646) vmvol : ----0001

When the volume is mounted on the ESXi servers, we get messages like these in nfs.log:

[2012-06-06 23:57:34.697460] W [socket.c:195:__socket_rwv] 0-socket.nfs-server: readv failed (Connection reset by peer)

The same volumes mounted via NFS on a linux box work fine and we did a couple of benchmarks with bonnie++ with very promising results. Curiously, if we ssh into the ESXi boxes and go to the mount point of the volume, we can see its contents and write.

Any clues of what might be going on? Thanks in advance.

Cheers,
Atha
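[Editor's sketch] As a point of comparison, the working Linux-side mount of the same export would look roughly like this; the server address and volume name are taken from the log above, but the option set is an assumption (ESXi always speaks NFSv3 over TCP, so forcing the same from Linux makes the comparison fairer):

mount -t nfs -o vers=3,proto=tcp,nolock 192.168.11.11:/vmvol /mnt/vmvol-test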
[Gluster-users] Issue recreating volumes
Here are a couple of wrinkles I have come across while trying gluster 3.3.0 under ubuntu-12.04.

(1) At one point I decided to delete some volumes and recreate them. But it would not let me recreate them:

root@dev-storage2:~# gluster volume create fast dev-storage1:/disk/storage1/fast dev-storage2:/disk/storage2/fast
/disk/storage2/fast or a prefix of it is already part of a volume

This is even though "gluster volume info" showed no volumes. Restarting glusterd didn't help either. Nor indeed did a complete reinstall of glusterfs, even with apt-get remove --purge and rm -rf'ing the state directories.

Digging around, I found some hidden state files:

# ls -l /disk/storage1/*/.glusterfs/00/00
/disk/storage1/fast/.glusterfs/00/00:
total 0
lrwxrwxrwx 1 root root 8 Jun  7 14:23 ----0001 -> ../../..

/disk/storage1/safe/.glusterfs/00/00:
total 0
lrwxrwxrwx 1 root root 8 Jun  7 14:21 ----0001 -> ../../..

I deleted them on both machines:

rm -rf /disk/*/.glusterfs

Problem solved? No, not even with a glusterd restart :-(

root@dev-storage2:~# gluster volume create safe replica 2 dev-storage1:/disk/storage1/safe dev-storage2:/disk/storage2/safe
/disk/storage2/safe or a prefix of it is already part of a volume

In the end, what I needed was to delete the actual data bricks themselves:

rm -rf /disk/*/fast
rm -rf /disk/*/safe

That allowed me to recreate the volumes. This is probably an understanding/documentation issue. I'm sure there's a lot of magic going on in the gluster 3.3 internals (is that long ID some sort of replica update sequence number?) which, if it were fully documented, would make it easier to recover from these situations.

(2) Minor point: the FUSE client no longer seems to understand or need the _netdev option, however it still invokes it if you use "defaults" in /etc/fstab, and so you get a warning about an unknown option:

root@dev-storage1:~# grep gluster /etc/fstab
storage1:/safe /gluster/safe glusterfs defaults,nobootwait 0 0
storage1:/fast /gluster/fast glusterfs defaults,nobootwait 0 0
root@dev-storage1:~# mount /gluster/safe
unknown option _netdev (ignored)

Regards,
Brian.
Re: [Gluster-users] Issue recreating volumes
Brian,

The first point (1) is working as intended. Allowing something like that can get the volume into a very complicated state. Please go through the following bug:

https://bugzilla.redhat.com/show_bug.cgi?id=812214

Pranith

- Original Message -
From: Brian Candler b.cand...@pobox.com
To: gluster-users@gluster.org
Sent: Thursday, June 7, 2012 8:57:16 PM
Subject: [Gluster-users] Issue recreating volumes

Here are a couple of wrinkles I have come across while trying gluster 3.3.0 under ubuntu-12.04. (1) At one point I decided to delete some volumes and recreate them. But it would not let me recreate them: [...] In the end, what I needed was to delete the actual data bricks themselves [...] That allowed me to recreate the volumes. [...] (2) Minor point: the FUSE client no longer seems to understand or need the _netdev option [...] Regards, Brian.
Re: [Gluster-users] Gluster 3.3.0 and VMware ESXi 5
Hi Atha,

I have a very similar setup and behaviour here. I have two bricks with replication and I am able to mount the NFS and deploy a machine there, but when I try to Power it On it simply doesn't work and gives a different message saying that it couldn't find some files.

I wonder if anyone actually got it working with VMware ESXi and can share with us their scenario setup. Here I have two CentOS 6.2 and Gluster 3.3.0.

Fernando

-Original Message-
From: gluster-users-boun...@gluster.org [mailto:gluster-users-boun...@gluster.org] On Behalf Of Atha Kouroussis
Sent: 07 June 2012 15:29
To: gluster-users@gluster.org
Subject: [Gluster-users] Gluster 3.3.0 and VMware ESXi 5

Hi everybody, we are testing Gluster 3.3 as an alternative to our current Nexenta based storage. [...] When we try to migrate a VM to the gluster backed datastore there is no activity on the bricks and eventually the operation times out on the ESXi side. [...] Any clues of what might be going on? Thanks in advance. Cheers, Atha
[Gluster-users] Suggestions for Gluster 3.4
It was said in previous emails that suggestions were welcome on how to improve Gluster during the development of the next version, 3.4. Well, I guess we can all put up a list, see what is most popular and useful to most people, and then send it to the developers for consideration.

My list starts with:

- RAID 1E type cluster (the number of nodes doesn't need to be a multiple of either the 'replicate' or 'stripe' count; the cluster can grow by adding a single node).
- Server x Brick awareness (avoids replicating data on two bricks running on the same server; very useful when having multiple logical drives under the same RAID controller for improved performance).
- Rack awareness (for very large clusters: avoids replicating data on servers in the same rack).

Regards,
Fernando
Re: [Gluster-users] Troubleshooting Unified Object and File Storage in 3.3
On Wed 06 Jun 2012 10:25:38 PM PDT, Vijay Bellur wrote:
 On 06/07/2012 03:22 AM, Jason Brooks wrote:
  I've been testing on CentOS 6.2. The only command from the Admin guide I've run successfully has been: curl -v -H 'X-Storage-User: test:tester' -H 'X-Storage-Pass:testing' -k http://127.0.0.1:8080/auth/v1.0. I started out with a centos machine running gluster-swift, which I was connecting to a four node gluster cluster. It wasn't clear to me from the admin guide where I was supposed to mount my gluster volume,
 You will need to mount the gluster volume at /mnt/gluster-object/account. For the example in the admin guide, /mnt/gluster-object/AUTH_test needs to be the mountpoint for your gluster volume.

Thanks -- that helps a lot.

Another Q on the admin guide. Under "12.4.4. Configuring Authentication System" the guide says "Proxy server must be configured to authenticate using tempauth." Is this the only supported auth method? I'm experimenting with keystone.

Thanks, Jason
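[Editor's sketch] In other words, the mapping Vijay describes is simply one gluster mount per UFO account; the volume and server names below are assumptions:

mkdir -p /mnt/gluster-object/AUTH_test
mount -t glusterfs gluster-server:/testvol /mnt/gluster-object/AUTH_test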
Re: [Gluster-users] Troubleshooting Unified Object and File Storage in 3.3
On 06/06/2012 10:25 PM, Vijay Bellur wrote:
 You will need to mount the gluster volume at /mnt/gluster-object/account. For the example in admin guide, /mnt/gluster-object/AUTH_test needs to be the mountpoint for your gluster volume.

There's something I'm confused about -- if I mount my gluster volume at AUTH_test, I am able to work with it, but is the idea that users should manually create a gluster volume and mountpoint for every account?

I've been working through this Fedora 17 openstack howto: https://fedoraproject.org/wiki/Getting_started_with_OpenStack_on_Fedora_17#Configure_swift_with_keystone. I thought I'd bring gluster into the mix, but it's not clear to me how the setup directions I see here and elsewhere for swift ought to interact with the gluster-swift packages.

The gluster-swift-plugin places a set of configuration files into /etc/swift -- the 1.conf files and the ring configurations. The admin guide doesn't mention any swift-ring-builder operations -- are these not required with UFO?

Thanks, Jason
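[Editor's sketch] For reference, once authentication works, the next steps against the standard Swift API look roughly like this; the token value is made up, and the account/container names are just the admin guide's test ones:

# get a token (as in the command quoted earlier in the thread)
curl -v -H 'X-Storage-User: test:tester' -H 'X-Storage-Pass: testing' -k http://127.0.0.1:8080/auth/v1.0

# then use the returned X-Auth-Token and X-Storage-Url, e.g.
curl -v -X PUT -H 'X-Auth-Token: AUTH_tk0123456789abcdef' http://127.0.0.1:8080/v1/AUTH_test/mycontainer
curl -v -X PUT -H 'X-Auth-Token: AUTH_tk0123456789abcdef' -T /tmp/hello.txt http://127.0.0.1:8080/v1/AUTH_test/mycontainer/hello.txt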
Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)
Hi All,

Thanks for the great feedback. I had changed IPs and noticed one server wasn't connecting correctly when checking the log. To ensure I had no wrong-doings I've re-done the bricks from scratch with clean configurations, with mount info attached below, and it is still not performing 'great' compared to a single NFS mount.

With the application we're running our files don't change, we only add / delete files, so I'd like to get directory / file info cached as much as possible.

Config info:

gluster volume info data-storage

Volume Name: data-storage
Type: Replicate
Volume ID: cc91c107-bdbb-4179-a097-cdd3e9d5ac93
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: fs1:/data/storage
Brick2: fs2:/data/storage

On my web1 node I mounted:

# mount -t glusterfs fs1:/data-storage /storage

I've copied my data over to it again, and doing an ls several times takes ~0.5 seconds:

[@web1 files]# time ls -all|wc -l
1989
real    0m0.485s
user    0m0.022s
sys     0m0.109s
[@web1 files]# time ls -all|wc -l
1989
real    0m0.489s
user    0m0.016s
sys     0m0.116s
[@web1 files]# time ls -all|wc -l
1989
real    0m0.493s
user    0m0.018s
sys     0m0.115s

Doing the same thing on the raw os files on one node takes 0.021s:

[@fs2 files]# time ls -all|wc -l
1989
real    0m0.021s
user    0m0.007s
sys     0m0.015s
[@fs2 files]# time ls -all|wc -l
1989
real    0m0.020s
user    0m0.008s
sys     0m0.013s

Now the full directory listing even seems slower...:

[@web1 files]# time ls -alR|wc -l
2242956
real    74m0.660s
user    0m20.117s
sys     1m24.734s
[@web1 files]# time ls -alR|wc -l
2242956
real    26m27.159s
user    0m17.387s
sys     1m11.217s
[@web1 files]# time ls -alR|wc -l
2242956
real    27m38.163s
user    0m18.333s
sys     1m19.824s

Just as a crazy reference, on another single server with SSDs (RAID 10) I get:

files# time ls -alR|wc -l
2260484
real    0m15.761s
user    0m5.170s
sys     0m7.670s

for the same operation (this server even has more files...).

My goal is to get this directory listing as fast as possible. I don't have the hardware/budget to test an SSD configuration, but would an SSD setup give me ~1 minute directory listing time (assuming it is 4 times slower than a single node)?

If I added two more bricks to the cluster / replicated, would this double read speed?

Thanks for any insight!
storage.log from web1 on mount:

[2012-06-07 20:47:45.584320] I [glusterfsd.c:1666:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.3.0
[2012-06-07 20:47:45.624548] I [io-cache.c:1549:check_cache_size_ok] 0-data-storage-quick-read: Max cache size is 8252092416
[2012-06-07 20:47:45.624612] I [io-cache.c:1549:check_cache_size_ok] 0-data-storage-io-cache: Max cache size is 8252092416
[2012-06-07 20:47:45.628148] I [client.c:2142:notify] 0-data-storage-client-0: parent translators are ready, attempting connect on transport
[2012-06-07 20:47:45.631059] I [client.c:2142:notify] 0-data-storage-client-1: parent translators are ready, attempting connect on transport
Given volfile:
+--+
  1: volume data-storage-client-0
  2:     type protocol/client
  3:     option remote-host fs1
  4:     option remote-subvolume /data/storage
  5:     option transport-type tcp
  6: end-volume
  7:
  8: volume data-storage-client-1
  9:     type protocol/client
 10:     option remote-host fs2
 11:     option remote-subvolume /data/storage
 12:     option transport-type tcp
 13: end-volume
 14:
 15: volume data-storage-replicate-0
 16:     type cluster/replicate
 17:     subvolumes data-storage-client-0 data-storage-client-1
 18: end-volume
 19:
 20: volume data-storage-write-behind
 21:     type performance/write-behind
 22:     subvolumes data-storage-replicate-0
 23: end-volume
 24:
 25: volume data-storage-read-ahead
 26:     type performance/read-ahead
 27:     subvolumes data-storage-write-behind
 28: end-volume
 29:
 30: volume data-storage-io-cache
 31:     type performance/io-cache
 32:     subvolumes data-storage-read-ahead
 33: end-volume
 34:
 35: volume data-storage-quick-read
 36:     type performance/quick-read
 37:     subvolumes data-storage-io-cache
 38: end-volume
 39:
 40: volume data-storage-md-cache
 41:     type performance/md-cache
 42:     subvolumes data-storage-quick-read
 43: end-volume
 44:
 45: volume data-storage
 46:     type debug/io-stats
 47:     option latency-measurement off
 48:     option count-fop-hits off
 49:     subvolumes data-storage-md-cache
 50: end-volume
+--+
[2012-06-07 20:47:45.642625] I [rpc-clnt.c:1660:rpc_clnt_reconfig] 0-data-storage-client-0: changing port to 24009 (from 0)
[2012-06-07 20:47:45.648604] I [rpc-clnt.c:1660:rpc_clnt_reconfig] 0-data-storage-client-1: changing port to 24009 (from 0)
[2012-06-07 20:47:49.592729] I
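[Editor's sketch] Given the stated goal of caching directory/file info as much as possible for data that never changes in place, one knob sometimes tried is raising the FUSE metadata timeouts at mount time. This is only a sketch under the assumption that your mount.glusterfs accepts these options (they correspond to the glusterfs --attribute-timeout / --entry-timeout flags, which default to 1 second); stale metadata may then be served for up to the timeout:

# values are in seconds; verify the options are accepted before relying on them
mount -t glusterfs -o attribute-timeout=30,entry-timeout=30 fs1:/data-storage /storage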
Re: [Gluster-users] Gluster 3.3.0 and VMware ESXi 5
Hi Fernando,

thanks for the reply. I'm seeing exactly the same behavior. I'm wondering if it somehow has to do with locking. I read here (http://community.gluster.org/q/can-not-mount-nfs-share-without-nolock-option/) that locking on NFS was not implemented in 3.2.x and it is now in 3.3. I tested 3.2.x with ESXi a few months ago and it seemed to work fine, but the lack of granular locking made it a no-go back then.

Anybody care to chime in with any suggestions? Is there a way to revert NFS to 3.2.x behavior to test?

Cheers,
Atha

On Thursday, June 7, 2012 at 11:52 AM, Fernando Frediani (Qube) wrote:
 Hi Atha, I have a very similar setup and behaviour here. I have two bricks with replication and I am able to mount the NFS, deploy a machine there, but when I try to Power it On it simply doesn't work and gives a different message saying that it couldn't find some files. I wonder if anyone actually got it working with VMware ESXi and can share with us their scenario setup. Here I have two CentOS 6.2 and Gluster 3.3.0. Fernando [...]
Re: [Gluster-users] Issue recreating volumes
Hi Brian,

Answers inline.

 (1) At one point I decided to delete some volumes and recreate them. But it would not let me recreate them:
 root@dev-storage2:~# gluster volume create fast dev-storage1:/disk/storage1/fast dev-storage2:/disk/storage2/fast
 /disk/storage2/fast or a prefix of it is already part of a volume
 [...] In the end, what I needed was to delete the actual data bricks themselves [...] That allowed me to recreate the volumes. This is probably an understanding/documentation issue.

Preventing the 're-creation' of a volume (actually, internally it just prevents you from 're-using' the bricks; you can create the same volume name with different bricks) is very much intentional, to prevent disasters (like data loss) from happening.

We treat data separately from the volume's config information. Hence, when a volume is 'delete'd, only the configuration details of the volume are lost, but the data belonging to the volume is present on its bricks as is. It is the admin's discretion to handle the data later.

Considering the above point, if we allowed 're-using' of the same brick which was part of some volume earlier, it could lead to issues of data placement in the wrong brick, internal inode number clashes etc., which could lead to 'healing' the data from the client perspective, possibly deleting some files which would be important.

If the admin is aware of the case, and knows that there is no 'data' inside the brick, then the easier option is to delete the export dir; it gets created again by 'gluster volume create'. If you want to fix it without deleting the export directory, then that is also possible, by deleting the extended attributes on the brick like below:

bash# setfattr -x trusted.glusterfs.volume-id $brickdir
bash# setfattr -x trusted.gfid $brickdir

And now, creating the brick should succeed.

 (2) Minor point: the FUSE client no longer seems to understand or need the _netdev option, however it still invokes it if you use "defaults" in /etc/fstab, and so you get a warning about an unknown option: [...]

Will look into this.

Regards,
Amar
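[Editor's sketch] Putting Amar's two options together, cleaning a brick directory for re-use when you are sure its data is disposable might look like this; the brick path is an example, and it would be run on every brick of the old volume:

brickdir=/disk/storage1/fast
setfattr -x trusted.glusterfs.volume-id $brickdir   # clear the volume-id recorded at create time
setfattr -x trusted.gfid $brickdir                  # clear the gfid on the brick root
rm -rf $brickdir/.glusterfs                         # drop the hidden gfid index left behind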
Re: [Gluster-users] Issue recreating volumes
One can use the clear_xattrs.sh script with the bricks as arguments to remove all the xattrs set on the bricks; it recursively deletes all xattrs from the bricks' files. After running this script on the bricks, we can re-use them.

Regards,
Rajesh Amaravathi,
Software Engineer, GlusterFS
RedHat Inc.

- Original Message -
From: Amar Tumballi ama...@redhat.com
To: Brian Candler b.cand...@pobox.com
Cc: gluster-users@gluster.org
Sent: Friday, June 8, 2012 10:34:08 AM
Subject: Re: [Gluster-users] Issue recreating volumes

Hi Brian, Answers inline. [...] Preventing 'recreating' of a volume (actually internally, it just prevents you from 're-using' the bricks) is very much intentional to prevent disasters (like data loss) from happening. [...] If you want to fix it without deleting the export directory, then it is also possible, by deleting the extended attributes on the brick like below. bash# setfattr -x trusted.glusterfs.volume-id $brickdir bash# setfattr -x trusted.gfid $brickdir And now, creating the brick should succeed. [...] Will look into this. Regards, Amar