Re: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)
Thank you for that information. Are there plans to restore the previous functionality in a later release of 3.6.x? Or is this what we should expect going forward?

On Thu, Nov 20, 2014 at 11:24 PM, Anuradha Talur <ata...@redhat.com> wrote:

----- Original Message -----
From: Joe Julian <j...@julianfamily.org>
To: Anuradha Talur <ata...@redhat.com>, Vince Loschiavo <vloschi...@gmail.com>
Cc: gluster-users@gluster.org
Sent: Friday, November 21, 2014 12:06:27 PM
Subject: Re: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)

On November 20, 2014 10:01:45 PM PST, Anuradha Talur <ata...@redhat.com> wrote:

----- Original Message -----
From: Vince Loschiavo <vloschi...@gmail.com>
To: gluster-users@gluster.org
Sent: Wednesday, November 19, 2014 9:50:50 PM
Subject: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)

Hello Gluster Community,

I have been using the Nagios monitoring scripts, mentioned in the thread below, on 3.5.2 with great success. The most useful of these is the self-heal check. However, I've just upgraded to 3.6.1 in the lab, and the self-heal daemon has become quite aggressive: I continually get alerts/warnings on 3.6.1 that virt disk images need self-heal, and then they clear. This is not the case on 3.5.2.

Configuration: 2-node, 2-brick replicated volume with a 2x1Gb LAG network between the peers, using this volume as a QEMU/KVM virt image store through the FUSE mount on CentOS 6.5.

Example on 3.5.2: "gluster volume heal volumename info" shows the bricks and the number of entries to be healed: 0. During normal gluster operations, I can run this command over and over again, 2-4 times per second, and it will always show 0 entries to be healed. I've used this as an indicator that the bricks are synchronized.

Last night I upgraded to 3.6.1 in the lab and I'm seeing different behavior. Running "gluster volume heal volumename info" during normal operations will show a file out of sync, seemingly between every block written to disk and then synced to the peer. I can run the command over and over again, 2-4 times per second, and it will almost always show something out of sync. The individual files change, meaning:

1st run: shows file1 out of sync
2nd run: shows file2 and file3 out of sync, but file1 is now in sync (not in the list)
3rd run: shows file3 and file4 out of sync, but file1 and file2 are in sync (not in the list)
...
nth run: shows 0 files out of sync
nth+1 run: shows file3 and file12 out of sync

From looking at the virtual machines running off this gluster volume, it's obvious that gluster is working well. However, this plays havoc with Nagios: it will run heal info, get different and non-useful results each time, and send alerts.

Is this behavior change (3.5.2 vs 3.6.1) expected? Is there a way to tune the settings, or change the monitoring method, to get better results into Nagios?

In 3.6.1, the way the heal info command works is different from that in 3.5.2: it is the self-heal daemon that gathers the entries that might need healing. Currently, in 3.6.1, there is no way to distinguish, while listing, between a file that is being healed and a file with ongoing I/O. Hence files under normal operation are also listed in the output of the heal info command.

How did that regression pass?!

Test cases to check this condition were not written in the regression tests.

--
Thanks,
Anuradha.

--
-Vince Loschiavo

_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users
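[Editor's note: the check being discussed ultimately reduces to running "gluster volume heal <volume> info" and summing the per-brick "Number of entries" counts. A minimal sketch of that parsing step; the sample output layout below is an assumption based on the command as described in this thread, not captured from a real cluster:]

```python
import re
import subprocess


def count_heal_entries(output: str) -> int:
    """Sum the 'Number of entries: N' lines across all bricks."""
    return sum(int(n) for n in re.findall(r"Number of entries:\s*(\d+)", output))


def check_volume(volume: str) -> int:
    # Hypothetical invocation; requires the gluster CLI on the monitoring host.
    result = subprocess.run(
        ["gluster", "volume", "heal", volume, "info"],
        capture_output=True, text=True, check=True,
    )
    return count_heal_entries(result.stdout)


if __name__ == "__main__":
    # Assumed output shape: brick header, pathnames, then an entry count.
    sample = (
        "Brick node1:/bricks/b1\n/images/vm1.img\nNumber of entries: 1\n\n"
        "Brick node2:/bricks/b1\nNumber of entries: 0\n"
    )
    print(count_heal_entries(sample))  # 1
```

A Nagios wrapper would map a non-zero total to WARNING/CRITICAL exit codes; on 3.6.1, as described above, that total is noisy during normal I/O.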
Re: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)
On 11/22/2014 10:12 PM, Vince Loschiavo wrote:

Thank you for that information. Are there plans to restore the previous functionality in a later release of 3.6.x? Or is this what we should expect going forward?

Yes, it will definitely be fixed. Wait for the next release. Things should be fine.

Pranith
Re: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)
----- Original Message -----
From: Vince Loschiavo <vloschi...@gmail.com>
To: gluster-users@gluster.org
Sent: Wednesday, November 19, 2014 9:50:50 PM
Subject: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)

[...]

Is this behavior change (3.5.2 vs 3.6.1) expected? Is there a way to tune the settings, or change the monitoring method, to get better results into Nagios?

In 3.6.1, the way the heal info command works is different from that in 3.5.2: it is the self-heal daemon that gathers the entries that might need healing. Currently, in 3.6.1, there is no way to distinguish, while listing, between a file that is being healed and a file with ongoing I/O. Hence files under normal operation are also listed in the output of the heal info command.

Thank you,
--
-Vince Loschiavo

On Wed, Nov 19, 2014 at 4:35 AM, Humble Devassy Chirammal <humble.deva...@gmail.com> wrote:

Hi Gopu,

Awesome!! We can have a Gluster blog about this implementation.

--Humble

On Wed, Nov 19, 2014 at 5:38 PM, Gopu Krishnan <gopukrishnan...@gmail.com> wrote:

Thanks for all your help... I was able to configure Nagios using the GlusterFS plugin. The following link shows how I configured it; I hope it helps someone else:
http://gopukrish.wordpress.com/2014/11/16/monitor-glusterfs-using-nagios-plugin/

On Sun, Nov 16, 2014 at 11:44 AM, Humble Devassy Chirammal <humble.deva...@gmail.com> wrote:

Hi,

Please look at this thread:
http://gluster.org/pipermail/gluster-users.old/2014-June/017819.html

By the way, if you are around, we have a talk on the same topic at the upcoming GlusterFS India meetup. Details can be fetched from:
http://www.meetup.com/glusterfs-India/

--Humble

On Sun, Nov 16, 2014 at 11:23 AM, Gopu Krishnan <gopukrishnan...@gmail.com> wrote:

How can we monitor the glusters and get alerted if something goes wrong? I found some Nagios plugins, but they haven't worked so far; I am still experimenting with them. Any suggestions would be much helpful.

--
Thanks,
Anuradha.
Re: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)
On November 20, 2014 10:01:45 PM PST, Anuradha Talur <ata...@redhat.com> wrote:

[...]

In 3.6.1, the way the heal info command works is different from that in 3.5.2: it is the self-heal daemon that gathers the entries that might need healing. Currently, in 3.6.1, there is no way to distinguish, while listing, between a file that is being healed and a file with ongoing I/O. Hence files under normal operation are also listed in the output of the heal info command.

How did that regression pass?!
Re: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)
----- Original Message -----
From: Joe Julian <j...@julianfamily.org>
To: Anuradha Talur <ata...@redhat.com>, Vince Loschiavo <vloschi...@gmail.com>
Cc: gluster-users@gluster.org
Sent: Friday, November 21, 2014 12:06:27 PM
Subject: Re: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)

[...]

How did that regression pass?!

Test cases to check this condition were not written in the regression tests.

--
Thanks,
Anuradha.
Re: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)
Hi Vince,

It could be a behavioural change in how the heal process output is captured in the latest GlusterFS. If that is the case, we may tune the interval at which Nagios collects the heal info output, or some other settings, to avoid continuous alerts. I am CCing the gluster-nagios devs.

--Humble

On Wed, Nov 19, 2014 at 9:50 PM, Vince Loschiavo <vloschi...@gmail.com> wrote:

[...]

Is this behavior change (3.5.2 vs 3.6.1) expected? Is there a way to tune the settings, or change the monitoring method, to get better results into Nagios?
Re: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)
Hi Vince,

Are you referring to the monitoring scripts mentioned in the blog (http://gopukrish.wordpress.com/2014/11/16/monitor-glusterfs-using-nagios-plugin/) or to the scripts that are part of gluster (http://gluster.org/pipermail/gluster-users.old/2014-June/017819.html)? Please confirm.

Thanks,
Nishanth

----- Original Message -----
From: Humble Devassy Chirammal <humble.deva...@gmail.com>
To: Vince Loschiavo <vloschi...@gmail.com>
Cc: gluster-users@gluster.org, Sahina Bose <sab...@redhat.com>, ntho...@redhat.com
Sent: Wednesday, November 19, 2014 11:22:18 PM
Subject: Re: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)

Hi Vince,

It could be a behavioural change in how the heal process output is captured in the latest GlusterFS. If that is the case, we may tune the interval at which Nagios collects the heal info output, or some other settings, to avoid continuous alerts. I am CCing the gluster-nagios devs.

--Humble

[...]
Re: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)
Thank you! I think we may need some sort of dampening method, plus more specific input into Nagios, i.e. details on which files are out of sync rather than just the number of files out of sync.

I'm using these: http://download.gluster.org/pub/gluster/glusterfs-nagios/

On Wed, Nov 19, 2014 at 10:14 AM, Nishanth Thomas <ntho...@redhat.com> wrote:

Hi Vince,

Are you referring to the monitoring scripts mentioned in the blog (http://gopukrish.wordpress.com/2014/11/16/monitor-glusterfs-using-nagios-plugin/) or to the scripts that are part of gluster (http://gluster.org/pipermail/gluster-users.old/2014-June/017819.html)? Please confirm.

Thanks,
Nishanth

[...]
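[Editor's note: one way to sketch the dampening Vince asks for is to sample heal info several times and only report files that appear in every sample, so entries caused by transient in-flight I/O drop out. This is a hypothetical illustration, not how the linked glusterfs-nagios plugin works; the parsing and the three-sample threshold are assumptions:]

```python
from collections import Counter


def parse_files(output: str) -> set:
    """Collect pathnames listed under each brick (lines starting with '/')."""
    lines = (line.strip() for line in output.splitlines())
    return {line for line in lines if line.startswith("/")}


def persistent_files(samples: list, threshold: int) -> set:
    """Return files present in at least `threshold` of the collected samples."""
    counts = Counter()
    for sample in samples:
        counts.update(parse_files(sample))
    return {path for path, seen in counts.items() if seen >= threshold}


# A file listed in all three samples is reported; one seen only once is not.
s1 = "Brick n1:/b1\n/vm1.img\n/vm2.img\nNumber of entries: 2\n"
s2 = "Brick n1:/b1\n/vm1.img\nNumber of entries: 1\n"
s3 = "Brick n1:/b1\n/vm1.img\nNumber of entries: 1\n"
print(sorted(persistent_files([s1, s2, s3], threshold=3)))  # ['/vm1.img']
```

A check built this way also naturally yields the per-file detail Vince mentions, since the alert payload can list the persistent paths instead of a bare count.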
Re: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)
Hi Vince,

Thank you for the quick response. For the time being, to reduce the frequency of the alerts, please check whether flap detection is enabled. If not, please go ahead and enable it. It will suppress the alerts when there is a frequent change in the service status. Currently the plugin checks the number of un-synced entries and, if it is greater than 0, changes the state of the service, which sends the alert. Probably this part requires a change, and we may have to introduce some thresholds to decide whether to change the state of the service or not. Regarding why the upgrade is causing the files to go out of sync, someone else needs to answer.

Thanks,
Nishanth

- Original Message -
From: Vince Loschiavo vloschi...@gmail.com
To: Nishanth Thomas ntho...@redhat.com
Cc: Humble Devassy Chirammal humble.deva...@gmail.com, gluster-users@gluster.org, Sahina Bose sab...@redhat.com
Sent: Wednesday, November 19, 2014 11:46:28 PM
Subject: Re: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)

Thank you! I think we may need some sort of dampening method and more specific input into Nagios, i.e. details on which files are out of sync, versus just the number of files out of sync. I'm using these: http://download.gluster.org/pub/gluster/glusterfs-nagios/

On Wed, Nov 19, 2014 at 10:14 AM, Nishanth Thomas ntho...@redhat.com wrote:

Hi Vince, are you referring to the monitoring scripts mentioned in the blog (http://gopukrish.wordpress.com/2014/11/16/monitor-glusterfs-using-nagios-plugin/) or the scripts that are part of gluster (http://gluster.org/pipermail/gluster-users.old/2014-June/017819.html)? Please confirm.
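Nishanth's flap-detection suggestion maps onto standard Nagios service directives. The sketch below is hypothetical: the host name, service description, and check command are placeholders, not the actual names used by the gluster-nagios plugin, and the thresholds are examples to tune.

```
# Hypothetical Nagios service definition for the gluster self-heal check.
# Names and thresholds are placeholders; adjust to your installation.
define service {
    use                     generic-service
    host_name               gluster-node1
    service_description     Gluster Self-Heal
    check_command           check_gluster_heal!volumename
    flap_detection_enabled  1     ; suppress notifications while state oscillates
    low_flap_threshold      5.0
    high_flap_threshold     20.0
    max_check_attempts      5     ; require several consecutive non-OK results
    retry_interval          1     ; minutes between rechecks while non-OK
}
```

With `max_check_attempts` above 1, a single transient non-zero heal count only puts the service into a SOFT state and no alert is sent unless the condition persists.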
Thanks,
Nishanth

- Original Message -
From: Humble Devassy Chirammal humble.deva...@gmail.com
To: Vince Loschiavo vloschi...@gmail.com
Cc: gluster-users@gluster.org, Sahina Bose sab...@redhat.com, ntho...@redhat.com
Sent: Wednesday, November 19, 2014 11:22:18 PM
Subject: Re: [Gluster-users] v3.6.1 vs v3.5.2 self heal - help (Nagios related)

Hi Vince,

It could be a behavioural change in how the heal process output is captured in the latest GlusterFS. If that is the case, we may tune the interval at which Nagios collects the heal info output, or some other settings, to avoid continuous alerts. I am CCing the gluster-nagios devs.
--Humble
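The "dampening method" Vince asks for could also live in the check itself: sample the heal count several times and report CRITICAL only when entries persist across every sample, so the transient entries seen under normal I/O on 3.6.1 never raise an alert. A sketch under assumptions: `HEAL_CMD`, the sample count, and the sleep interval are placeholders, and in production the command would run `gluster volume heal <vol> info` and parse out the entry count.

```shell
#!/bin/sh
# Sketch: dampened heal check. Report CRITICAL only when the pending
# heal-entry count stays non-zero across several spaced samples.
# HEAL_CMD is a placeholder command that prints the current entry count;
# it is overridable so the logic can be exercised without gluster installed.

: "${HEAL_CMD:=echo 0}"
SAMPLES=3   # consecutive non-zero samples required before alerting
SLEEP=2     # seconds between samples

check_heal_persistent() {
    i=0
    while [ "$i" -lt "$SAMPLES" ]; do
        count=$($HEAL_CMD)
        if [ "$count" -eq 0 ]; then
            echo "OK: heal queue drained on sample $((i + 1))"
            return 0   # Nagios OK
        fi
        i=$((i + 1))
        if [ "$i" -lt "$SAMPLES" ]; then
            sleep "$SLEEP"
        fi
    done
    echo "CRITICAL: $count entries pending across $SAMPLES samples"
    return 2           # Nagios CRITICAL
}
```

The trade-off is latency: a real out-of-sync condition is reported roughly `SAMPLES * SLEEP` seconds later than a naive single-shot check would report it.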