Re: [Ganglia-developers] gmond udp receive buffer errors
On 26/04/12 10:38, Ramon Bastiaans wrote:
> I just sent in this:
>
> * https://github.com/ganglia/monitor-core/pull/34
>
> I changed the patch to behave as you described. See the pull request for details.

Hi Ramon,

Thanks for contributing this patch, I see it is already checked by Jeff so I've only had a quick glance at the code to make sure that it preserves legacy behavior and is suitable for the next 3.3.x release.

It will be in the next release candidate for people to test.

Regards,

Daniel

___
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
Re: [Ganglia-developers] gmond udp receive buffer errors
I just sent in this:

* https://github.com/ganglia/monitor-core/pull/34

I changed the patch to behave as you described. See the pull request for details.

Cheers,

- Ramon.

-- 
ing. R. Bastiaans, B.ICT * Senior Systems Programmer * Operations, Support and Development
SARA
Science Park 140        PO Box 94613
1098 XG Amsterdam NL    1090 GP Amsterdam NL
P.+31 (0)20 592 3000    F.+31 (0)20 668 3167
Re: [Ganglia-developers] gmond udp receive buffer errors
On 23-4-2012 15:26, Daniel Pocock wrote:
> Actually, apr can be a little bit more naughty than that: for Vladimir and myself, attempting to query the buffer size from APR reports the value 0. Querying the underlying socket directly reports another value.

I'm using apr-1.4.2 on Debian squeeze, which version do you have?

Looking at APR's source, it seems that (on Unix) it only reports whether the option has been set, not the actual value of the option:

    apr_status_t apr_socket_opt_get(apr_socket_t *sock,
                                    apr_int32_t opt, apr_int32_t *on)
    {
        switch(opt) {
        default:
            *on = apr_is_option_set(sock, opt);
        }
        return APR_SUCCESS;
    }

So that seems to be the reason it returns 0.

I also noticed that the kernel doubles the value through setsockopt/getsockopt:

    SO_RCVBUF
        Sets or gets the maximum socket receive buffer in bytes. The
        kernel doubles this value (to allow space for bookkeeping
        overhead) when it is set using setsockopt(2), and this doubled
        value is returned by getsockopt(2). The default value is set by
        the /proc/sys/net/core/rmem_default file, and the maximum
        allowed value is set by the /proc/sys/net/core/rmem_max file.
        The minimum (doubled) value for this option is 256.

So the actual size is really half of what is returned by getsockopt.

> You will notice the logging code reports two results, because of the apr issue described above.
>
> For your patch, could you generalise it to allow a value in the config file? This commit will suggest how to go about adding a new config value:
>
> https://github.com/ganglia/monitor-core/commit/bfeb4ce3ad65466a3bef220bb6950403b4f968cd#gmond/conf.pod
>
> The patch should respect the previous behavior - if the config value is unspecified or 0, it should not change anything.
>
> However, because we know there are issues with getting/setting the value through APR, your patch would also need to consider:
>
> - is there a minimum APR version required for the patch to work?

It seems that setting APR_SO_RCVBUF was added to APR in 2003, in version 0.9.4.

> - could you set the value, query the value, and if it hasn't accepted the value, try setting the value on the native socket?
> - or maybe just ignore the APR code completely and go directly to set the value on the native socket?

I think, to be safe, I will just skip all the APR weirdness and use the native socket. Unless there might be portability issues with that?

I have a patch ready now for both methods, but it seems a bit redundant to do both.

Cheers,

- Ramon.
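The doubling described in the man page excerpt can be observed directly on a native socket. The following is a minimal sketch in plain POSIX C, not code from gmond or APR; the helper name `set_and_query_rcvbuf` is made up for illustration. On Linux the reported value is expected to be twice the request (as long as the request is within `rmem_max`); other systems may simply echo the request back.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: request `requested` bytes of receive buffer on `fd`
 * and return what getsockopt() reports afterwards, or -1 on failure.
 * On Linux the reported number is the doubled, kernel-internal value. */
static int set_and_query_rcvbuf(int fd, int requested)
{
    int reported = 0;
    socklen_t len = sizeof(reported);

    assert(requested > 0);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested, sizeof(requested)) != 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &reported, &len) != 0)
        return -1;

    return reported;
}
```

Calling it on a fresh `socket(AF_INET, SOCK_DGRAM, 0)` descriptor and printing both numbers side by side is enough to see whether the platform doubles the request or clamps it.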
Re: [Ganglia-developers] gmond udp receive buffer errors
On 24/04/12 16:51, Ramon Bastiaans wrote:
> On 23-4-2012 15:26, Daniel Pocock wrote:
>> Actually, apr can be a little bit more naughty than that: for Vladimir and myself, attempting to query the buffer size from APR reports the value 0. Querying the underlying socket directly reports another value.
>
> I'm using apr-1.4.2 on Debian squeeze, which version do you have?
>
> Looking at APR's source it seems as if it only queries (on unix) if the option is set and not the actual value of the option:

Great, thanks for confirming the root cause of this issue.

>> However, because we know there are issues with getting/setting the value through APR, your patch would also need to consider:
>>
>> - is there a minimum APR version required for the patch to work?
>
> Seems setting APR_SO_RCVBUF was added to APR in 2003 to version 0.9.4

I don't think we support 0.9.4 anyway, Ganglia refuses to compile with it, so no extra effort needed to document that.

>> - could you set the value, query the value, and if it hasn't accepted the value, try setting the value on the native socket?
>> - or maybe just ignore the APR code completely and go directly to set the value on the native socket?
>
> Think to be safe I will just skip all the APR weirdness and use the native socket. Unless there might be portability issues with that?

Exactly - we use APR to make Ganglia safer. So we should avoid building in too much native code stuff.

If an apr upstream fix comes quickly, then I suggest ganglia should not include the hack, it should use the proper apr call, and people who have such heavily loaded gmonds that they need this functionality should be told it is only supported on a recent Linux/apr version.

However, given that the problem is quite severe and likely to exist in most current Linux distributions, maybe the current debug messages that I added should also log a warning (or even error) message if (a) the buffer size has been set manually and (b) a bad apr is detected (or querying the value returns 0).

Maybe gmond should even refuse to start if the user has requested a bigger buffer and it is not supported? Then they are forced to find out what is going on and upgrade their apr.
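The "set the value, query it back, then warn or refuse to start" idea from this message could look roughly like the sketch below. It uses the native socket API rather than APR and is not the code from the pull request; `apply_rcvbuf` and the `rcvbuf_result` enum are hypothetical names. It assumes that a requested size of 0 means "leave the socket alone", matching the legacy-behaviour requirement stated in the thread, and that on Linux the value read back is double the effective request.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical result codes for the caller's warn/refuse policy. */
enum rcvbuf_result {
    RCVBUF_UNCHANGED,  /* nothing requested: legacy behaviour preserved */
    RCVBUF_APPLIED,    /* kernel reports at least the requested size */
    RCVBUF_TOO_SMALL,  /* call succeeded but the kernel granted less */
    RCVBUF_ERROR       /* setsockopt() or getsockopt() failed */
};

/* Apply `requested` bytes of receive buffer to `fd` and verify the result.
 * `*reported` receives the raw value returned by getsockopt(). */
static enum rcvbuf_result apply_rcvbuf(int fd, int requested, int *reported)
{
    socklen_t len = sizeof(*reported);
    int effective;

    assert(reported != NULL);
    *reported = 0;

    if (requested <= 0)
        return RCVBUF_UNCHANGED;   /* config value unspecified or 0 */

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested, sizeof(requested)) != 0)
        return RCVBUF_ERROR;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, reported, &len) != 0)
        return RCVBUF_ERROR;

#ifdef __linux__
    effective = *reported / 2;     /* Linux reports the doubled value */
#else
    effective = *reported;
#endif
    return (effective >= requested) ? RCVBUF_APPLIED : RCVBUF_TOO_SMALL;
}
```

A caller could log a warning on `RCVBUF_TOO_SMALL` and treat `RCVBUF_ERROR` as fatal, or treat both as fatal if the stricter "refuse to start" policy is preferred.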
Re: [Ganglia-developers] gmond udp receive buffer errors
This is with gmond version 3.3.1, with a simple udp_recv_channel set like this:

    udp_recv_channel {
      port = 8669
    }

- Ramon.
[Ganglia-developers] gmond udp receive buffer errors
Hi,

While troubleshooting another network issue, I enabled the netstats.py module to report udp_rcvbuferrors.

Ironically, it seems to me as if gmond itself is experiencing UDP receive buffer errors.

When I check /proc/net/udp for drops, amongst other things I see:

    sl local_address rem_address st tx_queue rx_queue tr tm-when retrnsmt uid timeout inode ref pointer drops
    51: :21DD : 07 : 00: 103 0 72590718 2 8803a1a5d140 6676

It shows a drop count of 6676 for a socket with uid 103.

When I check which process has this uid, it is gmond:

    # ps -ef n | grep '103 '
    103 7800 1 0 10:32 ? Ssl 0:04 /usr/sbin/gmond

I have tried tweaking some sysctl settings, increasing rmem for UDP and increasing max_udp_msg_len in gmond.conf, but there seems to be no effect.

Is this possibly a bug, or am I missing something and doing it wrong? ;)

Cheers,

- Ramon.
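The check described in this message (find the socket in /proc/net/udp, then read its uid and drop counter) can be automated. Below is a minimal sketch, assuming the usual Linux column layout named in the header row quoted above; the function and struct names are hypothetical. The port in `local_address` is hexadecimal, so 21DD corresponds to 8669, the port of the `udp_recv_channel` in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical container for the three fields of interest. */
struct udp_sock_stat {
    unsigned int  local_port;  /* decoded from the hex port field */
    unsigned int  uid;         /* owner of the socket */
    unsigned long drops;       /* datagrams dropped on this socket */
};

/* Parse one data line of /proc/net/udp. Returns 0 on success and -1 for
 * lines that do not match (for example the header row). */
static int parse_proc_net_udp_line(const char *line, struct udp_sock_stat *out)
{
    int n;

    assert(line != NULL && out != NULL);

    /* Columns: sl local rem st tx:rx tr:when retrnsmt uid timeout inode
     * ref pointer drops. Fields marked %* are read and discarded. */
    n = sscanf(line,
               " %*d: %*[0-9A-Fa-f]:%x %*[0-9A-Fa-f]:%*x %*x %*x:%*x"
               " %*x:%*x %*x %u %*u %*u %*d %*s %lu",
               &out->local_port, &out->uid, &out->drops);

    return (n == 3) ? 0 : -1;
}
```

Feeding each line of the file through this function and printing entries whose `drops` is non-zero gives the same information as reading the table by eye.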
Re: [Ganglia-developers] gmond udp receive buffer errors
Hi Ramon,

Vladimir asked about similar errors on IRC recently.

I thought buffer sizes may be an issue, so the 3.3.7 release candidate has logging of RX buffer sizes (it is logged at debug level when gmond starts).

It may be interesting and helpful to compare those buffer sizes, system defaults, etc, from your own systems and other people with any similar problem.

Looking at the log output should also show you whether or not gmond is using the values you tried to set at a system level.

Regards,

Daniel
Re: [Ganglia-developers] gmond udp receive buffer errors
Hi Daniel,

Ah ok. Before you sent your email I had already created a small patch for myself.

It almost seems that APR ignores the OS settings (i.e. net.core.rmem_default) and creates a socket with its own default (receive) buffer size.

Attached is a patch against 3.3.6 for lib/apr_net.c that stops the receive buffer errors for me.

The patch sets the buffer size a bit bigger, although I'm not sure what would be a sensible size for gmond. I would think that if you have a large cluster with lots of UDP traffic you would need a bigger receive buffer than for smaller systems.

I will try out 3.3.7 and see what its debug output says about buffer sizes.

Kind regards,

- Ramon.

--- apr_net.c.old 2012-04-13 03:02:27.0 +0200
+++ apr_net.c 2012-04-23 15:00:57.839151626 +0200
@@ -202,6 +202,12 @@
       apr_socket_close(sock);
       return NULL;
     }
+  stat = apr_socket_opt_set(sock, APR_SO_RCVBUF, 1024000);
+  if (stat != APR_SUCCESS)
+    {
+      apr_socket_close(sock);
+      return NULL;
+    }
 
   if(!localsa)
     {
Re: [Ganglia-developers] gmond udp receive buffer errors
On 23/04/12 22:24, Vladimir Vuksan wrote:
> I was having identical issues. I used your patch with the exception that I bumped up buffer size first to 10M from 1M you had. There was a massive improvement but still was seeing some drops so I just decided to bump it up to 30M and it's even better although I still see occasional drops.

If you have such a big buffer, then you could also have latency issues, as it suggests your CPU is just not able to process all the work in time.

You would either need to revise the workload (by splitting clusters, etc) or re-write gmond to be multithreaded (so it can use more cores).
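The latency concern can be put into numbers with a rough back-of-the-envelope calculation: a datagram that arrives when the buffer is full of older data has to wait until everything ahead of it is drained. The sketch below is only that arithmetic; the function name is hypothetical and the drain rate is an assumed input, not a measurement from this thread. Because the kernel also counts bookkeeping overhead against the buffer, the result is best read as an upper bound.

```c
#include <assert.h>

/* Rough upper bound on how long a datagram can wait in a full receive
 * buffer: buffer size divided by the rate at which the reader drains it.
 * Returns -1.0 if the drain rate is not positive. */
static double max_queue_delay_seconds(double buffer_bytes,
                                      double drain_bytes_per_sec)
{
    assert(buffer_bytes >= 0.0);

    if (drain_bytes_per_sec <= 0.0)
        return -1.0;

    return buffer_bytes / drain_bytes_per_sec;
}
```

For example, with the 30M buffer mentioned above and a purely hypothetical drain rate of 10 MB/s, `max_queue_delay_seconds(30e6, 10e6)` evaluates to 3 seconds of possible queueing delay.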
Re: [Ganglia-developers] gmond udp receive buffer errors
Right. I have a few VM aggregators as well as physical hardware. VMs have more issues than physical hardware but are still susceptible to loss. This is very evident with metrics that arrive at the same time, e.g. cron-triggered gmetric jobs.

Also, something unexpected happened. I have two VMs that are a pair, i.e. all nodes send metrics to both so that if one fails we still have metrics. I upgraded one of them, say aggregator2. I did not touch aggregator1, yet UDP errors vanished on aggregator1 as well. Puzzling.

Vladimir
Re: [Ganglia-developers] gmond udp receive buffer errors
I was having identical issues. I used your patch, with the exception that I first bumped the buffer size up to 10M from the 1M you had. There was a massive improvement, but I was still seeing some drops, so I decided to bump it up to 30M. It's even better, although I still see occasional drops.

To really see the effect you need to track udp_inerrors in addition to the receive buffer errors.

Vladimir
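Both counters mentioned here come from the `Udp:` lines of /proc/net/snmp on Linux, which consist of a header line of names followed by a line of values in the same order. The sketch below looks a counter up by name from such a pair of lines. The function names are hypothetical, and the code assumes the caller has already read the two `Udp:` lines from the file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Skip leading whitespace and return the start of the next token,
 * storing its length in *len (0 when the line is exhausted). */
static const char *next_token(const char *s, size_t *len)
{
    s += strspn(s, " \t\n");
    *len = strcspn(s, " \t\n");
    return s;
}

/* Look up the counter called `name` (for example "InErrors" or
 * "RcvbufErrors") in a header/value line pair. Returns 0 on success and
 * -1 if the name is not present. */
static int snmp_counter(const char *header, const char *values,
                        const char *name, unsigned long long *out)
{
    size_t hlen, vlen, nlen;

    assert(header != NULL && values != NULL && name != NULL && out != NULL);
    nlen = strlen(name);

    for (;;) {
        header = next_token(header, &hlen);
        values = next_token(values, &vlen);
        if (hlen == 0 || vlen == 0)
            return -1;                      /* ran out of columns */
        if (hlen == nlen && strncmp(header, name, nlen) == 0) {
            *out = strtoull(values, NULL, 10);
            return 0;
        }
        header += hlen;                     /* advance both lines together */
        values += vlen;
    }
}
```

Sampling `InErrors` and `RcvbufErrors` before and after a buffer change, and comparing the deltas, shows whether the larger buffer actually reduced losses.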