Re: [Ganglia-developers] gmond udp receive buffer errors

2012-05-02 Thread Daniel Pocock


On 26/04/12 10:38, Ramon Bastiaans wrote:
 I just sent in this:
 
  * https://github.com/ganglia/monitor-core/pull/34
 
 I changed the patch to behave as you described. See the pull request for
 details.

Hi Ramon,

Thanks for contributing this patch. I see Jeff has already checked it,
so I've only had a quick glance at the code to make sure that it
preserves legacy behavior and is suitable for the next 3.3.x release.
It will be in the next release candidate for people to test.

Regards,

Daniel


--
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
___
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers


Re: [Ganglia-developers] gmond udp receive buffer errors

2012-04-26 Thread Ramon Bastiaans

I just sent in this:

 * https://github.com/ganglia/monitor-core/pull/34

I changed the patch to behave as you described. See the pull request for 
details.



Cheers,
- Ramon.



--
ing. R. Bastiaans, B.ICT
* Senior Systems Programmer
* Operations, Support and Development

SARA
Science Park 140 PO Box 94613
1098 XG Amsterdam NL 1090 GP Amsterdam NL
P.+31 (0)20 592 3000 F.+31 (0)20 668 3167






Re: [Ganglia-developers] gmond udp receive buffer errors

2012-04-24 Thread Ramon Bastiaans


On 23-4-2012 15:26, Daniel Pocock wrote:
Actually, apr can be a little bit more naughty than that: for Vladimir 
and myself, attempting to query the buffer size from APR reports the 
value 0. Querying the underlying socket directly reports another 
value. I'm using apr-1.4.2 on Debian squeeze, which version do you have? 


Looking at APR's source it seems as if it only queries (on unix) if the 
option is set and not the actual value of the option:


apr_status_t apr_socket_opt_get(apr_socket_t *sock,
                                apr_int32_t opt, apr_int32_t *on)
{
    switch(opt) {
    default:
        *on = apr_is_option_set(sock, opt);
    }
    return APR_SUCCESS;
}

So that seems to be the reason it returns 0.
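
A minimal sketch of what this implies, assuming a POSIX system: the size has to be read from the kernel with getsockopt(2) rather than through apr_socket_opt_get(). The helper name native_rcvbuf is hypothetical. With APR, the descriptor could be obtained with apr_os_sock_get(); that step is left out here so the example builds without APR.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: return the receive buffer size that the kernel
 * reports for the socket fd, or -1 if getsockopt() fails. */
static int native_rcvbuf(int fd)
{
    int size = 0;
    socklen_t len = sizeof(size);

    assert(fd >= 0);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len) != 0)
        return -1;
    return size;
}
```

Unlike the APR call quoted above, this reports the size in bytes rather than a flag saying whether the option was set through APR.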

I also noticed that the kernel doubles the value through 
setsockopt/getsockopt:


   SO_RCVBUF
          Sets or gets the maximum socket receive buffer in bytes. The
          kernel doubles this value (to allow space for bookkeeping
          overhead) when it is set using setsockopt(2), and this doubled
          value is returned by getsockopt(2). The default value is set by
          the /proc/sys/net/core/rmem_default file, and the maximum
          allowed value is set by the /proc/sys/net/core/rmem_max file.
          The minimum (doubled) value for this option is 256.

So the actual size is really half of what is returned by getsockopt
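
The doubling can be observed with a small round trip. This is only a sketch, assuming a POSIX system; the function name rcvbuf_roundtrip is hypothetical, and it uses a throwaway UDP socket rather than gmond's own.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Hypothetical helper: request `requested` bytes of receive buffer on a
 * throwaway UDP socket and return what getsockopt() reports afterwards,
 * or -1 if any system call fails. */
static int rcvbuf_roundtrip(int requested)
{
    int fd;
    int reported = 0;
    socklen_t len = sizeof(reported);

    assert(requested > 0);
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) != 0 ||
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &reported, &len) != 0)
        reported = -1;
    close(fd);
    return reported;
}
```

On Linux, a request that stays below net.core.rmem_max should come back doubled; other platforms may report the requested value unchanged, so the result should only be compared against the request with that in mind.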


You will notice the logging code reports two results, because of the apr
issue described above

For your patch, could you generalise it to allow a value in the config
file?  This commit will suggest how to go about adding a new config value:

https://github.com/ganglia/monitor-core/commit/bfeb4ce3ad65466a3bef220bb6950403b4f968cd#gmond/conf.pod

The patch should respect the previous behavior - if the config value is
unspecified or 0, it should not change anything.

However, because we know there are issues with getting/setting the value
through APR, your patch would also need to consider:

- is there a minimum APR version required for the patch to work?


It seems support for setting APR_SO_RCVBUF was added to APR in 2003, in version 0.9.4.



- could you set the value, query the value, and if it hasn't accepted
the value, try setting the value on the native socket?
- or maybe just ignore the APR code completely and go directly to set
the value on the native socket?

I think, to be safe, I will just skip all the APR weirdness and use the
native socket. Unless there might be portability issues with that?


I have a patch ready now for both methods, but it seems a bit redundant to
do both.




Cheers,
- Ramon.



Re: [Ganglia-developers] gmond udp receive buffer errors

2012-04-24 Thread Daniel Pocock


On 24/04/12 16:51, Ramon Bastiaans wrote:
 
 On 23-4-2012 15:26, Daniel Pocock wrote:
 Actually, apr can be a little bit more naughty than that: for Vladimir
 and myself, attempting to query the buffer size from APR reports the
 value 0. Querying the underlying socket directly reports another
 value. I'm using apr-1.4.2 on Debian squeeze, which version do you have? 
 
 Looking at APR's source it seems as if it only queries (on unix) if the
 option is set and not the actual value of the option:


Great, thanks for confirming the root cause of this issue

 However, because we know there are issues with getting/setting the value
 through APR, your patch would also need to consider:

 - is there a minimum APR version required for the patch to work?
 
 Seems setting APR_SO_RCVBUF was added to APR in 2003 to version 0.9.4

I don't think we support 0.9.4 anyway, Ganglia refuses to compile with
it, so no extra effort needed to document that

 
 - could you set the value, query the value, and if it hasn't accepted
 the value, try setting the value on the native socket?
 - or maybe just ignore the APR code completely and go directly to set
 the value on the native socket?
 Think to be safe I will just skip all the APR weirdness and use the
 native socket. Unless there might be portability issues with that?

Exactly - we use APR to make Ganglia safer.  So we should avoid building
in too much native code stuff

If an apr upstream fix comes quickly, then I suggest ganglia should not
include the hack, it should use the proper apr call, and people who have
such heavily loaded gmonds that they need this functionality should be
told it is only supported on a recent Linux/apr version.

However, given that the problem is quite severe and likely to exist in
most current Linux distributions, maybe the current debug messages that
I added should also log a warning (or even error) message if
(a) the buffer size has been set manually and
(b) a bad apr is detected (or querying the value returns 0)

Maybe gmond should even refuse to start if the user has requested a
bigger buffer and it is not supported?  Then they are forced to find out
what is going on and upgrade their apr.



Re: [Ganglia-developers] gmond udp receive buffer errors

2012-04-23 Thread Ramon Bastiaans
This is with gmond version 3.3.1, with a simple udp_recv_channel set 
like this:


udp_recv_channel {
  port = 8669
}


- Ramon.




--
For Developers, A Lot Can Happen In A Second.
Boundary is the first to Know...and Tell You.
Monitor Your Applications in Ultra-Fine Resolution. Try it FREE!
http://p.sf.net/sfu/Boundary-d2dvs2
___
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers


[Ganglia-developers] gmond udp receive buffer errors

2012-04-23 Thread Ramon Bastiaans

Hi,

While troubleshooting another network issue, I enabled the netstats.py 
module to report udp_rcvbuferrors.


Ironically, it seems to me as if gmond itself is experiencing udp 
receive buffer errors.


When I check out /proc/net/udp for drops, amongst other things I see:

  sl  local_address rem_address   st tx_queue rx_queue tr tm-when retrnsmt   uid  timeout inode ref pointer drops
  51: :21DD : 07 : 00:    1030 72590718 2 8803a1a5d140 6676


It shows a 6676 dropcount for a socket with uid: 103
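
The local_address column holds the port in hexadecimal, so 21DD is port 8669. Below is a minimal sketch of pulling the port, uid and drop count out of one row, assuming the usual 13-column IPv4 layout of /proc/net/udp on kernels that have the drops column. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the three columns of interest. */
struct udp_row {
    unsigned port;        /* local port, decoded from hex */
    unsigned uid;         /* owner of the socket */
    unsigned long drops;  /* datagrams dropped on this socket */
};

/* Parse one data row of /proc/net/udp. Returns 0 on success, -1 if the
 * row does not have the expected 13 whitespace-separated columns. */
static int parse_udp_row(const char *line, struct udp_row *out)
{
    const char *field[13];
    const char *p = line;
    const char *colon;
    int n = 0;

    assert(line != NULL && out != NULL);
    while (n < 13) {
        p += strspn(p, " \t");
        if (*p == '\0' || *p == '\n')
            break;
        field[n++] = p;               /* remember where the column starts */
        p += strcspn(p, " \t\n");     /* skip to the end of the column */
    }
    if (n < 13)
        return -1;

    /* Column 1 is "ADDRESS:PORT", both in hex; the colon must lie inside it. */
    colon = strchr(field[1], ':');
    if (colon == NULL || colon >= field[1] + strcspn(field[1], " \t\n"))
        return -1;

    out->port = (unsigned)strtoul(colon + 1, NULL, 16);
    out->uid = (unsigned)strtoul(field[7], NULL, 10);
    out->drops = strtoul(field[12], NULL, 10);
    return 0;
}
```

Called on a row such as "51: 00000000:21DD 00000000:0000 07 00000000:00000000 00:00000000 00000000 103 0 72590718 2 ffff8803a1a5d140 6676" (the zeroed addresses and the pointer prefix are placeholders), it fills in port 8669, uid 103 and drops 6676.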

When I check out which process has this uid, it is gmond:

# ps -ef n | grep '103 '
     103  7800     1  0 10:32 ?        Ssl    0:04 /usr/sbin/gmond

I have tried tweaking some sysctl settings, increasing rmem for udp and 
increasing max_udp_msg_len in gmond.conf, but there seems to be 
no effect.


Is this possibly a bug, or am I missing something and doing it wrong? ;)


Cheers,
- Ramon.



Re: [Ganglia-developers] gmond udp receive buffer errors

2012-04-23 Thread Daniel Pocock



Hi Ramon,

Vladimir asked about similar errors on IRC recently.

I thought buffer sizes may be an issue, so the 3.3.7 release candidate
logs the RX buffer sizes (at debug level, when gmond starts). It may be
interesting and helpful to compare those buffer sizes, system defaults,
etc., from your own systems and from other people with a similar problem.
Looking at the log output should also show you whether or not gmond is
using the values you tried to set at a system level.

Regards,

Daniel




Re: [Ganglia-developers] gmond udp receive buffer errors

2012-04-23 Thread Ramon Bastiaans

Hi Daniel,

Ah ok. Before you sent your email I had already created a small patch 
for myself. It almost seems that APR ignores the OS settings (e.g. 
net.core.rmem_default) and creates a socket with its own default 
(receive) buffer size.


Attached is a patch against 3.3.6 for lib/apr_net.c that stops the 
receive buffer errors for me.


The patch sets the buffer size a bit bigger, although I'm not sure what 
would be a sensible size for gmond. I would think that a large 
cluster with lots of UDP traffic needs a bigger receive buffer 
than smaller systems do.


I will try out 3.3.7 and see what its debug output says about buffer sizes.


Kind regards,
- Ramon.





--- apr_net.c.old   2012-04-13 03:02:27.0 +0200
+++ apr_net.c   2012-04-23 15:00:57.839151626 +0200
@@ -202,6 +202,12 @@
       apr_socket_close(sock);
       return NULL;
     }
+  stat = apr_socket_opt_set(sock, APR_SO_RCVBUF, 1024000);
+  if (stat != APR_SUCCESS)
+    {
+      apr_socket_close(sock);
+      return NULL;
+    }
 
   if(!localsa)
     {




Re: [Ganglia-developers] gmond udp receive buffer errors

2012-04-23 Thread Daniel Pocock


On 23/04/12 22:24, Vladimir Vuksan wrote:
 I was having identical issues. I used your patch with the exception that
 I bumped up buffer size first to 10M from 1M you had. There was a
 massive improvement but still was seeing some drops so I just decided to
 bump it up to 30M and it's even better although I still see occasional
 drops.

If you have such a big buffer, then you could also have latency issues,
as it suggests your CPU is just not able to process all the work in time.

You would either need to revise the workload (by splitting clusters,
etc.) or rewrite gmond to be multithreaded (so it can use more cores).
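
To put a rough number on the latency concern: the time needed to drain a full buffer is the number of queued datagrams divided by the rate at which gmond can process them. This is a back-of-the-envelope sketch; the function name and all figures in the usage note are assumptions, not measurements.

```c
#include <assert.h>

/* Hypothetical helper: seconds needed to drain a completely full receive
 * buffer, given the buffer space each datagram occupies and the rate at
 * which the receiver can process datagrams. */
static double drain_seconds(double buffer_bytes,
                            double bytes_per_datagram,
                            double datagrams_per_second)
{
    assert(bytes_per_datagram > 0 && datagrams_per_second > 0);
    return (buffer_bytes / bytes_per_datagram) / datagrams_per_second;
}
```

For example, a 30 MiB buffer with an assumed 512 bytes of buffer space per datagram holds 61440 datagrams; at an assumed 10000 datagrams per second, the last datagram in a full buffer waits about 6.1 seconds before it is processed.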




Re: [Ganglia-developers] gmond udp receive buffer errors

2012-04-23 Thread Vladimir Vuksan
Right. I have a few VM aggregators as well as physical hardware. VMs have 
more issues than physical hardware but are still susceptible to loss. This 
is very evident with metrics that arrive at the same time, e.g. 
cron-triggered gmetric jobs.

Also, something unexpected happened. I have two VMs that are a pair, i.e. 
all nodes send metrics to both, so that if one fails we still have metrics. 
I upgraded only aggregator2. I did not touch aggregator1, yet the UDP errors 
vanished on aggregator1 as well. Puzzling.

Vladimir






Re: [Ganglia-developers] gmond udp receive buffer errors

2012-04-23 Thread Vladimir Vuksan
I was having identical issues. I used your patch, except that I first 
bumped the buffer size up to 10M from the 1M you had. There was a massive 
improvement, but I was still seeing some drops, so I decided to bump it 
up to 30M. That is even better, although I still see occasional drops.

To really see the effect, you need to track udp_inerrors in addition to 
the receive buffer errors.
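
On Linux both counters come from the two "Udp:" lines in /proc/net/snmp, a header line with field names and a line with the values in the same order. Below is a minimal sketch of looking up a counter by name, assuming that layout. The function name udp_counter is hypothetical, and reading the file itself is left to the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: find `name` in a "Udp:" header line and return the
 * value at the same position in the matching "Udp:" value line, or -1 if
 * the field is not present. */
static long udp_counter(const char *header, const char *values, const char *name)
{
    size_t want;

    assert(header != NULL && values != NULL && name != NULL);
    want = strlen(name);
    for (;;) {
        size_t hlen, vlen;

        /* Step over separators, then measure the next token on each line. */
        header += strspn(header, " \n");
        values += strspn(values, " \n");
        hlen = strcspn(header, " \n");
        vlen = strcspn(values, " \n");
        if (hlen == 0 || vlen == 0)
            return -1;
        if (hlen == want && strncmp(header, name, want) == 0)
            return strtol(values, NULL, 10);
        header += hlen;
        values += vlen;
    }
}
```

Sampling InErrors and RcvbufErrors this way at intervals and comparing the differences shows whether drops are still happening after a buffer change.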

Vladimir

