We've got a few interesting problems occurring at a network level in our 
systems and I was hoping for some pointers on how to troubleshoot them.  We're 
connecting to both OpenLDAP and Microsoft AD using the LDAP API client.  For 
testing, I've built a quick fixture using 2.0.0-AM4.

1)  We have a variety of devices (security and load-balancers) between our LDAP 
client and the LDAP servers.  Using a connection pool with testWhileIdle = true 
and a timeBetweenEvictionRunsMillis = 3700000 (over an hour) we're seeing the 
load-balancer disconnect unused connections (this is the expected behavior at 1 
hour).  When this happens, we'll typically see between one and four of the 
eight connections receive a RST/ACK in Wireshark which seems to trigger the 
pool to fill back up.  The odd thing is that regardless of how many RST/ACK we 
see, there are always three new connections established and (presumably) added 
to the pool.  We've also got MinIdle = 8, so you would think that this would 
result in the pool being larger or smaller depending on the number of RST/ACKs 
we receive but the pool always reports 8 idle connections.  We're seeing 
symptoms of there being broken connections in the pool and, as expected, if we 
set testOnBorrow = true, this behavior disappears.  My concern is that this 
allows the LDAP operation to proceed but our effective idle connections may be 
far lower than we expect.  Any idea how we might best troubleshoot the pool's 
behavior?  My guess is that since numTestsPerEvictionRun defaults to 3, the 
incoming RST/ACKs trigger a single eviction run that tests only three 
connections, regardless of how many RST/ACKs we receive.
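
To make that guess concrete, here is a toy model in plain Java.  It is not the
pool's actual code; it only assumes that each eviction run validates at most
numTestsPerEvictionRun idle connections (3 by default in Commons Pool) and
replaces the ones that fail.  The class and method names are made up for
illustration.

```java
import java.util.Arrays;

/** Toy model of an idle-connection evictor; NOT the Commons Pool implementation. */
class EvictionRunSketch {

    /**
     * Simulates evictor passes over a pool whose idle connections all start
     * out broken. Each pass validates at most testsPerRun connections and
     * replaces any that fail. Returns the number of connections that are
     * still broken after each pass.
     */
    static int[] brokenAfterEachRun(int poolSize, int testsPerRun, int runs) {
        boolean[] broken = new boolean[poolSize];
        Arrays.fill(broken, true);
        int cursor = 0;                       // next connection the evictor will test
        int[] remaining = new int[runs];
        for (int run = 0; run < runs; run++) {
            int tests = Math.min(testsPerRun, poolSize);
            for (int i = 0; i < tests; i++) {
                broken[cursor] = false;       // tested; if it was broken it is now replaced
                cursor = (cursor + 1) % poolSize;
            }
            int count = 0;
            for (boolean b : broken) {
                if (b) {
                    count++;
                }
            }
            remaining[run] = count;
        }
        return remaining;
    }

    public static void main(String[] args) {
        // 8 idle connections, 3 tests per run, 3 runs
        System.out.println(Arrays.toString(brokenAfterEachRun(8, 3, 3))); // prints [5, 2, 0]
    }
}
```

If the model holds, a single run would replace exactly three connections no
matter how many were reset, and a full sweep of eight connections would take
three runs, i.e. roughly three times timeBetweenEvictionRunsMillis.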

We've got a couple processes that scale running threads up and down based on 
load and I'm concerned that if the pool thinks it has idle connections, it's 
going to lend a broken connection to a thread that spins up.

2)  We have an AD server that periodically locks up.  Its connections look 
fine but it will never answer queries.  We're looking into why only one of the 
six servers has this pathology, but I'm wondering whether using the 
LookupLdapConnectionValidator might help detect when this problem occurs.  I 
know the validators are for detecting when a connection's binding has changed 
but performing the lookup has the side-effect (for us) of showing that the 
connection isn't actually functional.  This problem is not frequent enough to 
really troubleshoot and has been solved by rebooting the server so I don't 
think the long-term answer is a change to how we're using the LDAP API.

3)  Our legacy load-balancer also drops connections that are unused, but 
doesn't send a signal (RST/ACK or FIN/ACK) to the client at all.  Are we even 
going to be able to detect these?  It seems like using the 
LookupLdapConnectionValidator might help with this as well.
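
For both the hung server in 2) and the silent drops in 3), a lookup only helps
as a health check if it cannot block forever.  Below is a minimal sketch of
bounding a probe with a timeout using only java.util.concurrent.  The probe
is a hypothetical stand-in for whatever lookup the validator performs;
nothing here is LDAP API code, and TimedProbeSketch and isResponsive are
names invented for this example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch: treat a connection as dead if a probe does not answer in time. */
class TimedProbeSketch {

    /**
     * Runs the probe on a daemon thread and waits at most timeoutMillis.
     * Returns true only if the probe completed in time and returned true.
     */
    static boolean isResponsive(Callable<Boolean> probe, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "connection-probe");
            t.setDaemon(true);                 // never keep the JVM alive for a stuck probe
            return t;
        });
        Future<Boolean> result = executor.submit(probe);
        try {
            return Boolean.TRUE.equals(result.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException | ExecutionException e) {
            return false;                      // no answer in time, or the probe threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            result.cancel(true);               // best effort; blocked socket I/O may ignore interrupts
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(isResponsive(() -> true, 1000));
        System.out.println(isResponsive(() -> { Thread.sleep(5000); return true; }, 100));
    }
}
```

Creating an executor per probe keeps the sketch short; a shared executor would
be the more likely choice in a real validator.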

One final observation - I'm trying to understand why the default lifo 
configuration is true.  If a stack is effectively being used to manage 
connections, won't the connections "on the bottom" generally be very stale?  If 
lifo = false results in the use of a fifo, won't that tend to balance the use 
of the connections in the pool?
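
A small simulation of what I mean, assuming connections are always borrowed
from the head of the idle queue and returned to the head (lifo = true) or to
the tail (lifo = false).  This is a model of the idea with invented names,
not the pool's code, and it only covers a single thread borrowing and
returning one connection at a time.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/** Toy model of LIFO vs. FIFO idle queues under serial borrow/return. */
class LifoFifoSketch {

    /**
     * Borrows and immediately returns a connection `borrows` times and
     * reports how often each connection id was used. poolSize must be > 0.
     */
    static int[] useCounts(int poolSize, int borrows, boolean lifo) {
        Deque<Integer> idle = new ArrayDeque<>();
        for (int id = 0; id < poolSize; id++) {
            idle.addLast(id);
        }
        int[] uses = new int[poolSize];
        for (int i = 0; i < borrows; i++) {
            int id = idle.pollFirst();        // always borrow from the head
            uses[id]++;
            if (lifo) {
                idle.addFirst(id);            // most recently used goes back on top
            } else {
                idle.addLast(id);             // most recently used goes to the back
            }
        }
        return uses;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(useCounts(8, 16, true)));  // prints [16, 0, 0, 0, 0, 0, 0, 0]
        System.out.println(Arrays.toString(useCounts(8, 16, false))); // prints [2, 2, 2, 2, 2, 2, 2, 2]
    }
}
```

Under light, serial load the LIFO model keeps reusing one connection while
the other seven sit idle, which is the staleness I'm worried about.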

Hopefully this all makes sense ... I should note that in general the LDAP API 
is working well against a very old version of AD, a new version of AD and 
several versions of OpenLDAP.  Right now, we're working towards finding a 
connection and pool configuration that best handles the active network devices 
used to provide resiliency and security to our organization.

Thanks for any insights you might provide!

Steve
