we recently switched from the whalin client to spy and seem to be running into problems under heavy concurrency/load on our front-end servers. i was wondering if anybody (dustin, perhaps?) had any ideas for strategies to deal with it.

the majority of our front-end servers are sun fire t1000s (8 cores, 4 threads per core) running solaris 10, so the spy client works a lot better for us in the vast majority of cases. in particular, the synchronized blocks in the whalin connection pool gave us a lot of contention problems.

when the systems get busy, though, it seems that i/o can't keep up and we start seeing a lot of timeouts, which in turn has a domino effect and effectively brings down the entire cluster. the odd part is that the machines aren't even reaching 60% cpu when this happens.

does my diagnosis of the problem seem right and, if so, any ideas for the best way to deal with it? adjusting the timeouts would probably only exacerbate the problem, so i toyed with two ideas: having a pool of clients (though i haven't really delved into the code to see if that's feasible or would help at all), or possibly hacking the client to change how its i/o threads work.

for now, we've just added a few more machines to this cluster, but that seems like a waste of hardware when i know these things can operate above 90% cpu for a sustained period with no problem.

thanks... any help would be great, and let me know if you have any more questions about specifics.

-- awl
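
for concreteness, the pool-of-clients idea could be sketched roughly as below. this is only a minimal illustration under assumptions, not anything from the spy codebase: `RoundRobinPool` is a hypothetical helper name, and it is written generically so it does not depend on any particular client api. the point is that picking a pool member uses an atomic counter rather than a synchronized block, so the pool itself should not reintroduce the lock contention described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: hands out pooled objects round-robin without locking. */
final class RoundRobinPool<T> {
    private final List<T> members;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinPool(List<T> members) {
        if (members.isEmpty()) {
            throw new IllegalArgumentException("pool must not be empty");
        }
        // Defensive copy so later changes to the caller's list do not affect the pool.
        this.members = new ArrayList<>(members);
    }

    /** Returns the next member; safe to call from many threads at once. */
    T next() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int index = Math.floorMod(counter.getAndIncrement(), members.size());
        return members.get(index);
    }

    int size() {
        return members.size();
    }
}
```

in this sketch each pool member would be a separately constructed client instance, and callers would do something like `pool.next()` before each request. whether spreading requests over several clients actually relieves the i/o bottleneck is exactly the open question here, so treat this as a starting point for experiments rather than a fix.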
