D H wrote:

I agree with this; they wanted proof of a fix before any changes would be
made, and my manager is still saying there must be proof before code changes,
even now after I showed him your email and the documentation from the site.

Nice way of thinking -- at a university where time and money are infinite resources. Wishful thinking anywhere else. Enter The Real World.

Anyway if this is so hard to reproduce and profiling didn't give you any
idea what's broken, why do you suspect something in this particular piece of
code is the cause of your problem?


That's a very good question. I was given the source code with no explanation
of their code, and was told HttpClient was the problem.

Why am I not convinced? Still, it's quite possible. The best advice I can give: upgrade, for God's sake, and fix obvious mistakes in the use of the API. Use monitoring tools like JConsole / JVisualVM, jmap, netstat, top and the like to see whether you have a garbage-collection problem or any obvious resource leaks, then take appropriate action.
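As a starting point before reaching for external tools, the JVM can report its own heap and GC statistics. This is only a minimal sketch using the standard java.lang.management API; the class and method names are mine, not anything from this thread:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class GcCheck {
    // Builds a one-line summary of current heap usage and cumulative GC
    // activity. A GC time that keeps climbing between calls while heap
    // "used" stays near "max" is a hint of a memory problem.
    static String summary() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long gcCount = 0, gcTimeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both values may be -1 if the collector does not report them.
            gcCount += Math.max(0, gc.getCollectionCount());
            gcTimeMs += Math.max(0, gc.getCollectionTime());
        }
        return String.format("heap used=%d/%d bytes, gc runs=%d, gc time=%d ms",
                heap.getUsed(), heap.getMax(), gcCount, gcTimeMs);
    }

    public static void main(String[] args) {
        System.out.println(summary());
    }
}
```

Logging such a line periodically from the production JVM itself is cheap and survives even when you cannot attach JConsole remotely.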

It is a rather large
codebase, so I'm taking their word for it. They've apparently worked on it
sporadically for a couple of months to isolate it to HttpClient, and my task
is to prove it is causing this problem with JMeter in a self-contained code
sample. This problem has only been seen in Production, and only after almost
a week of running 24/7, so it's hard to duplicate. I've sent over
two hundred thousand HttpClient requests without seeing the problem, so personally I'd
rather see this code fix go to Production and test that way.

Sounds like a typical production problem to me: it can take weeks to show up, there is no way to trigger it on purpose, and maybe it has only ever been encountered on the production system.

Face it: you will not reproduce it locally in reasonable time. It may depend on the workload you are running. Maybe it's even platform specific and may not trigger on your testing platform -- and by platform I don't mean just the OS. The processor type (single core vs. multi core) can also make a huge difference.

What can really help you is to be prepared for the situation in production. Instead of panicking and quickly restarting, take your time and have the right tools at hand to find out what's going on at that moment. Maybe even set up a "post mortem" job that gathers useful information in case this happens while everyone is sleeping. Questions to answer:
- Is it swapping?
- Has the VM run out of memory (stack, heap, perm gen, code) and is it constantly GCing?
- Has the OS run out of file descriptors?
- To which signals does it react?
- Is it creating threads at a high rate? Are there just too many runnable threads?
- Is it busy waiting? Is it looping endlessly?
- Is it I/O bound? Is it blocking on I/O or the network?
- Is it lock contended or even deadlocked?
- Is it waiting for the DB? What's going on on the DB? What's going on on the network?
- Is your logging detailed enough to give you the information you need?
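Some of that "post mortem" data can be gathered from inside the JVM itself, without shell access to the box. Here is a minimal sketch, again using only java.lang.management; the class name PostMortem and the exact set of metrics are my own illustration:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

public class PostMortem {
    // Collects a small snapshot of thread and heap state. In a real
    // deployment you would append this to a log file on a timer, or
    // trigger it from a watchdog when the app stops responding.
    static String report() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        long[] deadlocked = threads.findDeadlockedThreads(); // null if none
        StringBuilder sb = new StringBuilder();
        sb.append("live threads: ").append(threads.getThreadCount()).append('\n');
        sb.append("peak threads: ").append(threads.getPeakThreadCount()).append('\n');
        sb.append("heap used: ").append(mem.getHeapMemoryUsage().getUsed())
          .append(" bytes\n");
        sb.append("deadlocked threads: ")
          .append(deadlocked == null ? 0 : deadlocked.length).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

A full thread dump (jstack, or `kill -3` on Unix) and a heap dump via jmap complement this snapshot when the problem is actually happening.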

If all else fails, you may have to live with it and instead set up monitoring infrastructure that can reliably detect the situation and restart the process.
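The restart-on-failure idea boils down to a watchdog loop: poll a health probe, and fire a restart action when it fails. This is only an illustrative sketch; the Probe interface and monitor method are hypothetical names of mine, and a real watchdog would poll on a timer and restart an external process:

```java
public class Watchdog {
    interface Probe { boolean healthy(); }

    // Polls the probe up to maxPolls times. Runs the restart action and
    // returns the poll number at which the probe first failed, or -1 if
    // the probe stayed healthy throughout.
    static int monitor(Probe probe, int maxPolls, Runnable restart) {
        for (int i = 1; i <= maxPolls; i++) {
            if (!probe.healthy()) {
                restart.run();
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated probe that fails on its third check.
        int polled = monitor(() -> ++calls[0] < 3, 5,
                () -> System.out.println("restarting"));
        System.out.println("restart after poll " + polled);
    }
}
```

In practice the probe would be an HTTP ping or a log scan, and the restart action a call to the service manager, so that the 3 a.m. incident resolves itself while the post-mortem data is saved for the morning.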


I really appreciate you taking the time to answer my emails, thank you very
much.

Sincerely,
David Hamilton

Ortwin

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
