Hi all,

I've been trying to track down a problem at work for the last couple of days, but my efforts have been scuppered by a general lack of knowledge of low-ish level debugging/tracing tools.

The problem is that a closed-source multi-threaded program we are using hits issues when we get it to spawn more than roughly 250 threads. I'm deliberately preserving the program's anonymity so as to protect the (presumed) innocent at this point. When the program fails, we end up with messages about failures in the select() call, with both 'bad file descriptor' and 'invalid argument' errors.

At the moment, we're not sure whether this is a problem with the closed-source program itself, or with some user-space or kernel configuration or bug. The server we're running this on is a 4-CPU quad-core box with 16GB of RAM, so it has plenty of grunt. The same program also runs on a Windows laptop with 500 threads without any issues!

Obvious things like 'top' and 'df' don't show any problems. I've tried running a couple of systemtap scripts (the nettop and socket-trace examples), but they don't appear to show anything useful. When I attach 'strace' to the running process, it just seems to hang on futex_wait, so I can't see any select() calls or their arguments.

So, does anyone have any secrets in their sysadmin toolbox that may be of use here? Any help at all would be greatly appreciated, even if it's just pointers to more suitable mailing lists.

Thanks in advance,

Matt.
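
One avenue that might be worth ruling out, offered as a sketch under assumptions rather than a diagnosis: on Linux with glibc, select() can only watch descriptors numbered below FD_SETSIZE (1024), and the soft limit on open files is often 1024 too. If each thread holds a few descriptors, a few hundred threads could plausibly reach that range. The Python sketch below reads /proc/<pid>/fd and /proc/<pid>/limits to count a process's open descriptors and flag any numbered 1024 or above. The helper names (open_fds, soft_nofile_limit, fd_report) are made up for illustration, and the 1024 constant is an assumption about the platform's headers.

```python
import os
import sys

# Assumption: glibc's compile-time FD_SETSIZE; check <sys/select.h> on the box.
FD_SETSIZE = 1024


def open_fds(pid):
    """Return the sorted descriptor numbers listed under /proc/<pid>/fd."""
    return sorted(int(name) for name in os.listdir(f"/proc/{pid}/fd"))


def soft_nofile_limit(pid):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits, or None."""
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith("Max open files"):
                soft = line.split()[3]
                return int(soft) if soft.isdigit() else None
    return None


def fd_report(pid):
    """Summarise descriptor usage for pid and list fds select() cannot watch."""
    fds = open_fds(pid)
    return {
        "count": len(fds),
        "max_fd": max(fds) if fds else None,
        "soft_limit": soft_nofile_limit(pid),
        "over_fd_setsize": [fd for fd in fds if fd >= FD_SETSIZE],
    }


if __name__ == "__main__":
    # With no argument, inspect this script's own process as a smoke test.
    target = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for key, value in fd_report(target).items():
        print(f"{key}: {value}")
```

Saved under any file name and run with the PID of the failing process as its only argument (as the same user or as root, since /proc/<pid>/fd is not readable otherwise), it prints the four fields. A non-empty over_fd_setsize list, or a count close to soft_limit, would at least be consistent with the select() errors, though it would not prove the cause.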
