Hi all,

On 9. Jul 2014, at 10:48, "ROHN Carsten [via Software]" <[email protected]> wrote:
> From your description, I think you're getting a problem with one of the major
> bottlenecks of Collage: reading from many connections happens in one thread.
> We see this issue mostly in HPC context, where we have up to 150 nodes
> rendering for one master. One thread is just too slow to handle so many
> connections.

I'm not sure. Since it stalls heavily every now and then, I would lean
towards a different problem: something that periodically interrupts
processing which otherwise runs at around 100 FPS, similar to a garbage
collection pause in Java.

> We even have the effect of connections "starving", because the reader doesn't
> even have the chance to select the connections in the "back" of the
> connection set. This is another thing to be done for Collage: handle
> connections fairly on the reader side, to avoid the starving effect.

This is implemented on Linux, but not yet on Windows (see
https://github.com/Eyescale/Collage/issues/38).

> This topic is very interesting for us as well, please keep me updated.

Indeed :)

> P.S.: Maybe you're not even having a software issue. We had a similar problem
> at a customer (twice by now). After quite some analysis we identified the
> switch as one of the problems, and a virus scan with "intrusion prevention"
> was deep scanning all the packets and therefore stalling network traffic. But
> I guess, you're not running Windows, are you?

Yes, as said above I would lean in this direction of investigation.

> What I am observing is best described as jittering or frame stuttering,
> happening about every 1 second. The frame rate will drop from 100s of FPS
> down to 1-2 very briefly and then recover, only to happen soon thereafter.
> This is not related to rendering complexity (it happens with very simple
> scenes and also with the various eq samples).
>
> I did some profiling on the AppNode driving the cluster in order to narrow
> down the source of the issue. I am noticing hotspots in
> co::LocalNode::_runReceiverThread (38.17% of all samples). In particular,
> there seems to be a bunch of time spent within co::LocalNode::_handleData
> (26.6% of all CPU time) and approximately 12.7% for the call to
> ConnectionSet::select() within the same function (_runReceiverThread). The
> second hotspot that I've noticed is in the ServerThread::run() function and
> more specifically in _cmdStartFrame() (roughly 25% of CPU time spent there).

I don't think basic profiling will find the issue. Can you somehow get a
profile when the stall happens, to see what is going on there?

The times look about right. The receiver thread plus _handleData is likely
your bottleneck, as Carsten said above. With 70 clients they get quite a bit
of load, but since you are running at >60 FPS normally I would not worry yet.

The startFrame is also not surprising. There has been no optimization done
on that code at all, since so far it has never shown up as a bottleneck. The
whole traversal and task generation code can however be optimized by caching
information, once the hotspots are identified.

> Meanwhile, I'd love to get people's input on the above!

I would really like to find out what the cause for the stall is. It seems
that normally you are fine, so a one-second pause probably has a special
cause, such as an external influence or an unusual code path.

HTH,

Stefan.
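
To capture a profile only while the stall is happening, as suggested above,
one option is to time every frame in the application loop and trigger on
outliers. The following is a minimal sketch under assumptions: StallDetector,
its window and factor parameters, and the millisecond unit are all
hypothetical and not part of Equalizer or Collage.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical helper (not an Equalizer or Collage class): flags frames whose
// duration is far above the median of the most recent normal frames.
class StallDetector
{
public:
    // window: number of recent frames used as the baseline (must be > 0)
    // factor: a frame is a stall if it takes longer than factor * median
    StallDetector( const std::size_t window, const double factor )
        : _window( window ), _factor( factor )
    {
        assert( window > 0 );
    }

    // Feed one frame duration in milliseconds. Returns true if the frame is
    // considered a stall. No verdict is given until the window is full.
    bool addFrame( const double ms )
    {
        const bool stall = _recent.size() == _window &&
                           ms > _factor * _median();
        if( !stall ) // keep stalls out of the baseline
        {
            _recent.push_back( ms );
            if( _recent.size() > _window )
                _recent.pop_front();
        }
        return stall;
    }

private:
    // Median of the current window (upper median for even sizes).
    double _median() const
    {
        std::vector< double > sorted( _recent.begin(), _recent.end( ));
        const std::size_t mid = sorted.size() / 2;
        std::nth_element( sorted.begin(), sorted.begin() + mid,
                          sorted.end( ));
        return sorted[ mid ];
    }

    std::size_t _window;
    double _factor;
    std::deque< double > _recent;
};
```

Calling addFrame() once per frame with the measured frame time, and dumping
thread stacks or starting a sampling profiler when it returns true, would tie
the profile to the stall rather than to the average case. The median is used
as the baseline so that a single slow frame does not shift the threshold.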

