Hi all,

On 9. Jul 2014, at 10:48, "ROHN Carsten [via Software]" 
<[email protected]> wrote:

> From your description, I think you're getting a problem with one of the major 
> bottlenecks of collage: reading from many connections happens in one thread. 
> We see this issue mostly in HPC context, where we have up to 150 nodes 
> rendering for one master. One thread is just too slow to handle so many 
> connections. 

I'm not sure. Since it stalls heavily only every now and then, I would lean 
towards a different problem that periodically interrupts the otherwise normal 
processing at around 100 FPS, something akin to garbage collection in Java.
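To confirm this hypothesis, it helps to record per-frame timings and log the timestamps of the outliers, so the stall period can be checked for regularity and correlated with external events. A minimal sketch (illustrative only, not a Collage API):

```cpp
#include <cassert>
#include <vector>

// Flag frames whose duration is an order of magnitude above the smoothed
// baseline. The timestamps of flagged frames reveal whether the stalls
// are periodic (~1 s apart), which points to an external cause.
struct StallDetector
{
    std::vector< double > flagged; // timestamps (ms) of stalled frames
    double emaMs = 10.0;           // smoothed "normal" frame time

    void addFrame( const double timestampMs, const double durationMs )
    {
        if( durationMs > 20.0 * emaMs )         // extreme outlier: a stall
            flagged.push_back( timestampMs );
        else                                    // update baseline estimate
            emaMs = 0.9 * emaMs + 0.1 * durationMs;
    }
};
```

Feeding it one 500 ms frame among steady 10 ms frames flags exactly that frame; the gaps between flagged timestamps then give the stall period directly.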


> We even have the effect of connections "starving", because the reader doesn't 
> even have the chance to select the connections in the "back" of the 
> connection set. This is another thing to be done for Collage: handle 
> connections fairly on the reader side, to avoid the starving effect.

This is implemented on Linux, but not yet on Windows (see 
https://github.com/Eyescale/Collage/issues/38).
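The idea behind that fix can be sketched as follows (names are illustrative, not the actual Collage implementation): instead of always scanning the connection set from the front after select(), start each round just past the connection served last time, so connections at the "back" of the set cannot starve.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round-robin selection over a fixed set of connections: rotate the
// scan start point so every ready connection is eventually served.
class FairSelector
{
public:
    explicit FairSelector( const size_t nConnections )
        : _size( nConnections )
        , _last( nConnections ? nConnections - 1 : 0 ) {}

    // ready[i] is true if connection i has pending data; returns the
    // index to service next, or -1 if nothing is ready.
    int select( const std::vector< bool >& ready )
    {
        for( size_t i = 0; i < _size; ++i )
        {
            const size_t index = ( _last + 1 + i ) % _size;
            if( ready[index] )
            {
                _last = index;
                return static_cast< int >( index );
            }
        }
        return -1;
    }

private:
    size_t _size;
    size_t _last; // index served in the previous round
};
```

With all connections permanently ready, this serves 0, 1, 2, 3, 0, ... in turn, whereas a front-to-front scan would serve index 0 forever.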

> This topic is very interesting for us as well, please keep me updated. 

Indeed :)

> P.S.: Maybe you're not even having a software issue. We had a similar problem 
> at a customer (twice by now). After quite some analysis we identified the 
> switch as one of the problems, and a virus scan with "intrusion prevention" 
> was deep scanning all the packets and therefore stalling network traffic. But 
> I guess you're not running Windows, are you? 

Yes, as said above, I would lean towards this direction of investigation.

> What I am observing is best described as jittering or frame stuttering, 
> happening about every 1 second. The frame rate will drop from 100s of FPS 
> down to 1-2 very briefly and then recover, only to happen again soon thereafter. 
> This is not related to rendering complexity (it happens with very simple 
> scenes and also with the various eq samples). 
> 
> I did some profiling on the AppNode driving the cluster in order to narrow 
> down the source of the issue. I am noticing hotspots in 
> co::LocalNode::_runReceiverThread (38.17% of all samples). In particular, 
> there seems to be a bunch of time spent within co::LocalNode::_handleData 
> (26.6% of all CPU time) and approximately 12.7% for the call to 
> ConnectionSet::select() within the same function (_runReceiverThread). The 
> second hotspot that I've noticed is in the ServerThread::run() function and 
> more specifically in _cmdStartFrame() (roughly 25% of CPU time spent there).

I don't think basic profiling will find the issue. Can you somehow get a 
profile when the stall happens, to see what is going on there?
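One way to catch the stall in the act is a watchdog thread: the render loop bumps a heartbeat every frame, and if the heartbeat stops moving for longer than a threshold, a callback fires while the stall is still in progress, where you could dump stacks or start sampling. A hedged sketch, not part of Equalizer/Collage:

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

// Fires onStall while the main loop is stuck, i.e. when heartbeat()
// has not been called for at least 'timeout'.
class StallWatchdog
{
public:
    StallWatchdog( const std::chrono::milliseconds timeout,
                   std::function< void() > onStall )
        : _timeout( timeout )
        , _onStall( std::move( onStall ))
        , _thread( [this] { _run(); } ) {}

    ~StallWatchdog() { _stop = true; _thread.join(); }

    void heartbeat() { ++_beat; } // call once per frame

private:
    void _run()
    {
        uint64_t last = _beat.load();
        auto lastChange = std::chrono::steady_clock::now();
        while( !_stop )
        {
            std::this_thread::sleep_for( _timeout / 4 );
            const uint64_t now = _beat.load();
            if( now != last )
            {
                last = now;
                lastChange = std::chrono::steady_clock::now();
            }
            else if( std::chrono::steady_clock::now() - lastChange >= _timeout )
            {
                _onStall(); // stall in progress: sample stacks here
                lastChange = std::chrono::steady_clock::now();
            }
        }
    }

    const std::chrono::milliseconds _timeout;
    std::function< void() > _onStall;
    std::atomic< bool > _stop{ false };
    std::atomic< uint64_t > _beat{ 0 };
    std::thread _thread;
};
```

The callback runs on the watchdog thread, so a real handler should stick to async-signal-safe-ish work such as raising a profiling signal or setting a flag, rather than heavy logging.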

The times look about right. The receiver thread plus handle data is likely your 
bottleneck, as Carsten said above. With 70 clients it carries quite a bit of 
load, but since you are normally running at over 60 FPS I would not worry yet.

The startFrame cost is also not surprising. There has been no optimization done 
on that code at all, since so far it has never shown up as a bottleneck. The 
whole traversal and task generation code can, however, be optimized by caching 
information once the hotspots are identified.
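The caching idea can be sketched like this (illustrative only, not Equalizer's actual task-generation code): if the traversal produces the same task list whenever the configuration is unchanged, the result can be memoized and only regenerated when the key changes.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Memoizes an expensive per-frame task generation step. The key would
// encode whatever state the traversal depends on (layout, channels, ...).
class TaskCache
{
public:
    const std::vector< std::string >& tasks( const std::string& key )
    {
        auto it = _cache.find( key );
        if( it == _cache.end( ))
        {
            ++generations; // expensive path actually taken
            it = _cache.emplace( key, _generate( key )).first;
        }
        return it->second;
    }

    int generations = 0; // counts cache misses

private:
    std::vector< std::string > _generate( const std::string& key )
    {
        // stand-in for the expensive compound traversal
        return { "clear:" + key, "draw:" + key, "assemble:" + key };
    }

    std::map< std::string, std::vector< std::string > > _cache;
};
```

The per-frame cost then drops to a map lookup in the common case; the hard part in practice is invalidation, i.e. making sure the key really captures everything the traversal depends on.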

> Meanwhile, I'd love to get people's input on the above! 

I would really like to find out what the cause of the stall is. Normally you 
seem to be fine, so a one-second pause likely has a specific cause, such as an 
external influence or an unusual code path.


HTH,

Stefan.







Sent from the Equalizer - Parallel Rendering mailing list archive at Nabble.com.

_______________________________________________
eq-dev mailing list
[email protected]
http://www.equalizergraphics.com/cgi-bin/mailman/listinfo/eq-dev
http://www.equalizergraphics.com
