Hi all and Eddie.

> I've tried running some jobs on a MacPro with Dual quad core processors
> which are hyperthreaded (the OS recognizes 16 cores, also, I'm still using
> Trilinos 9.0.3 and FiPy 2.02). I typically see pretty linear scaling when
> running jobs up to the number of physical cores, but I see no improvement,
> or even an increase in total job time if I go above that limit (e.g. 8 cores
> is ~half the time of 4, but 16 is no better or perhaps worse than 8) for a
> 256x256 grid.
Does "pretty linear" mean ~8 times faster when you run with "mpirun -np 8", i.e. that the efficiency of each real core is close to 100%?

> For larger systems (1024x1024) I've seen pretty good performance on our
> cluster at 64 or even 128 cores (with 8 cores per node and no hyperthreading
> there). Here, the performance improvement seems to drop off more slowly
> with an increase in cores compared to the transition from linear to no
> improvement when attempting to use the hyperthreaded cores on our desktop
> machines.

I haven't tried running on a cluster yet, but what you are saying suggests that communication over anything other than the motherboard does not really affect your problem, i.e. your problem doesn't involve much messaging.

I'm starting to doubt now whether I have 8 real cores or 4 hyperthreaded ones. I just took the number of cores from the top command. I'll find out, but in my case 8 does give an overall performance increase compared to 4.

> Based on your experience, it sounds like the aspect ratio of the system can
> be pretty important which is interesting as my problem will probably require
> non-square domains. I can let you know when I get around to testing this
> out, and thanks for sharing!

Actually, in my case the cells were becoming *more square* as I increased the cell count in one dimension, so I have non-square cells. After that post I decided to make "square cells" by applying some rescalings, which are actually very inconvenient in my specific problem, and repeated some tests. This showed no increase in performance efficiency; on the contrary, I couldn't get more than 50-60% efficiency per core, because keeping the cells square forced me to also increase the dimension along which the communication is done. So cell geometry doesn't seem to play a huge role. But to be sure, I'd like to ask the developers here whether FiPy behaves differently, and whether the solution is strongly affected, when one side of a cell is, say, 10 times the other.
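On the real-vs-hyperthreaded question: top only reports logical cores, so hyperthreading can double the number you see. A rough sketch of how one might check (the sysctl key is macOS-specific, the /proc/cpuinfo parsing assumes Linux; neither is FiPy-related, and both are best-effort guesses):

```python
import os
import subprocess

def physical_core_count():
    """Best-effort count of physical (non-hyperthreaded) cores."""
    # macOS: sysctl reports the physical core count directly.
    try:
        out = subprocess.check_output(["sysctl", "-n", "hw.physicalcpu"])
        return int(out)
    except (OSError, subprocess.CalledProcessError, ValueError):
        pass
    # Linux: count unique (physical id, core id) pairs in /proc/cpuinfo;
    # hyperthread siblings share the same pair, so they collapse to one entry.
    try:
        cores = set()
        phys = None
        for line in open("/proc/cpuinfo"):
            if line.startswith("physical id"):
                phys = line.split(":")[1].strip()
            elif line.startswith("core id"):
                cores.add((phys, line.split(":")[1].strip()))
        if cores:
            return len(cores)
    except IOError:
        pass
    # Fallback: logical cores (may include hyperthreads).
    return os.cpu_count() or 1

print("physical cores (best effort):", physical_core_count())
print("logical cores:", os.cpu_count())
```

If the two numbers differ by a factor of two, that would explain scaling stopping at half the count top shows.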
All in all, I now think that the performance efficiency strongly depends on the problem type. I have quite a complicated problem with several convection terms, sources, and also a diffusion term, so communicating information between the nodes will always take noticeable time compared to the calculations.

Eddie, did you run your benchmark with a problem you are solving, or with an "example problem" from the manual? If it is one from (or similar to) the manual, could you share the script or say which example problem you used? We could then run similar benchmarks to finally understand what governs the performance efficiency. If we don't get the same results, that would point to problems on our side.

Thanks to all!

Cheers,
Igor.
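For comparing such benchmarks, this is what I mean by efficiency per core; a small helper like this (plain Python, nothing FiPy-specific, and the timings in the example are made up) turns wall-clock times into speedup and per-core efficiency:

```python
def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the serial one."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_cores):
    """Fraction of ideal linear scaling achieved, per core.

    1.0 means perfectly linear scaling; 0.5 means each core is
    effectively doing half the work it would under ideal scaling.
    """
    return speedup(t_serial, t_parallel) / n_cores

# Hypothetical example: 800 s serially, 110 s on 8 cores.
t1, t8 = 800.0, 110.0
print("speedup on 8 cores: %.2fx" % speedup(t1, t8))                  # 7.27x
print("per-core efficiency: %.0f%%" % (100 * efficiency(t1, t8, 8)))  # 91%
```

If we all report (grid size, core count, wall time) triples from the same script, these two numbers should make the comparison unambiguous.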
