On 14/09/2012 07:48, Siegmar Gross wrote:
> I have installed hwloc-1.5 on our systems and get the following output
> when I run "lstopo" on a Sun Server M4000 (two quad-core processors with
> two hardware threads each).
>
> rs0 fd1026 101 lstopo
> Machine (32GB) + NUMANode L#0 (P#1 32GB)
>   Socket L#0
>     Core L#0
>       PU L#0 (P#0)
>       PU L#1 (P#1)
>     Core L#1
>       PU L#2 (P#2)
>       PU L#3 (P#3)
>     Core L#2
>       PU L#4 (P#4)
>       PU L#5 (P#5)
>     Core L#3
>       PU L#6 (P#6)
>       PU L#7 (P#7)
>   Socket L#1
>     Core L#4
>       PU L#8 (P#8)
>       PU L#9 (P#9)
>     Core L#5
>       PU L#10 (P#10)
>       PU L#11 (P#11)
>     Core L#6
>       PU L#12 (P#12)
>       PU L#13 (P#13)
>     Core L#7
>       PU L#14 (P#14)
>       PU L#15 (P#15)
>
> When I run the command on a Sun Ultra 45 with two single-core processors
> I get the following output.
>
> tyr fd1026 116 lstopo
> Machine (4096MB)
>   NUMANode L#0 (P#2 2048MB) + Socket L#0 + Core L#0 + PU L#0 (P#0)
>   NUMANode L#1 (P#1 2048MB) + Socket L#1 + Core L#1 + PU L#1 (P#1)
>
> First question: Why does "lstopo" report two NUMA nodes on the Sun Ultra
> and only one NUMA node on the M4000, although both machines are equipped
> with two processors and both are running Solaris 10?
Depending on the architecture, you may have one NUMA node containing multiple processor sockets (old x86 machines, for instance), one NUMA node per socket (many modern processors), or even multiple NUMA nodes per socket (some AMD processors). I am not familiar enough with Sparc processors to compare, but I would bet that some fall into the first and second models too. Google has some links to a patch adding NUMA support for the Ultra 45 in OpenSolaris, so the second output looks correct. And people say that the lgroup utility confirms that the M4000 is not NUMA (which means the first output would be right as well).

> I get the following error when I try to bind a process to a core
> on the M4000 machine.
>
> rs0 fd1026 104 hwloc-bind socket:0.core:0 -l date
> hwloc_set_cpubind 0x00000003 failed (errno 18 Cross-device link)
> Fri Sep 14 07:37:14 CEST 2012
>
> I can use the following command, which works for all 16 hardware threads.
>
> rs0 fd1026 105 hwloc-bind pu:0 -l date
> Fri Sep 14 07:38:37 CEST 2012

On Solaris, you can't bind to an entire core if it contains multiple threads. You have to bind to a single thread (a PU). When each core contains a single thread, you're lucky :)

> Second question: How can I find out which bindings are allowed when
> I know the output from "lstopo"? I have no idea why I get "errno 18
> Cross-device link" on the M4000.

That's something we need to think about. We were aware of the limitation, but we haven't really thought about making the user aware of it yet. We have a function that returns some information about what hwloc supports on the current platform. It could be extended. But to be feature complete, we would need to be able to say:
1) binding works for random sets of objects (even objects of different kinds)
2) binding works for a single object of a given type
3) binding works for random sets of objects of the same type
Solaris always has (2) with type=PU (or type=Core if each core has a single PU) and optionally has (3) for NUMA nodes.
Another solution would be to document that this specific errno means you should try binding to something smaller, likely a PU (those are always supported whenever binding is supported at all).

Keep in mind that we recommend running hwloc_bitmap_singlify() before binding. This avoids problems with tasks moving from one PU to another inside the whole binding set.

The drawback of singlify (or of binding to something smaller on failure) is that you have to distribute tasks manually if several of them want the same binding: two tasks bound to a whole dual-thread single core will be distributed well by the OS, but two tasks each bound to a single thread within that core require you to make sure they are not bound to the same thread.

Brice