Hi Dylan,
> Hi, Rafael,
>
> Could you provide the output when you run with `QT_INFO=1` set in the
> environment? Assuming you're using hwloc, could you also provide the
> output of `lstopo -v`?
Thanks, yes, I'm using hwloc; I've put the output of `lstopo` and `lstopo -v`
at the end.
The output when running with `QT_INFO=1` is:
QTHREADS: Using 16 Shepherds
QTHREADS: Using 1 Workers per Shepherd
QTHREADS: Guard Pages Enabled
QTHREADS: Using 8388608 byte stack size.
I have debugged it a little more, in case you find it relevant.
By the way, I don't mind that `sleep` consumes CPU, since I don't use it in my
real code.
If instead of the simplest example we use:
use Time;
cobegin {
  sleep(40);
  sleep(40);
}
Then, when compiled with gasnet, it uses 400% CPU (4 cores): three of them
(cores 0, 1, and 2) at 100% user time, and the fourth (core 8) at about 35%
user time and 65% system time.
Core 0 is executing the nemesis task scheduler:
#0 0x000000000052fef3 in qt_internal_NEMESIS_dequeue ()
#1 0x000000000053035f in qt_scheduler_get_thread ()
#2 0x000000000052a9d1 in qthread_master ()
#3 0x0000000000000000 in ?? ()
The thread with the 35% user / 65% system split is running on core 8 (the
first core of socket 1):
#0 0x00007fb52da0b7f7 in sched_yield () from /lib64/libc.so.6
#1 0x000000000052490c in chpl_task_yield ()
#2 0x0000000000459c6f in polling ()
#3 0x00007fb52e47a806 in start_thread () from /lib64/libpthread.so.0
#4 0x00007fb52da23e8d in clone () from /lib64/libc.so.6
#5 0x0000000000000000 in ?? ()
The two sleeps, on cores 1 and 2, each execute the sleep plus a yield; both
show the same backtrace:
#0 0x0000000000534e0d in qt_swapctxt ()
#1 0x000000000052e628 in qthread_back_to_master ()
#2 0x000000000052d5f1 in qthread_yield_ ()
#3 0x0000000000525898 in chpl_task_sleep ()
#4 0x000000000043404a in cobegin_fn_chpl2 (_cobeginCount_chpl=0x7fb25badbf00)
at sleep.chpl:4
#5 0x000000000043411c in wrapcobegin_fn_chpl2 (c_chpl=0x7fb29817f8d0) at
sleep.chpl:2
#6 0x0000000000525582 in chapel_wrapper ()
#7 0x000000000052d45c in qthread_wrapper ()
#8 0x0000000000000000 in ?? ()
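In case it helps anyone reproduce this, per-thread backtraces like the ones
above can be gathered by attaching gdb to the running process and dumping all
threads (the PID is just whatever `ps` reports for the compiled binary):

```shell
# Attach to the already-running process, print a backtrace for every
# thread, then detach. $PID is a placeholder for the process id.
gdb -batch -p "$PID" -ex 'thread apply all bt'
```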
When compiled without gasnet, the program consumes only 3 cores (0, 1, and 2)
at 100% user time each, and the thread on core 8 does not appear.
The problem in my real program is that this fourth thread unbalances the work
done by the other threads, making the program run slower, instead of faster,
when more cores are added.
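For reference, the per-core usage numbers above can be reproduced by listing
the threads of the running process, e.g. with `ps` (the binary name
`sleep_test` below is just a placeholder for whatever the reproducer was
compiled to):

```shell
# List the threads of the running program: LWP is the thread id, PSR the
# core each thread last ran on, %CPU its usage. pgrep -n picks the newest
# process whose name matches the (placeholder) binary name.
ps -L -p "$(pgrep -n sleep_test)" -o lwp,psr,pcpu,stat,comm
```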
Thank you very much,
Rafael
Output of lstopo:
Machine (64GB total)
NUMANode L#0 (P#0 32GB) + Socket L#0 + L3 L#0 (20MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
And lstopo -v :
Machine (P#0 total=67073228KB DMIProductName="ProLiant SL230s Gen8 "
DMIProductVersion= DMIBoardVendor=HP DMIBoardName= DMIBoardVersion=
DMIBoardAssetTag=" " DMIChassisVendor=HP DMIChassisType=25
DMIChassisVersion= DMIChassisAssetTag=" " DMIBIOSVendor=HP
DMIBIOSVersion=P75 DMIBIOSDate=02/10/2014 DMISysVendor=HP Backend=Linux
OSName=Linux OSRelease=3.0.101-0.35-default OSVersion="#1 SMP Wed Jul 9
11:43:04 UTC 2014 (c36987d)" HostName=cn06 Architecture=x86_64
hwlocVersion=1.10.1)
NUMANode L#0 (P#0 local=33518800KB total=33518800KB)
Socket L#0 (P#0 CPUVendor=GenuineIntel CPUFamilyNumber=6 CPUModelNumber=45
CPUModel="Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz")
L3Cache L#0 (size=20480KB linesize=64 ways=20)
L2Cache L#0 (size=256KB linesize=64 ways=8)
L1dCache L#0 (size=32KB linesize=64 ways=8)
L1iCache L#0 (size=32KB linesize=64 ways=8)
Core L#0 (P#0)
PU L#0 (P#0)
L2Cache L#1 (size=256KB linesize=64 ways=8)
L1dCache L#1 (size=32KB linesize=64 ways=8)
L1iCache L#1 (size=32KB linesize=64 ways=8)
Core L#1 (P#1)
PU L#1 (P#1)
L2Cache L#2 (size=256KB linesize=64 ways=8)
L1dCache L#2 (size=32KB linesize=64 ways=8)
L1iCache L#2 (size=32KB linesize=64 ways=8)
Core L#2 (P#2)
PU L#2 (P#2)
L2Cache L#3 (size=256KB linesize=64 ways=8)
L1dCache L#3 (size=32KB linesize=64 ways=8)
L1iCache L#3 (size=32KB linesize=64 ways=8)
Core L#3 (P#3)
PU L#3 (P#3)
L2Cache L#4 (size=256KB linesize=64 ways=8)
L1dCache L#4 (size=32KB linesize=64 ways=8)
L1iCache L#4 (size=32KB linesize=64 ways=8)
Core L#4 (P#4)
PU L#4 (P#4)
L2Cache L#5 (size=256KB linesize=64 ways=8)
L1dCache L#5 (size=32KB linesize=64 ways=8)
L1iCache L#5 (size=32KB linesize=64 ways=8)
Core L#5 (P#5)
PU L#5 (P#5)
L2Cache L#6 (size=256KB linesize=64 ways=8)
L1dCache L#6 (size=32KB linesize=64 ways=8)
L1iCache L#6 (size=32KB linesize=64 ways=8)
Core L#6 (P#6)
PU L#6 (P#6)
L2Cache L#7 (size=256KB linesize=64 ways=8)
L1dCache L#7 (size=32KB linesize=64 ways=8)
L1iCache L#7 (size=32KB linesize=64 ways=8)
Core L#7 (P#7)
PU L#7 (P#7)
NUMANode L#1 (P#1 local=33554428KB total=33554428KB)
Socket L#1 (P#1 CPUVendor=GenuineIntel CPUFamilyNumber=6 CPUModelNumber=45
CPUModel="Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz")
L3Cache L#1 (size=20480KB linesize=64 ways=20)
L2Cache L#8 (size=256KB linesize=64 ways=8)
L1dCache L#8 (size=32KB linesize=64 ways=8)
L1iCache L#8 (size=32KB linesize=64 ways=8)
Core L#8 (P#0)
PU L#8 (P#8)
L2Cache L#9 (size=256KB linesize=64 ways=8)
L1dCache L#9 (size=32KB linesize=64 ways=8)
L1iCache L#9 (size=32KB linesize=64 ways=8)
Core L#9 (P#1)
PU L#9 (P#9)
L2Cache L#10 (size=256KB linesize=64 ways=8)
L1dCache L#10 (size=32KB linesize=64 ways=8)
L1iCache L#10 (size=32KB linesize=64 ways=8)
Core L#10 (P#2)
PU L#10 (P#10)
L2Cache L#11 (size=256KB linesize=64 ways=8)
L1dCache L#11 (size=32KB linesize=64 ways=8)
L1iCache L#11 (size=32KB linesize=64 ways=8)
Core L#11 (P#3)
PU L#11 (P#11)
L2Cache L#12 (size=256KB linesize=64 ways=8)
L1dCache L#12 (size=32KB linesize=64 ways=8)
L1iCache L#12 (size=32KB linesize=64 ways=8)
Core L#12 (P#4)
PU L#12 (P#12)
L2Cache L#13 (size=256KB linesize=64 ways=8)
L1dCache L#13 (size=32KB linesize=64 ways=8)
L1iCache L#13 (size=32KB linesize=64 ways=8)
Core L#13 (P#5)
PU L#13 (P#13)
L2Cache L#14 (size=256KB linesize=64 ways=8)
L1dCache L#14 (size=32KB linesize=64 ways=8)
L1iCache L#14 (size=32KB linesize=64 ways=8)
Core L#14 (P#6)
PU L#14 (P#14)
L2Cache L#15 (size=256KB linesize=64 ways=8)
L1dCache L#15 (size=32KB linesize=64 ways=8)
L1iCache L#15 (size=32KB linesize=64 ways=8)
Core L#15 (P#7)
PU L#15 (P#15)
depth 0: 1 Machine (type #1)
depth 1: 2 NUMANode (type #2)
depth 2: 2 Socket (type #3)
depth 3: 2 L3Cache (type #4)
depth 4: 16 L2Cache (type #4)
depth 5: 16 L1dCache (type #4)
depth 6: 16 L1iCache (type #4)
depth 7: 16 Core (type #5)
depth 8: 16 PU (type #6)
relative latency matrix between NUMANodes (depth 1) by logical indexes:
index 0 1
0 1,000 2,000
1 2,000 1,000
> On 5/31/15, 10:53 AM, "Rafael Larrosa Jiménez" <[email protected]> wrote:
> >Hi,
> >
> >I have found a problem while using qthreads with gasnet. To make it
> >simpler, I will consider only one locale, but when more are used the
> >problem is exactly the same.
> >
> >The platform is a cluster of locales (each with 16 cores and 64 GBytes of
> >RAM). More precisely, each locale has two sockets with 8 cores each and
> >Hyper-Threading has been disabled.
> >
> >The Chapel program has just these two lines:
> >----
> >use Time;
> >sleep(50);
> >----
> >
> >When compiled with gasnet, two threads use 100% of CPU while the sleep is
> >executed, one of them using about 34.9% user, 62.5% sys, and the other
> >100% user. I ended up checking this simple Chapel code because for another
> >(more realistic) application I obtained some scalability when moving from
> >1 to 8 threads per locale, but execution time increases if I use more than
> >8 threads, although, as I said, the locale has 16 cores.
> >
> >Any thoughts?
> >
> >Thank you very much in advance,
> >
> >Rafael
--
Rafael Larrosa Jiménez
Centro de Supercomputación y Bioinformática - http://www.scbi.uma.es
Universidad de Málaga
EMAIL: [email protected] Edificio de Bioinnovación
TELEF: + 34951952788 C/ Severo Ochoa 34
FAX : +34951952792 Parque Tecnológico de Andalucía
29590 Málaga (SPAIN)
------------------------------------------------------------------------------
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers