Dear pth-users,
I am using pth for a scientific computing application. The basic
idea is that I have several large blocks of floating-point data;
each thread operates on its block, information is exchanged between
threads, and then the operations continue. I need this to run as
fast as possible. I had originally written my code using
"matrix-based explicit dispatching between small units of execution"
that I had implemented myself. That approach, however, was difficult
to maintain and error-prone, so I switched to pth. The problem is
that when the blocks of data are not very large, switching between
threads takes up a significant portion of the computation time.
For example, with roughly 2 blocks of 1000 floats and roughly 10
floating-point operations on each, using 1 pth thread per block
before switching, the code takes roughly 3000 s, versus 1300 s with
my own threading. What I would like to know is whether there is any
way to configure pth to make this faster. The basics of how I am
using pth are as follows:
This is the block that creates the threads: thread_go is the
function that will manipulate the data, and &myGo(b) is a pointer to
the data for block b. This is an oversimplification, just to give
the idea.
pth_attr_set(attr, PTH_ATTR_JOINABLE, true);
for (int b = 0; b < myblock; ++b) {
    threads(b) = pth_spawn(attr, thread_go,
                           static_cast<void *>(&myGo(b)));
}
for (int b = 0; b < myblock; ++b)
    pth_join(threads(b), NULL);
// END OF PROGRAM
The switching occurs when the threads call the routines
"waitforslot" and "notify_change". These use a map from a unique
integer ID for each message to a boolean indicating whether the
message has been received. As far as I can tell from profiling,
this is not the slow part of the process.
void waitforslot(int msgid, bool set) {
    pth_mutex_acquire(&list_mutex, false, NULL);
    while (message_list[msgid] != set) {
        pth_cond_await(&list_change, &list_mutex, NULL);
    }
    pth_mutex_release(&list_mutex);
}
void notify_change(int msgid, bool set) {
    pth_mutex_acquire(&list_mutex, false, NULL);
    message_list[msgid] = set;
    pth_cond_notify(&list_change, true);
    pth_mutex_release(&list_mutex);
}
Lastly, I have profiled the code, and for this problem size a
significant amount of the time is spent in the routine
__pth_sched_eventmanager, with some spent in __pth_scheduler. If I
break out the system routines separately, then the system routine
"shandler" is the main culprit. When I ran this profile, pth was
compiled with -O2. This is on PowerPC Mac OS X platforms (both G4
and G5).
I haven't really looked at the details of pth, but what I am
wondering is whether there are any changes I can make to speed
things up. For this application, signal handling is not important,
so that may be one area where there is some room to gain.
Thanks for your help,
Brian Helenbrook
Associate Professor
362 CAMP
Mech. and Aero. Eng. Dept.
Clarkson University
Potsdam, NY 13699-5725
P.S. I am not sure whether the list manager allows attachments, but
I have taken a screenshot of the main hot spots in
__pth_sched_eventmanager, found using the Mac profiling tool
"Shark". If you are interested, I can send it along.