Dear pth-users,

I am using pth for a scientific computing application. The basic idea is that I have several large blocks of floating point data; each thread operates on one block, then the threads exchange information, and then operations continue. I need this to perform as fast as possible. I had originally written my code using "matrix-based explicit dispatching between small units of execution" that I had implemented myself. This, however, was difficult to maintain and error-prone, so I switched to pth. The problem is that for cases in which the blocks of data are not very large, switching between threads takes up a significant portion of the computation time.

For example, if I have roughly 2 blocks of 1000 floats and do roughly 10 floating-point operations on each, using 1 pth thread per block before switching, the code takes roughly 3000 s, versus 1300 s with my own threading. What I would like to know is if there is any way to configure pth to make this faster. The basics of how I am using pth are as follows:

This is the block that creates the threads. thread_go is the function that manipulates the data, and &myGo(b) is a pointer to block b's data. This is an oversimplification, just to give the idea.

    pth_attr_set(attr, PTH_ATTR_JOINABLE, true);
    for (int b = 0; b < myblock; ++b)
        threads(b) = pth_spawn(attr, thread_go, static_cast<void *>(&myGo(b)));

    for (int b = 0; b < myblock; ++b)
        pth_join(threads(b), NULL);

    // END OF PROGRAM


The switching occurs when the threads call the routines "waitforslot" and "notify_change". These use a map from a unique integer per message to a boolean flag indicating whether that message has been received. As far as I can tell from profiling, this is not the slow part of the process.

        void waitforslot(int msgid, bool set) {
            pth_mutex_acquire(&list_mutex, false, NULL);

            // Sleep until another thread flips this message to the requested
            // state; pth_cond_await atomically releases and re-acquires
            // list_mutex around the wait.
            while (message_list[msgid] != set)
                pth_cond_await(&list_change, &list_mutex, NULL);

            pth_mutex_release(&list_mutex);
        }

        void notify_change(int msgid, bool set) {
            pth_mutex_acquire(&list_mutex, false, NULL);

            message_list[msgid] = set;
            pth_cond_notify(&list_change, true);  // broadcast to all waiters

            pth_mutex_release(&list_mutex);
        }


Lastly, I have profiled the code, and for this problem size a significant amount of the time is spent in the routine __pth_sched_eventmanager, with some spent in __pth_scheduler. If I break out the system routines separately, the system routine shandler is the main culprit. For this profile, pth was compiled with -O2. This is on PowerPC Mac OS X platforms (both G4 and G5).

I haven't really looked at the internals of pth, but I am wondering whether there are any changes I can make to speed things up. For this application signal handling is not important, so that may be one area where some overhead could be cut.

Thanks for your help,

Brian Helenbrook
Associate Professor
362 CAMP
Mech. and Aero. Eng. Dept.
Clarkson University
Potsdam, NY 13699-5725

P.S. I am not sure whether the list manager allows attachments, but I have taken a screen shot of the main hot spots in __pth_sched_eventmanager, found using the Mac profiling tool "Shark". If you are interested I can send it along.
