Dear pth-users,

I am using pth for a scientific computing application. The basic idea is that I have several large blocks of floating point data; each thread operates on one block, then the threads exchange information, and then operations continue. I need this to perform as fast as possible. I had originally written my code using "matrix-based explicit dispatching between small units of execution" that I had implemented myself. This, however, was difficult to maintain and error-prone, so I switched to pth. The problem is that for cases in which the blocks of data are not very large, switching between threads takes up a significant portion of the computation time.

For example, if I have roughly 2 blocks of 1000 floats and do roughly 10 floating-point operations on each, using 1 pth thread per block before switching, the code takes roughly 3000 s, versus 1300 s with my own threading. What I would like to know is if there is any way to configure pth to make this faster. The basics of how I am using pth are as follows:

This is the block that creates the threads. thread_go is the function that manipulates the data, and &myGo(b) is a pointer to block b's data. This is an oversimplification, just to give the idea.

    pth_attr_set(attr, PTH_ATTR_JOINABLE, true);
    for (int b = 0; b < myblock; ++b)
        threads(b) = pth_spawn(attr, thread_go, static_cast<void *>(&myGo(b)));

    for (int b = 0; b < myblock; ++b)
        pth_join(threads(b), NULL);

    // END OF PROGRAM


The switching occurs when the threads call the routines "waitforslot" and "notify_change". These use a map from a unique integer per message to a boolean flag indicating whether that message has been received. As far as I can tell from profiling, this is not the slow part of the process.

        void waitforslot(int msgid, bool set) {
            pth_mutex_acquire(&list_mutex, false, NULL);

            // Sleep until another thread flips this message to the requested
            // state; pth_cond_await atomically releases and re-acquires
            // list_mutex around the wait.
            while (message_list[msgid] != set)
                pth_cond_await(&list_change, &list_mutex, NULL);

            pth_mutex_release(&list_mutex);
        }

        void notify_change(int msgid, bool set) {
            pth_mutex_acquire(&list_mutex, false, NULL);

            message_list[msgid] = set;
            pth_cond_notify(&list_change, true);  // broadcast to all waiters

            pth_mutex_release(&list_mutex);
        }


Lastly, I have profiled the code, and for this problem size a significant amount of the time is spent in the routine __pth_sched_eventmanager, with some spent in __pth_scheduler. If I break out the system routines separately, the system routine shandler is the main culprit. For this profile, pth was compiled with -O2. This is on PowerPC Mac OS X platforms (both G4 and G5).

I haven't really looked at the internals of pth, but I am wondering whether there are any changes I can make to speed things up. For this application signal handling is not important, so that may be one area where some overhead could be cut.

Thanks for your help,

Brian Helenbrook
Associate Professor
362 CAMP
Mech. and Aero. Eng. Dept.
Clarkson University
Potsdam, NY 13699-5725

P.S. I am not sure whether the list manager allows attachments, but I have taken a screen shot of the main hot spots in __pth_sched_eventmanager, found using the Mac profiling tool "Shark". If you are interested I can send it along.
