[EMAIL PROTECTED] wrote on Wed, 31 May 2006 15:57 -0500:
> Ok Pete. Here ya go.
>
> tcp with syncmeta + syncdata = pvfs2-server.log.tcp
> tcp without syncmeta + syncdata = pvfs2-server.log.tcp_nosync
> ib without syncmeta + syncdata = pvfs2-server.log.slowib_nosync
>
> As expected, the nosync parameter didn't seem to make any noticeable
> difference. TCP is still much faster. In addition, just for grins I
> kept an xterm open with "top" running during the creation of those
> files. When doing the ib testing (both with the syncs enabled and
> disabled) the pvfs2-server process jumped from taking up 0% cpu to 20%
> cpu as soon as the pvfs2-mkdir command was issued and stayed that way
> for a couple of top refreshes after the command completed. When doing
> the tcp testing the pvfs2-server process never got high enough for me to
> even see it in top's output.
Thanks for the data. It's interesting, but I'm not sure if I have
any direct remedy for you. Here's a summary of the times from mkdir
start through crdirent end.
  110 ms   tcp with syncmeta + syncdata
    0 ms   tcp without syncmeta + syncdata
  760 ms   ib with syncmeta + syncdata
  120 ms   ib without syncmeta + syncdata
All values are +/- 10 ms due to the apparent clock granularity on
your machine.
There is clearly something up. Even without the sync, IB has a
few pokey delays; 90 ms before dbpf_dspace_create_op_svc is the
largest one. One interesting thing that happens between those two
lines is a wakeup of the dbpf thread.
With that, the previous observation that db SYNC was taking a long
time, and the high CPU usage you noticed in "top", I'm thinking that
the IB device polling thread is starving the other threads on your
system. Why this happens only on your machine, I don't know.
But, I was thinking about going to an event-based rather than
polling interface to see what the performance tradeoffs are. Each
network operation will consume some tens of microseconds for the
interrupt and wakeup, but there will be no polling overhead. It
will take a while to get this to work, though.
If you want to try a couple of patches, just for kicks, we can maybe
hack around this alleged starvation problem. First, we can toss in
a sched_yield() and see if it magically gets the kernel to make the
dbpf thread run instead. Near the bottom of the function
BMI_ib_testcontext() in src/io/bmi/bmi_ib/ib.c, there's this code:
    *outcount = n;
    if (n > 0) {
        gettimeofday(&last_action, 0);
    } else if (max_idle_time > 0) {
        /*
         * Block for up to max_idle_time to avoid spinning from BMI. In the
         * server, instead of sleeping, watch the accept socket for something
         * new. No way to blockingly poll in standard VAPI.
         */
        if (ib_device->listen_sock >= 0) {
            struct timeval now;
            gettimeofday(&now, 0);
            now.tv_sec -= last_action.tv_sec;
            if (now.tv_sec == 1) {
                now.tv_usec -= last_action.tv_usec;
                if (now.tv_usec < 0)
                    --now.tv_sec;
            }
            if (now.tv_sec > 0)  /* spin for 1 sec following any activity */
                if (ib_tcp_server_block_new_connections(max_idle_time))
                    gettimeofday(&last_action, 0);
        }
    }
    return 0;
Change the last part of that to:
    ...
    if (now.tv_sec > 0) {  /* spin for 1 sec following any activity */
        if (ib_tcp_server_block_new_connections(max_idle_time))
            gettimeofday(&last_action, 0);
    } else {
        sched_yield();
    }
    ...
Doesn't affect anything here, but maybe magic happens on your end.
Second, we can disable polling in IB and always block for whatever
BMI asked, usually 10 ms. Actually just for 1 ms since 10 ms is a
bit long for data streaming or lots of metadata ops. Back out the
other change, and put in this instead of the whole big block quoted
above.
    *outcount = n;
    if (n == 0 && max_idle_time > 0 && ib_device->listen_sock >= 0)
        ib_tcp_server_block_new_connections(1);  /* only 1 ms */
    return 0;
This second change makes things very slow here, but they run to
completion.
If either of these changes the slowib_nosync trace (and the slowib
with sync trace), we'll know this is definitely the problem. If you
find anything different, send me more traces. I'm getting pretty
good at reading them at this point. :)
I'm not giving you diffs because there was a big checkin recently
that messed up all the line numbers.
-- Pete
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers