Just to follow up on Lee's mail.  He and I have been looking at logs
from his machine, where even with just one server and one client
things are much too slow.  pvfs2-mkdir is one example that takes way
too long, on the order of 1--2 seconds.  The trace he attached here
from the server shows a mkdir that takes 0.77 sec.  These delays
only show up using the IB interface---with TCP things appear to
proceed at normal speed.

I cannot figure out what is going on with IB to cause this, so was
hoping someone else might recognize what might be happening.  There
are burps in the trace that are always exact multiples of 10 ms.
There are 10 ms timeouts specified in job.c, but I'm grasping at why
that would be related.  I haven't ruled out that the clock is just
super granular on these machines for some reason.

Here's one 10 ms delay between two PINT_ACCESS_DEBUG log entries:

[EMAIL PROTECTED] wrote on Thu, 25 May 2006 14:52 -0500:
> [D 14:46:05.663257] (0x81aba0) getconfig (prelude sm) state: req_sched
> [A 14:46:05.663257] [EMAIL PROTECTED]:53361 H=0 S=0x81aba0: getconfig: request
> [A 14:46:05.673257] [EMAIL PROTECTED]:53361 H=0 S=0x81aba0: getconfig: start
> [D 14:46:05.673257] (0x81aba0) getconfig (prelude sm) state: getattr_if_needed

These entries come from prelude_req_sched and
prelude_getattr_if_needed, adjacent steps in the prelude state
machine.  Between those two operations is only:
    BMI_set_info:  some locks and an increment
    job_req_sched_post:  should return immediately, no?

Everything proceeds zippily until later:

> [D 14:46:05.673257] (0x81cee0) mkdir state: create
> [D 14:46:05.673257]  creating dspace on coll_id 1653737323
> [D 14:46:05.673257]  directory handle will be in starting hrange[0] 
> 4-2147483650
>[ D 14:46:05.673257] [1 extents] -- new_handle is 1055489 (cur_extent is 4 - 
>2147483650)
> [D 14:46:05.673257] dbpf_attr_cache_insert: inserting 1055489 (k_size is 0 | 
> b_size is 0)
> [D 14:46:05.763259] db SYNC called servicing op type DSPACE_CREATE
> [D 14:46:05.763259] (0x81cee0) mkdir state: setattrib

There's one function call in here: job_trove_dspace_create.  How could it
possibly take 90 ms?

Some more:

> [D 14:46:05.763259]  *** new directory has handle 1055489
> [D 14:46:05.763259]  setting directory version to 1148586365
>       mtime is 0
> [D 14:46:05.763259] dbpf_cache_elem_lookup: cache elem matching 1055489 
> returned (num_elems=2)
> [D 14:46:05.763259] Updating cached attributes for key 1055489
> [D 14:46:05.833261] db SYNC called servicing op type DSPACE_SETATTR
> [D 14:46:05.833261] (0x81cee0) mkdir state: create_dirdata_dspace

70 ms.

Here's a big one:

> [D 14:46:05.833261] dbpf_attr_cache_insert: inserting 1055485 (k_size is 0 | 
> b_size is 0)
> [D 14:46:05.933263] db SYNC called servicing op type DSPACE_CREATE

100 ms.

There's a few more of these down in dbpf land, which I don't grok
too well.  You'll see more entries down in the crdirent call which
does some of the same dbpf operations, with the same delays.

Any guesses?

                -- Pete
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

Reply via email to