On Tue, 2008-11-04 at 09:06 -0800, Kurt Dillen wrote:
> 
> Some more information about the environment:
> 
> - Lustre clients are all vmware virtual systems
> - Lustre Farm are all vmware virtual systems

Hrm.  That is a bit of a red flag right there.

> the errors I see are the following:
> 
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc ffff8100e5dca000
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc ffff8100e519e000
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc ffff8100e4e0a000
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc ffff8100e86b1bc0
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc ffff8100e79fe5c0
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc ffff8100e70a88c0
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc ffff8100e7081280
> LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
> 0, status -5, desc ffff8100e6d6d5c0
> LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
> timeout (sent at 1225816920, 100s ago)  [EMAIL PROTECTED] x17940/t0
> o4->[EMAIL PROTECTED]@tcp:28 lens 384/352 ref 2 fl Rpc:/
> 0/0 rc 0/-22
> Lustre: lustre-OST0005-osc-ffff8100e8551800: Connection to service
> lustre-OST0005 via nid [EMAIL PROTECTED] was lost; in progress
> operations using this service will wait for recovery to complete.
> Lustre: lustre-OST0005-osc-ffff8100e8551800: Connection restored to
> service lustre-OST0005 using nid [EMAIL PROTECTED]

These are just regular timeouts, with nothing in the messages themselves
to explain them.  A detailed analysis of all of your server logs (not
something we can do here on lustre-discuss) might yield more, but I have
suspicions about your vmware farm setup.  Running many VMs that all
compete for the same host resources makes the environment unpredictable.
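
For anyone reading the quoted log, the numeric fields can be decoded
mechanically.  The following is only a minimal sketch, not a Lustre
tool: the helper names (errno_name, decode) and the regular expressions
are made up for illustration, the two sample lines are copied from the
log above with the mail wrapping rejoined, and it assumes the negative
numbers are standard Linux errno values and that "sent at" is a Unix
timestamp in seconds.

```python
import errno
import re
from datetime import datetime, timezone

# Two lines from the quoted client log, rejoined where the mail wrapped them.
LOG_LINES = [
    "LustreError: 3420:0:(events.c:134:client_bulk_callback()) "
    "event type 0, status -5, desc ffff8100e5dca000",
    "LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@ "
    "timeout (sent at 1225816920, 100s ago)",
]


def errno_name(code):
    """Map a negative kernel-style return code to its symbolic errno name."""
    # Assumption: the host running this script uses Linux errno numbering.
    return errno.errorcode.get(abs(code), "unknown")


def decode(line):
    """Pull the status code and the send time out of one log line, if present."""
    out = {}
    status = re.search(r"status (-?\d+)", line)
    if status:
        out["status"] = errno_name(int(status.group(1)))
    sent = re.search(r"sent at (\d+), (\d+)s ago", line)
    if sent:
        when = datetime.fromtimestamp(int(sent.group(1)), tz=timezone.utc)
        out["sent_utc"] = when.isoformat()
        out["waited_s"] = int(sent.group(2))
    return out


if __name__ == "__main__":
    for line in LOG_LINES:
        print(decode(line))
```

Under those assumptions, status -5 reads as EIO and the expired request
was sent at 16:42:00 UTC on 2008-11-04, then waited the full 100s.  That
says the bulk transfers failed and the RPC timed out, but not why, which
is why the server logs are needed.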

I'm not sure if you are using host-only or bridged networking, but my
(now quite historic) experience with running lots of vmware machines on
a single piece of hardware is that the host-only network is less than
robust and the memory requirements of running many VMs on a single
machine are demanding.  Additionally, if you have many OSTs all sharing
the same physical disk, you will have further contention there.
Timeouts are not surprising.

I would also encourage you to try 1.6.6 now that it is out, and to get
some baseline performance metrics for all of this virtual hardware with
our iokit.

b.


_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
