Hello, we do see those messages (see below) on our OSTs when under heavy _read_ load (or when 60+ Jobs are trying to read data at approx the same time). The OSTs freezes and even console output is down to a few bytes the minute. After some time the OSTs do revocer.
How to interpret those messages to get a clue where to look further ? 1) Is the underlying Raid to slow to handle the amount of requests ? 1a) Is the MDS to slow to follow ? 2) Is it related to network issues (network too slow, too busy ??) ? 3) Would this be cured by upgrading to 1.8.x ? Status -5 looks to us like I/O errors. ~ # grep '5' /usr/include/*/*errno* /usr/include/asm-generic/errno-base.h:#define EIO 5 /* I/O error */ But together with the 3ware support we are pretty sure to have replaced all snipish disks and data transfer looks ok when not used by lustre (i.e. verifying up to 30-90MB/s/disk throughput). OSTs with 2GB RAM on 16port controller, 4GB RAM on 24port controller Raid6 (write once, read often archive system) lustre-1.6.6 vanilla-kernel 2.6.22.19 3ware 9650se (16 and 24port) latest 9.5.3 Version Seagate 31000340NS disks, HITACHI 1TB disks Thanks and Regards Heiko Dec 4 12:42:13 sadosrd24 LustreError: 4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc ffff810070c9e000 Dec 4 12:42:13 sadosrd24 LustreError: 4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc ffff810076b78000 Dec 4 12:42:13 sadosrd24 LustreError: 4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc ffff810071e88000 Dec 4 12:42:13 sadosrd24 LustreError: 4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc ffff810070c80000 Dec 4 12:42:13 sadosrd24 LustreError: 4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc ffff810076a80000 Dec 4 12:42:56 sadosrd24 LustreError: 4744:0:(ost_handler.c:882:ost_brw_read()) @@@ timeout on bulk PUT after 100+0s r...@ffff81007efa7e00 x7869690/t0 o3->eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff...@net_0x20000c0a8106f_uuid:0/0 lens 384/336 e 0 to 0 dl 1259926976 ref 1 fl Interpret:/0/0 rc 0/0 Dec 4 12:42:56 sadosrd24 Lustre: 4744:0:(ost_handler.c:939:ost_brw_read()) scia-OST001d: ignoring bulk IO comm error with eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff...@net_0x20000c0a8106f_uuid id 12345-192.168.16....@tcp - client will retry Dec 4 12:42:57 sadosrd24 LustreError: 4754:0:(ost_handler.c:882:ost_brw_read()) @@@ timeout on bulk PUT after 100+0s r...@ffff81007dedb400 x7869691/t0 o3->eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff...@net_0x20000c0a8106f_uuid:0/0 lens 384/336 e 0 to 0 dl 1259926977 ref 1 fl Interpret:/0/0 rc 0/0 Dec 4 12:42:57 sadosrd24 Lustre: 4754:0:(ost_handler.c:939:ost_brw_read()) scia-OST001d: ignoring bulk IO comm error with eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff...@net_0x20000c0a8106f_uuid id 12345-192.168.16....@tcp - client will retry Dec 4 12:43:00 sadosrd24 LustreError: 4757:0:(ost_handler.c:882:ost_brw_read()) @@@ timeout on bulk PUT after 100+0s r...@ffff81007e8de000 x7869700/t0 o3->eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff...@net_0x20000c0a8106f_uuid:0/0 lens 384/336 e 0 to 0 dl 1259926980 ref 1 fl Interpret:/0/0 rc 0/0 Dec 4 12:43:00 sadosrd24 LustreError: 4757:0:(ost_handler.c:882:ost_brw_read()) Skipped 2 previous similar messages Dec 4 12:43:00 sadosrd24 Lustre: 4757:0:(ost_handler.c:939:ost_brw_read()) scia-OST001d: ignoring bulk IO comm error with eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff...@net_0x20000c0a8106f_uuid id 12345-192.168.16....@tcp - client will retry Dec 4 12:43:00 sadosrd24 Lustre: 4757:0:(ost_handler.c:939:ost_brw_read()) Skipped 2 previous similar messages Dec 4 12:43:16 sadosrd24 LustreError: 4746:0:(ost_handler.c:882:ost_brw_read()) @@@ timeout on bulk PUT after 94+0s r...@ffff81006de99850 x7869735/t0 o3->eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff...@net_0x20000c0a8106f_uuid:0/0 lens 384/336 e 0 to 0 dl 1259926996 ref 1 fl Interpret:/0/0 rc 0/0 Dec 4 12:43:16 sadosrd24 LustreError: 4746:0:(ost_handler.c:882:ost_brw_read()) Skipped 2 previous similar messages Dec 4 12:43:16 sadosrd24 Lustre: 4746:0:(ost_handler.c:939:ost_brw_read()) scia-OST001d: ignoring bulk IO comm error with eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff...@net_0x20000c0a8106f_uuid id 12345-192.168.16....@tcp - client will retry Dec 4 12:43:16 sadosrd24 Lustre: 4746:0:(ost_handler.c:939:ost_brw_read()) Skipped 2 previous similar messages Dec 4 12:44:13 sadosrd24 Lustre: 4728:0:(ldlm_lib.c:538:target_handle_reconnect()) scia-OST001d: eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0 reconnecting _______________________________________________ Lustre-discuss mailing list [email protected] http://lists.lustre.org/mailman/listinfo/lustre-discuss
