Hi all,

we are trying to run Robinhood for our next Lustre Installation with 2.10 which 
is extremely slow.

I have also tested a fullscan with the same Robinhood client and server boxes 
of our old Lustre installation with 2.5 and it performs significantly better
(on the old Lustre installation the avg. speed was about 2000 entries/sec 
compared to few hundreds).

Just as a comparison the full scan of the new Lustre installation took 
relatively long...

> Fullscan of old Lustre installation took 3 days for 594M entries.
> Fullscan of new Lustre installation took 23 days for 533M entries.

I do not understand the cause of the bad performance of the fullscan on our 
next Lustre installation
(changelog processing is also slow).

So far no meta data performance or file server issues were detected.
The meta data operations were tested with 'lfs path2fid'/'ls' on each single 
MDS and file server synchronous write and read performance were also tested 
with few million small files from multiple clients in parallel.

Few numbers about our Lustre installations:

NEW INSTALLATION

Lustre Server Version: 2.10.6
Number of MDTs: 2
Number of OSTs: 294

OLD INSTALLATION

Lustre Version: 2.5
Number of MDTs: 2
Number of OSTs: 630

LUSTRE CLIENT
Tested with 2.10 and 2.12

SINGLE FOLDER SCAN EXAMPLE

In the following I post few lines of a Robinhood log of a folder scan on the 
Lustre new installation,
where each stats dump looks the same except the last dump stats, where the 
GET_FID stage has non 0 value:

2019/06/11 15:07:24 [13345/2] FS_Scan | Starting scan of /lustre/xxx
2019/06/11 15:07:24 [13345/2] FS_Scan | Notice: this is the first scan (DB is 
empty)
2019/06/11 15:22:24 [13345/3] STATS | ==================== Dumping stats at 
2019/06/11 15:22:24 =====================
2019/06/11 15:22:24 [13345/3] STATS | ======== General statistics =========
2019/06/11 15:22:24 [13345/3] STATS | Daemon start time: 2019/06/11 15:07:24
2019/06/11 15:22:24 [13345/3] STATS | Started modules: scan
2019/06/11 15:22:24 [13345/3] STATS | ======== FS scan statistics =========
2019/06/11 15:22:24 [13345/3] STATS | current scan interval = 18.5d
2019/06/11 15:22:24 [13345/3] STATS | scan is running:
2019/06/11 15:22:24 [13345/3] STATS |      started at : 2019/06/11 15:07:24 
(15.0min ago)
2019/06/11 15:22:24 [13345/3] STATS |      last action: 2019/06/11 15:22:24 
(00s ago)
2019/06/11 15:22:24 [13345/3] STATS |      progress   : 64932 entries scanned 
(0 errors)
2019/06/11 15:22:24 [13345/3] STATS |      inst. speed (potential):     35.92 
entries/sec (55.67 ms/entry/thread)
2019/06/11 15:22:24 [13345/3] STATS |      avg. speed  (effective):     72.15 
entries/sec (27.64 ms/entry/thread)
2019/06/11 15:22:24 [13345/3] STATS | ==== EntryProcessor Pipeline Stats ===
2019/06/11 15:22:24 [13345/3] STATS | Idle threads: 16
2019/06/11 15:22:24 [13345/3] STATS | Id constraints count: 0 (hash 
min=0/max=0/avg=0.0)
2019/06/11 15:22:24 [13345/3] STATS | Name constraints count: 0 (hash 
min=0/max=0/avg=0.0)
2019/06/11 15:22:24 [13345/3] STATS | Stage              | Wait | Curr | Done | 
    Total | ms/op |
2019/06/11 15:22:24 [13345/3] STATS |  0: GET_FID        |    0 |    0 |    0 | 
        0 |  0.00 |
2019/06/11 15:22:24 [13345/3] STATS |  1: GET_INFO_DB    |    0 |    0 |    0 | 
    64865 |  0.36 |
2019/06/11 15:22:24 [13345/3] STATS |  2: GET_INFO_FS    |    0 |    0 |    0 | 
    64865 |  0.01 |
2019/06/11 15:22:24 [13345/3] STATS |  3: PRE_APPLY      |    0 |    0 |    0 | 
    64865 |  0.00 |
2019/06/11 15:22:24 [13345/3] STATS |  4: DB_APPLY       |    0 |    0 |    0 | 
    64865 |  1.25 | 15.53% batched (avg batch size: 2.4)
2019/06/11 15:22:24 [13345/3] STATS |  5: CHGLOG_CLR     |    0 |    0 |    0 | 
        0 |  0.00 |
2019/06/11 15:22:24 [13345/3] STATS |  6: RM_OLD_ENTRIES |    0 |    0 |    0 | 
        0 |  0.00 |
2019/06/11 15:22:24 [13345/3] STATS | DB ops: get=0/ins=64865/upd=0/rm=0
2019/06/11 15:37:24 [13345/3] STATS | ==================== Dumping stats at 
2019/06/11 15:37:24 =====================
2019/06/11 15:37:24 [13345/3] STATS | ======== General statistics =========
2019/06/11 15:37:24 [13345/3] STATS | Daemon start time: 2019/06/11 15:07:24
2019/06/11 15:37:24 [13345/3] STATS | Started modules: scan
2019/06/11 15:37:24 [13345/3] STATS | ======== FS scan statistics =========
2019/06/11 15:37:24 [13345/3] STATS | current scan interval = 18.5d
2019/06/11 15:37:24 [13345/3] STATS | scan is running:
2019/06/11 15:37:24 [13345/3] STATS |      started at : 2019/06/11 15:07:24 
(30.0min ago)
2019/06/11 15:37:24 [13345/3] STATS |      last action: 2019/06/11 15:37:24 
(00s ago)
2019/06/11 15:37:24 [13345/3] STATS |      progress   : 135638 entries scanned 
(0 errors)
2019/06/11 15:37:24 [13345/3] STATS |      inst. speed (potential):     48.26 
entries/sec (41.45 ms/entry/thread)
2019/06/11 15:37:24 [13345/3] STATS |      avg. speed  (effective):     75.35 
entries/sec (26.47 ms/entry/thread)
2019/06/11 15:37:24 [13345/3] STATS | ==== EntryProcessor Pipeline Stats ===
2019/06/11 15:37:24 [13345/3] STATS | Idle threads: 16
2019/06/11 15:37:24 [13345/3] STATS | Id constraints count: 0 (hash 
min=0/max=0/avg=0.0)
2019/06/11 15:37:24 [13345/3] STATS | Name constraints count: 0 (hash 
min=0/max=0/avg=0.0)
2019/06/11 15:37:24 [13345/3] STATS | Stage              | Wait | Curr | Done | 
    Total | ms/op |
2019/06/11 15:37:24 [13345/3] STATS |  0: GET_FID        |    0 |    0 |    0 | 
        0 |  0.00 |
2019/06/11 15:37:24 [13345/3] STATS |  1: GET_INFO_DB    |    0 |    0 |    0 | 
    64445 |  0.36 |
2019/06/11 15:37:24 [13345/3] STATS |  2: GET_INFO_FS    |    0 |    0 |    0 | 
    64445 |  0.01 |
2019/06/11 15:37:24 [13345/3] STATS |  3: PRE_APPLY      |    0 |    0 |    0 | 
    64445 |  0.00 |
2019/06/11 15:37:24 [13345/3] STATS |  4: DB_APPLY       |    0 |    0 |    0 | 
    64445 |  1.28 | 13.67% batched (avg batch size: 2.3)
2019/06/11 15:37:24 [13345/3] STATS |  5: CHGLOG_CLR     |    0 |    0 |    0 | 
        0 |  0.00 |
2019/06/11 15:37:24 [13345/3] STATS |  6: RM_OLD_ENTRIES |    0 |    0 |    0 | 
        0 |  0.00 |
2019/06/11 15:37:24 [13345/3] STATS | DB ops: get=0/ins=129310/upd=0/rm=0
2019/06/11 15:52:24 [13345/3] STATS | ==================== Dumping stats at 
2019/06/11 15:52:24 =====================
2019/06/11 15:52:24 [13345/3] STATS | ======== General statistics =========
2019/06/11 15:52:24 [13345/3] STATS | Daemon start time: 2019/06/11 15:07:24
2019/06/11 15:52:24 [13345/3] STATS | Started modules: scan
2019/06/11 15:52:24 [13345/3] STATS | ======== FS scan statistics =========
2019/06/11 15:52:24 [13345/3] STATS | current scan interval = 18.5d
2019/06/11 15:52:24 [13345/3] STATS | scan is running:
2019/06/11 15:52:24 [13345/3] STATS |      started at : 2019/06/11 15:07:24 
(45.0min ago)
2019/06/11 15:52:24 [13345/3] STATS |      last action: 2019/06/11 15:52:24 
(00s ago)
2019/06/11 15:52:24 [13345/3] STATS |      progress   : 202031 entries scanned 
(0 errors)
2019/06/11 15:52:24 [13345/3] STATS |      inst. speed (potential):     38.16 
entries/sec (52.41 ms/entry/thread)
2019/06/11 15:52:24 [13345/3] STATS |      avg. speed  (effective):     74.83 
entries/sec (26.65 ms/entry/thread)
2019/06/11 15:52:24 [13345/3] STATS | ==== EntryProcessor Pipeline Stats ===
2019/06/11 15:52:24 [13345/3] STATS | Idle threads: 16
2019/06/11 15:52:24 [13345/3] STATS | Id constraints count: 0 (hash 
min=0/max=0/avg=0.0)
2019/06/11 15:52:24 [13345/3] STATS | Name constraints count: 0 (hash 
min=0/max=0/avg=0.0)
2019/06/11 15:52:24 [13345/3] STATS | Stage              | Wait | Curr | Done | 
    Total | ms/op |
2019/06/11 15:52:24 [13345/3] STATS |  0: GET_FID        |    0 |    0 |    0 | 
        0 |  0.00 |
2019/06/11 15:52:24 [13345/3] STATS |  1: GET_INFO_DB    |    0 |    0 |    0 | 
    67000 |  0.36 |
2019/06/11 15:52:24 [13345/3] STATS |  2: GET_INFO_FS    |    0 |    0 |    0 | 
    67000 |  0.01 |
2019/06/11 15:52:24 [13345/3] STATS |  3: PRE_APPLY      |    0 |    0 |    0 | 
    67000 |  0.00 |
2019/06/11 15:52:24 [13345/3] STATS |  4: DB_APPLY       |    0 |    0 |    0 | 
    67000 |  1.27 | 12.16% batched (avg batch size: 2.4)
2019/06/11 15:52:24 [13345/3] STATS |  5: CHGLOG_CLR     |    0 |    0 |    0 | 
        0 |  0.00 |
2019/06/11 15:52:24 [13345/3] STATS |  6: RM_OLD_ENTRIES |    0 |    0 |    0 | 
        0 |  0.00 |
2019/06/11 15:52:24 [13345/3] STATS | DB ops: get=0/ins=196310/upd=0/rm=0
2019/06/12 01:07:24 [13345/3] STATS | ==================== Dumping stats at 
2019/06/12 01:07:24 =====================
2019/06/12 01:07:24 [13345/3] STATS | ======== General statistics =========
2019/06/12 01:07:24 [13345/3] STATS | Daemon start time: 2019/06/11 15:07:24
2019/06/12 01:07:24 [13345/3] STATS | Started modules: scan
2019/06/12 01:07:24 [13345/3] STATS | ======== FS scan statistics =========
2019/06/12 01:07:24 [13345/3] STATS | current scan interval = 18.5d
2019/06/12 01:07:24 [13345/3] STATS | scan is running:
2019/06/12 01:07:24 [13345/3] STATS |      started at : 2019/06/11 15:07:24 
(10.0h ago)
2019/06/12 01:07:24 [13345/3] STATS |      last action: 2019/06/12 01:07:24 
(00s ago)
2019/06/12 01:07:24 [13345/3] STATS |      progress   : 3840469 entries scanned 
(0 errors)
2019/06/12 01:07:24 [13345/3] STATS |      inst. speed (potential):     30.58 
entries/sec (65.39 ms/entry/thread)
2019/06/12 01:07:24 [13345/3] STATS |      avg. speed  (effective):    106.68 
entries/sec (18.69 ms/entry/thread)
2019/06/12 01:07:24 [13345/3] STATS | ==== EntryProcessor Pipeline Stats ===
2019/06/12 01:07:24 [13345/3] STATS | Idle threads: 16
2019/06/12 01:07:24 [13345/3] STATS | Id constraints count: 0 (hash 
min=0/max=0/avg=0.0)
2019/06/12 01:07:24 [13345/3] STATS | Name constraints count: 0 (hash 
min=0/max=0/avg=0.0)
2019/06/12 01:07:24 [13345/3] STATS | Stage              | Wait | Curr | Done | 
    Total | ms/op |
2019/06/12 01:07:24 [13345/3] STATS |  0: GET_FID        |    0 |    0 |    0 | 
        0 |  0.00 |
2019/06/12 01:07:24 [13345/3] STATS |  1: GET_INFO_DB    |    0 |    0 |    0 | 
    67331 |  0.36 |
2019/06/12 01:07:24 [13345/3] STATS |  2: GET_INFO_FS    |    0 |    0 |    0 | 
    67331 |  0.01 |
2019/06/12 01:07:24 [13345/3] STATS |  3: PRE_APPLY      |    0 |    0 |    0 | 
    67331 |  0.00 |
2019/06/12 01:07:24 [13345/3] STATS |  4: DB_APPLY       |    0 |    0 |    0 | 
    67331 |  1.29 | 12.56% batched (avg batch size: 2.4)
2019/06/12 01:07:24 [13345/3] STATS |  5: CHGLOG_CLR     |    0 |    0 |    0 | 
        0 |  0.00 |
2019/06/12 01:07:24 [13345/3] STATS |  6: RM_OLD_ENTRIES |    0 |    0 |    0 | 
        0 |  0.00 |
2019/06/12 01:07:24 [13345/3] STATS | DB ops: get=0/ins=3838009/upd=0/rm=0
...
2019/06/12 01:21:21 [13345/4] FS_Scan | Full scan of /lustre/alice completed, 
3948736 entries found (0 errors). Duration = 36837.18s
2019/06/12 01:21:21 [13345/4] FS_Scan | Flushing pipeline...
2019/06/12 01:21:26 [13345/4] FS_Scan | File list of /lustre/alice has been 
updated
2019/06/12 01:21:26 [13345/1] Main | FS Scan finished
In general each fullscan of the new Lustre installation is very slow compared 
to the old installation.

We have also done a fullscan of both the new and old Lustre installation with 
the same Robinhood client and server setup.

2019/06/12 01:21:26 [13345/1] EntryProc | Pipeline successfully flushed
2019/06/12 01:21:26 [13345/1] STATS | ==== EntryProcessor Pipeline Stats ===
2019/06/12 01:21:26 [13345/1] STATS | Idle threads: 0
2019/06/12 01:21:26 [13345/1] STATS | Id constraints count: 0 (hash 
min=0/max=0/avg=0.0)
2019/06/12 01:21:26 [13345/1] STATS | Name constraints count: 0 (hash 
min=0/max=0/avg=0.0)
2019/06/12 01:21:26 [13345/1] STATS | Stage              | Wait | Curr | Done | 
    Total | ms/op |
2019/06/12 01:21:26 [13345/1] STATS |  0: GET_FID        |    0 |    0 |    0 | 
    43782 |  0.72 |
2019/06/12 01:21:26 [13345/1] STATS |  1: GET_INFO_DB    |    0 |    0 |    0 | 
   110728 |  0.33 |
2019/06/12 01:21:26 [13345/1] STATS |  2: GET_INFO_FS    |    0 |    0 |    0 | 
   110728 |  0.23 |
2019/06/12 01:21:26 [13345/1] STATS |  3: PRE_APPLY      |    0 |    0 |    0 | 
   110728 |  0.00 |
2019/06/12 01:21:26 [13345/1] STATS |  4: DB_APPLY       |    0 |    0 |    0 | 
   110728 |  0.66 | 61.59% batched (avg batch size: 18.9)
2019/06/12 01:21:26 [13345/1] STATS |  5: CHGLOG_CLR     |    0 |    0 |    0 | 
        0 |  0.00 |
2019/06/12 01:21:26 [13345/1] STATS |  6: RM_OLD_ENTRIES |    0 |    0 |    0 | 
        1 | 4566.98 |
2019/06/12 01:21:26 [13345/1] STATS | DB ops: get=0/ins=3948737/upd=0/rm=0
2019/06/12 01:21:26 [13345/1] Main | All tasks done! Exiting.

What do you think?
Is there anything suspicious about the dumped stats that could hint to a very 
slow processing?
Is it a problem that the GET_FID stages values are 0 except in the last dumped 
stats?

Let me know if you need more information.

We would appreciate your help.


Best regards
Gabriele Iannetti

_______________________________________________
robinhood-support mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/robinhood-support

Reply via email to