Hi all,
we are trying to run Robinhood for our next Lustre Installation with 2.10 which
is extremely slow.
I have also tested a fullscan with the same Robinhood client and server boxes
of our old Lustre installation with 2.5 and it performs significantly better
(on the old Lustre installation the avg. speed was about 2000 entries/sec
compared to few hundreds).
Just as a comparison the full scan of the new Lustre installation took
relatively long...
> Fullscan of old Lustre installation took 3 days for 594M entries.
> Fullscan of new Lustre installation took 23 days for 533M entries.
I do not understand the cause of the bad performance of the fullscan on our
next Lustre installation
(changelog processing is also slow).
So far no meta data performance or file server issues were detected.
The meta data operations were tested with 'lfs path2fid'/'ls' on each single
MDS and file server synchronous write and read performance were also tested
with few million small files from multiple clients in parallel.
Few numbers about our Lustre installations:
NEW INSTALLATION
Lustre Server Version: 2.10.6
Number of MDTs: 2
Number of OSTs: 294
OLD INSTALLATION
Lustre Version: 2.5
Number of MDTs: 2
Number of OSTs: 630
LUSTRE CLIENT
Tested with 2.10 and 2.12
SINGLE FOLDER SCAN EXAMPLE
In the following I post few lines of a Robinhood log of a folder scan on the
Lustre new installation,
where each stats dump looks the same except the last dump stats, where the
GET_FID stage has non 0 value:
2019/06/11 15:07:24 [13345/2] FS_Scan | Starting scan of /lustre/xxx
2019/06/11 15:07:24 [13345/2] FS_Scan | Notice: this is the first scan (DB is
empty)
2019/06/11 15:22:24 [13345/3] STATS | ==================== Dumping stats at
2019/06/11 15:22:24 =====================
2019/06/11 15:22:24 [13345/3] STATS | ======== General statistics =========
2019/06/11 15:22:24 [13345/3] STATS | Daemon start time: 2019/06/11 15:07:24
2019/06/11 15:22:24 [13345/3] STATS | Started modules: scan
2019/06/11 15:22:24 [13345/3] STATS | ======== FS scan statistics =========
2019/06/11 15:22:24 [13345/3] STATS | current scan interval = 18.5d
2019/06/11 15:22:24 [13345/3] STATS | scan is running:
2019/06/11 15:22:24 [13345/3] STATS | started at : 2019/06/11 15:07:24
(15.0min ago)
2019/06/11 15:22:24 [13345/3] STATS | last action: 2019/06/11 15:22:24
(00s ago)
2019/06/11 15:22:24 [13345/3] STATS | progress : 64932 entries scanned
(0 errors)
2019/06/11 15:22:24 [13345/3] STATS | inst. speed (potential): 35.92
entries/sec (55.67 ms/entry/thread)
2019/06/11 15:22:24 [13345/3] STATS | avg. speed (effective): 72.15
entries/sec (27.64 ms/entry/thread)
2019/06/11 15:22:24 [13345/3] STATS | ==== EntryProcessor Pipeline Stats ===
2019/06/11 15:22:24 [13345/3] STATS | Idle threads: 16
2019/06/11 15:22:24 [13345/3] STATS | Id constraints count: 0 (hash
min=0/max=0/avg=0.0)
2019/06/11 15:22:24 [13345/3] STATS | Name constraints count: 0 (hash
min=0/max=0/avg=0.0)
2019/06/11 15:22:24 [13345/3] STATS | Stage | Wait | Curr | Done |
Total | ms/op |
2019/06/11 15:22:24 [13345/3] STATS | 0: GET_FID | 0 | 0 | 0 |
0 | 0.00 |
2019/06/11 15:22:24 [13345/3] STATS | 1: GET_INFO_DB | 0 | 0 | 0 |
64865 | 0.36 |
2019/06/11 15:22:24 [13345/3] STATS | 2: GET_INFO_FS | 0 | 0 | 0 |
64865 | 0.01 |
2019/06/11 15:22:24 [13345/3] STATS | 3: PRE_APPLY | 0 | 0 | 0 |
64865 | 0.00 |
2019/06/11 15:22:24 [13345/3] STATS | 4: DB_APPLY | 0 | 0 | 0 |
64865 | 1.25 | 15.53% batched (avg batch size: 2.4)
2019/06/11 15:22:24 [13345/3] STATS | 5: CHGLOG_CLR | 0 | 0 | 0 |
0 | 0.00 |
2019/06/11 15:22:24 [13345/3] STATS | 6: RM_OLD_ENTRIES | 0 | 0 | 0 |
0 | 0.00 |
2019/06/11 15:22:24 [13345/3] STATS | DB ops: get=0/ins=64865/upd=0/rm=0
2019/06/11 15:37:24 [13345/3] STATS | ==================== Dumping stats at
2019/06/11 15:37:24 =====================
2019/06/11 15:37:24 [13345/3] STATS | ======== General statistics =========
2019/06/11 15:37:24 [13345/3] STATS | Daemon start time: 2019/06/11 15:07:24
2019/06/11 15:37:24 [13345/3] STATS | Started modules: scan
2019/06/11 15:37:24 [13345/3] STATS | ======== FS scan statistics =========
2019/06/11 15:37:24 [13345/3] STATS | current scan interval = 18.5d
2019/06/11 15:37:24 [13345/3] STATS | scan is running:
2019/06/11 15:37:24 [13345/3] STATS | started at : 2019/06/11 15:07:24
(30.0min ago)
2019/06/11 15:37:24 [13345/3] STATS | last action: 2019/06/11 15:37:24
(00s ago)
2019/06/11 15:37:24 [13345/3] STATS | progress : 135638 entries scanned
(0 errors)
2019/06/11 15:37:24 [13345/3] STATS | inst. speed (potential): 48.26
entries/sec (41.45 ms/entry/thread)
2019/06/11 15:37:24 [13345/3] STATS | avg. speed (effective): 75.35
entries/sec (26.47 ms/entry/thread)
2019/06/11 15:37:24 [13345/3] STATS | ==== EntryProcessor Pipeline Stats ===
2019/06/11 15:37:24 [13345/3] STATS | Idle threads: 16
2019/06/11 15:37:24 [13345/3] STATS | Id constraints count: 0 (hash
min=0/max=0/avg=0.0)
2019/06/11 15:37:24 [13345/3] STATS | Name constraints count: 0 (hash
min=0/max=0/avg=0.0)
2019/06/11 15:37:24 [13345/3] STATS | Stage | Wait | Curr | Done |
Total | ms/op |
2019/06/11 15:37:24 [13345/3] STATS | 0: GET_FID | 0 | 0 | 0 |
0 | 0.00 |
2019/06/11 15:37:24 [13345/3] STATS | 1: GET_INFO_DB | 0 | 0 | 0 |
64445 | 0.36 |
2019/06/11 15:37:24 [13345/3] STATS | 2: GET_INFO_FS | 0 | 0 | 0 |
64445 | 0.01 |
2019/06/11 15:37:24 [13345/3] STATS | 3: PRE_APPLY | 0 | 0 | 0 |
64445 | 0.00 |
2019/06/11 15:37:24 [13345/3] STATS | 4: DB_APPLY | 0 | 0 | 0 |
64445 | 1.28 | 13.67% batched (avg batch size: 2.3)
2019/06/11 15:37:24 [13345/3] STATS | 5: CHGLOG_CLR | 0 | 0 | 0 |
0 | 0.00 |
2019/06/11 15:37:24 [13345/3] STATS | 6: RM_OLD_ENTRIES | 0 | 0 | 0 |
0 | 0.00 |
2019/06/11 15:37:24 [13345/3] STATS | DB ops: get=0/ins=129310/upd=0/rm=0
2019/06/11 15:52:24 [13345/3] STATS | ==================== Dumping stats at
2019/06/11 15:52:24 =====================
2019/06/11 15:52:24 [13345/3] STATS | ======== General statistics =========
2019/06/11 15:52:24 [13345/3] STATS | Daemon start time: 2019/06/11 15:07:24
2019/06/11 15:52:24 [13345/3] STATS | Started modules: scan
2019/06/11 15:52:24 [13345/3] STATS | ======== FS scan statistics =========
2019/06/11 15:52:24 [13345/3] STATS | current scan interval = 18.5d
2019/06/11 15:52:24 [13345/3] STATS | scan is running:
2019/06/11 15:52:24 [13345/3] STATS | started at : 2019/06/11 15:07:24
(45.0min ago)
2019/06/11 15:52:24 [13345/3] STATS | last action: 2019/06/11 15:52:24
(00s ago)
2019/06/11 15:52:24 [13345/3] STATS | progress : 202031 entries scanned
(0 errors)
2019/06/11 15:52:24 [13345/3] STATS | inst. speed (potential): 38.16
entries/sec (52.41 ms/entry/thread)
2019/06/11 15:52:24 [13345/3] STATS | avg. speed (effective): 74.83
entries/sec (26.65 ms/entry/thread)
2019/06/11 15:52:24 [13345/3] STATS | ==== EntryProcessor Pipeline Stats ===
2019/06/11 15:52:24 [13345/3] STATS | Idle threads: 16
2019/06/11 15:52:24 [13345/3] STATS | Id constraints count: 0 (hash
min=0/max=0/avg=0.0)
2019/06/11 15:52:24 [13345/3] STATS | Name constraints count: 0 (hash
min=0/max=0/avg=0.0)
2019/06/11 15:52:24 [13345/3] STATS | Stage | Wait | Curr | Done |
Total | ms/op |
2019/06/11 15:52:24 [13345/3] STATS | 0: GET_FID | 0 | 0 | 0 |
0 | 0.00 |
2019/06/11 15:52:24 [13345/3] STATS | 1: GET_INFO_DB | 0 | 0 | 0 |
67000 | 0.36 |
2019/06/11 15:52:24 [13345/3] STATS | 2: GET_INFO_FS | 0 | 0 | 0 |
67000 | 0.01 |
2019/06/11 15:52:24 [13345/3] STATS | 3: PRE_APPLY | 0 | 0 | 0 |
67000 | 0.00 |
2019/06/11 15:52:24 [13345/3] STATS | 4: DB_APPLY | 0 | 0 | 0 |
67000 | 1.27 | 12.16% batched (avg batch size: 2.4)
2019/06/11 15:52:24 [13345/3] STATS | 5: CHGLOG_CLR | 0 | 0 | 0 |
0 | 0.00 |
2019/06/11 15:52:24 [13345/3] STATS | 6: RM_OLD_ENTRIES | 0 | 0 | 0 |
0 | 0.00 |
2019/06/11 15:52:24 [13345/3] STATS | DB ops: get=0/ins=196310/upd=0/rm=0
2019/06/12 01:07:24 [13345/3] STATS | ==================== Dumping stats at
2019/06/12 01:07:24 =====================
2019/06/12 01:07:24 [13345/3] STATS | ======== General statistics =========
2019/06/12 01:07:24 [13345/3] STATS | Daemon start time: 2019/06/11 15:07:24
2019/06/12 01:07:24 [13345/3] STATS | Started modules: scan
2019/06/12 01:07:24 [13345/3] STATS | ======== FS scan statistics =========
2019/06/12 01:07:24 [13345/3] STATS | current scan interval = 18.5d
2019/06/12 01:07:24 [13345/3] STATS | scan is running:
2019/06/12 01:07:24 [13345/3] STATS | started at : 2019/06/11 15:07:24
(10.0h ago)
2019/06/12 01:07:24 [13345/3] STATS | last action: 2019/06/12 01:07:24
(00s ago)
2019/06/12 01:07:24 [13345/3] STATS | progress : 3840469 entries scanned
(0 errors)
2019/06/12 01:07:24 [13345/3] STATS | inst. speed (potential): 30.58
entries/sec (65.39 ms/entry/thread)
2019/06/12 01:07:24 [13345/3] STATS | avg. speed (effective): 106.68
entries/sec (18.69 ms/entry/thread)
2019/06/12 01:07:24 [13345/3] STATS | ==== EntryProcessor Pipeline Stats ===
2019/06/12 01:07:24 [13345/3] STATS | Idle threads: 16
2019/06/12 01:07:24 [13345/3] STATS | Id constraints count: 0 (hash
min=0/max=0/avg=0.0)
2019/06/12 01:07:24 [13345/3] STATS | Name constraints count: 0 (hash
min=0/max=0/avg=0.0)
2019/06/12 01:07:24 [13345/3] STATS | Stage | Wait | Curr | Done |
Total | ms/op |
2019/06/12 01:07:24 [13345/3] STATS | 0: GET_FID | 0 | 0 | 0 |
0 | 0.00 |
2019/06/12 01:07:24 [13345/3] STATS | 1: GET_INFO_DB | 0 | 0 | 0 |
67331 | 0.36 |
2019/06/12 01:07:24 [13345/3] STATS | 2: GET_INFO_FS | 0 | 0 | 0 |
67331 | 0.01 |
2019/06/12 01:07:24 [13345/3] STATS | 3: PRE_APPLY | 0 | 0 | 0 |
67331 | 0.00 |
2019/06/12 01:07:24 [13345/3] STATS | 4: DB_APPLY | 0 | 0 | 0 |
67331 | 1.29 | 12.56% batched (avg batch size: 2.4)
2019/06/12 01:07:24 [13345/3] STATS | 5: CHGLOG_CLR | 0 | 0 | 0 |
0 | 0.00 |
2019/06/12 01:07:24 [13345/3] STATS | 6: RM_OLD_ENTRIES | 0 | 0 | 0 |
0 | 0.00 |
2019/06/12 01:07:24 [13345/3] STATS | DB ops: get=0/ins=3838009/upd=0/rm=0
...
2019/06/12 01:21:21 [13345/4] FS_Scan | Full scan of /lustre/alice completed,
3948736 entries found (0 errors). Duration = 36837.18s
2019/06/12 01:21:21 [13345/4] FS_Scan | Flushing pipeline...
2019/06/12 01:21:26 [13345/4] FS_Scan | File list of /lustre/alice has been
updated
2019/06/12 01:21:26 [13345/1] Main | FS Scan finished
In general each fullscan of the new Lustre installation is very slow compared
to the old installation.
We have also done a fullscan of both the new and old Lustre installation with
the same Robinhood client and server setup.
2019/06/12 01:21:26 [13345/1] EntryProc | Pipeline successfully flushed
2019/06/12 01:21:26 [13345/1] STATS | ==== EntryProcessor Pipeline Stats ===
2019/06/12 01:21:26 [13345/1] STATS | Idle threads: 0
2019/06/12 01:21:26 [13345/1] STATS | Id constraints count: 0 (hash
min=0/max=0/avg=0.0)
2019/06/12 01:21:26 [13345/1] STATS | Name constraints count: 0 (hash
min=0/max=0/avg=0.0)
2019/06/12 01:21:26 [13345/1] STATS | Stage | Wait | Curr | Done |
Total | ms/op |
2019/06/12 01:21:26 [13345/1] STATS | 0: GET_FID | 0 | 0 | 0 |
43782 | 0.72 |
2019/06/12 01:21:26 [13345/1] STATS | 1: GET_INFO_DB | 0 | 0 | 0 |
110728 | 0.33 |
2019/06/12 01:21:26 [13345/1] STATS | 2: GET_INFO_FS | 0 | 0 | 0 |
110728 | 0.23 |
2019/06/12 01:21:26 [13345/1] STATS | 3: PRE_APPLY | 0 | 0 | 0 |
110728 | 0.00 |
2019/06/12 01:21:26 [13345/1] STATS | 4: DB_APPLY | 0 | 0 | 0 |
110728 | 0.66 | 61.59% batched (avg batch size: 18.9)
2019/06/12 01:21:26 [13345/1] STATS | 5: CHGLOG_CLR | 0 | 0 | 0 |
0 | 0.00 |
2019/06/12 01:21:26 [13345/1] STATS | 6: RM_OLD_ENTRIES | 0 | 0 | 0 |
1 | 4566.98 |
2019/06/12 01:21:26 [13345/1] STATS | DB ops: get=0/ins=3948737/upd=0/rm=0
2019/06/12 01:21:26 [13345/1] Main | All tasks done! Exiting.
What do you think?
Is there anything suspicious about the dumped stats that could hint to a very
slow processing?
Is it a problem that the GET_FID stages values are 0 except in the last dumped
stats?
Let me know if you need more information.
We would appreciate your help.
Best regards
Gabriele Iannetti
_______________________________________________
robinhood-support mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/robinhood-support