Hi Marion, thank you for sharing your experience!
As far as I have instigated the meta data performance on our Lustre MDTs there are no issues. I am not blaming Robinhood for the slowdown, but asking for help to figure out what the root causes could be for this. Best regards, Gabriele ________________________________________ From: Marion Hakanson <hakan...@ohsu.edu> Sent: Thursday, June 13, 2019 00:15 To: Iannetti, Gabriele Cc: robinhood-support@lists.sourceforge.net Subject: Re: [robinhood-support] Lustre fullscan extremely slow on Lustre 2.10 Our experience has been similar. Initial scan, CentOS-6.6, Robinhood 2.5, Lustre IEEL-2.3 (2.5-based) took 6 days for 387M entries. We recently upgraded all servers and clients to CentOS-7.5, Lustre 2.10.6, and Robinhood 3.1.4; Same hardware everywhere. Initial scan has been running for 116 days, has only scanned 92M items (out of 183M). Note that for us, Lustre-2.10.6 has been quite a bit slower for interactive use (ls in a new directory, etc.) than the previous version. There just seems to be a lot more locking overhead. So we are not at this time blaming Robinhood for this slowdown. I'm sorry that I can't answer any of your questions. Regards, Marion > From: "Iannetti, Gabriele" <g.ianne...@gsi.de> > To: "robinhood-support@lists.sourceforge.net" > <robinhood-support@lists.sourceforge.net> > Date: Wed, 12 Jun 2019 12:52:03 +0000 > > Hi all, > > we are trying to run Robinhood for our next Lustre Installation with 2.10 > which is extremely slow. > > I have also tested a fullscan with the same Robinhood client and server boxes > of our old Lustre installation with 2.5 and it performs significantly better > (on the old Lustre installation the avg. speed was about 2000 entries/sec > compared to few hundreds). > > Just as a comparison the full scan of the new Lustre installation took > relatively long... > > > Fullscan of old Lustre installation took 3 days for 594M entries. > > Fullscan of new Lustre installation took 23 days for 533M entries. > > I do not understand the cause of the bad performance of the fullscan on our > next Lustre installation > (changelog processing is also slow). > > So far no meta data performance or file server issues were detected. > The meta data operations were tested with 'lfs path2fid'/'ls' on each single > MDS and file server synchronous write and read performance were also tested > with few million small files from multiple clients in parallel. > > Few numbers about our Lustre installations: > > NEW INSTALLATION > > Lustre Server Version: 2.10.6 > Number of MDTs: 2 > Number of OSTs: 294 > > OLD INSTALLATION > > Lustre Version: 2.5 > Number of MDTs: 2 > Number of OSTs: 630 > > LUSTRE CLIENT > Tested with 2.10 and 2.12 > > SINGLE FOLDER SCAN EXAMPLE > > In the following I post few lines of a Robinhood log of a folder scan on the > Lustre new installation, > where each stats dump looks the same except the last dump stats, where the > GET_FID stage has non 0 value: > > 2019/06/11 15:07:24 [13345/2] FS_Scan | Starting scan of /lustre/xxx > 2019/06/11 15:07:24 [13345/2] FS_Scan | Notice: this is the first scan (DB is > empty) > 2019/06/11 15:22:24 [13345/3] STATS | ==================== Dumping stats at > 2019/06/11 15:22:24 ===================== > 2019/06/11 15:22:24 [13345/3] STATS | ======== General statistics ========= > 2019/06/11 15:22:24 [13345/3] STATS | Daemon start time: 2019/06/11 15:07:24 > 2019/06/11 15:22:24 [13345/3] STATS | Started modules: scan > 2019/06/11 15:22:24 [13345/3] STATS | ======== FS scan statistics ========= > 2019/06/11 15:22:24 [13345/3] STATS | current scan interval = 18.5d > 2019/06/11 15:22:24 [13345/3] STATS | scan is running: > 2019/06/11 15:22:24 [13345/3] STATS | started at : 2019/06/11 15:07:24 > (15.0min ago) > 2019/06/11 15:22:24 [13345/3] STATS | last action: 2019/06/11 15:22:24 > (00s ago) > 2019/06/11 15:22:24 [13345/3] STATS | progress : 64932 entries scanned > (0 errors) > 2019/06/11 15:22:24 [13345/3] STATS | inst. speed (potential): 35.92 > entries/sec (55.67 ms/entry/thread) > 2019/06/11 15:22:24 [13345/3] STATS | avg. speed (effective): 72.15 > entries/sec (27.64 ms/entry/thread) > 2019/06/11 15:22:24 [13345/3] STATS | ==== EntryProcessor Pipeline Stats === > 2019/06/11 15:22:24 [13345/3] STATS | Idle threads: 16 > 2019/06/11 15:22:24 [13345/3] STATS | Id constraints count: 0 (hash > min=0/max=0/avg=0.0) > 2019/06/11 15:22:24 [13345/3] STATS | Name constraints count: 0 (hash > min=0/max=0/avg=0.0) > 2019/06/11 15:22:24 [13345/3] STATS | Stage | Wait | Curr | Done > | Total | ms/op | > 2019/06/11 15:22:24 [13345/3] STATS | 0: GET_FID | 0 | 0 | 0 > | 0 | 0.00 | > 2019/06/11 15:22:24 [13345/3] STATS | 1: GET_INFO_DB | 0 | 0 | 0 > | 64865 | 0.36 | > 2019/06/11 15:22:24 [13345/3] STATS | 2: GET_INFO_FS | 0 | 0 | 0 > | 64865 | 0.01 | > 2019/06/11 15:22:24 [13345/3] STATS | 3: PRE_APPLY | 0 | 0 | 0 > | 64865 | 0.00 | > 2019/06/11 15:22:24 [13345/3] STATS | 4: DB_APPLY | 0 | 0 | 0 > | 64865 | 1.25 | 15.53% batched (avg batch size: 2.4) > 2019/06/11 15:22:24 [13345/3] STATS | 5: CHGLOG_CLR | 0 | 0 | 0 > | 0 | 0.00 | > 2019/06/11 15:22:24 [13345/3] STATS | 6: RM_OLD_ENTRIES | 0 | 0 | 0 > | 0 | 0.00 | > 2019/06/11 15:22:24 [13345/3] STATS | DB ops: get=0/ins=64865/upd=0/rm=0 > 2019/06/11 15:37:24 [13345/3] STATS | ==================== Dumping stats at > 2019/06/11 15:37:24 ===================== > 2019/06/11 15:37:24 [13345/3] STATS | ======== General statistics ========= > 2019/06/11 15:37:24 [13345/3] STATS | Daemon start time: 2019/06/11 15:07:24 > 2019/06/11 15:37:24 [13345/3] STATS | Started modules: scan > 2019/06/11 15:37:24 [13345/3] STATS | ======== FS scan statistics ========= > 2019/06/11 15:37:24 [13345/3] STATS | current scan interval = 18.5d > 2019/06/11 15:37:24 [13345/3] STATS | scan is running: > 2019/06/11 15:37:24 [13345/3] STATS | started at : 2019/06/11 15:07:24 > (30.0min ago) > 2019/06/11 15:37:24 [13345/3] STATS | last action: 2019/06/11 15:37:24 > (00s ago) > 2019/06/11 15:37:24 [13345/3] STATS | progress : 135638 entries > scanned (0 errors) > 2019/06/11 15:37:24 [13345/3] STATS | inst. speed (potential): 48.26 > entries/sec (41.45 ms/entry/thread) > 2019/06/11 15:37:24 [13345/3] STATS | avg. speed (effective): 75.35 > entries/sec (26.47 ms/entry/thread) > 2019/06/11 15:37:24 [13345/3] STATS | ==== EntryProcessor Pipeline Stats === > 2019/06/11 15:37:24 [13345/3] STATS | Idle threads: 16 > 2019/06/11 15:37:24 [13345/3] STATS | Id constraints count: 0 (hash > min=0/max=0/avg=0.0) > 2019/06/11 15:37:24 [13345/3] STATS | Name constraints count: 0 (hash > min=0/max=0/avg=0.0) > 2019/06/11 15:37:24 [13345/3] STATS | Stage | Wait | Curr | Done > | Total | ms/op | > 2019/06/11 15:37:24 [13345/3] STATS | 0: GET_FID | 0 | 0 | 0 > | 0 | 0.00 | > 2019/06/11 15:37:24 [13345/3] STATS | 1: GET_INFO_DB | 0 | 0 | 0 > | 64445 | 0.36 | > 2019/06/11 15:37:24 [13345/3] STATS | 2: GET_INFO_FS | 0 | 0 | 0 > | 64445 | 0.01 | > 2019/06/11 15:37:24 [13345/3] STATS | 3: PRE_APPLY | 0 | 0 | 0 > | 64445 | 0.00 | > 2019/06/11 15:37:24 [13345/3] STATS | 4: DB_APPLY | 0 | 0 | 0 > | 64445 | 1.28 | 13.67% batched (avg batch size: 2.3) > 2019/06/11 15:37:24 [13345/3] STATS | 5: CHGLOG_CLR | 0 | 0 | 0 > | 0 | 0.00 | > 2019/06/11 15:37:24 [13345/3] STATS | 6: RM_OLD_ENTRIES | 0 | 0 | 0 > | 0 | 0.00 | > 2019/06/11 15:37:24 [13345/3] STATS | DB ops: get=0/ins=129310/upd=0/rm=0 > 2019/06/11 15:52:24 [13345/3] STATS | ==================== Dumping stats at > 2019/06/11 15:52:24 ===================== > 2019/06/11 15:52:24 [13345/3] STATS | ======== General statistics ========= > 2019/06/11 15:52:24 [13345/3] STATS | Daemon start time: 2019/06/11 15:07:24 > 2019/06/11 15:52:24 [13345/3] STATS | Started modules: scan > 2019/06/11 15:52:24 [13345/3] STATS | ======== FS scan statistics ========= > 2019/06/11 15:52:24 [13345/3] STATS | current scan interval = 18.5d > 2019/06/11 15:52:24 [13345/3] STATS | scan is running: > 2019/06/11 15:52:24 [13345/3] STATS | started at : 2019/06/11 15:07:24 > (45.0min ago) > 2019/06/11 15:52:24 [13345/3] STATS | last action: 2019/06/11 15:52:24 > (00s ago) > 2019/06/11 15:52:24 [13345/3] STATS | progress : 202031 entries > scanned (0 errors) > 2019/06/11 15:52:24 [13345/3] STATS | inst. speed (potential): 38.16 > entries/sec (52.41 ms/entry/thread) > 2019/06/11 15:52:24 [13345/3] STATS | avg. speed (effective): 74.83 > entries/sec (26.65 ms/entry/thread) > 2019/06/11 15:52:24 [13345/3] STATS | ==== EntryProcessor Pipeline Stats === > 2019/06/11 15:52:24 [13345/3] STATS | Idle threads: 16 > 2019/06/11 15:52:24 [13345/3] STATS | Id constraints count: 0 (hash > min=0/max=0/avg=0.0) > 2019/06/11 15:52:24 [13345/3] STATS | Name constraints count: 0 (hash > min=0/max=0/avg=0.0) > 2019/06/11 15:52:24 [13345/3] STATS | Stage | Wait | Curr | Done > | Total | ms/op | > 2019/06/11 15:52:24 [13345/3] STATS | 0: GET_FID | 0 | 0 | 0 > | 0 | 0.00 | > 2019/06/11 15:52:24 [13345/3] STATS | 1: GET_INFO_DB | 0 | 0 | 0 > | 67000 | 0.36 | > 2019/06/11 15:52:24 [13345/3] STATS | 2: GET_INFO_FS | 0 | 0 | 0 > | 67000 | 0.01 | > 2019/06/11 15:52:24 [13345/3] STATS | 3: PRE_APPLY | 0 | 0 | 0 > | 67000 | 0.00 | > 2019/06/11 15:52:24 [13345/3] STATS | 4: DB_APPLY | 0 | 0 | 0 > | 67000 | 1.27 | 12.16% batched (avg batch size: 2.4) > 2019/06/11 15:52:24 [13345/3] STATS | 5: CHGLOG_CLR | 0 | 0 | 0 > | 0 | 0.00 | > 2019/06/11 15:52:24 [13345/3] STATS | 6: RM_OLD_ENTRIES | 0 | 0 | 0 > | 0 | 0.00 | > 2019/06/11 15:52:24 [13345/3] STATS | DB ops: get=0/ins=196310/upd=0/rm=0 > 2019/06/12 01:07:24 [13345/3] STATS | ==================== Dumping stats at > 2019/06/12 01:07:24 ===================== > 2019/06/12 01:07:24 [13345/3] STATS | ======== General statistics ========= > 2019/06/12 01:07:24 [13345/3] STATS | Daemon start time: 2019/06/11 15:07:24 > 2019/06/12 01:07:24 [13345/3] STATS | Started modules: scan > 2019/06/12 01:07:24 [13345/3] STATS | ======== FS scan statistics ========= > 2019/06/12 01:07:24 [13345/3] STATS | current scan interval = 18.5d > 2019/06/12 01:07:24 [13345/3] STATS | scan is running: > 2019/06/12 01:07:24 [13345/3] STATS | started at : 2019/06/11 15:07:24 > (10.0h ago) > 2019/06/12 01:07:24 [13345/3] STATS | last action: 2019/06/12 01:07:24 > (00s ago) > 2019/06/12 01:07:24 [13345/3] STATS | progress : 3840469 entries > scanned (0 errors) > 2019/06/12 01:07:24 [13345/3] STATS | inst. speed (potential): 30.58 > entries/sec (65.39 ms/entry/thread) > 2019/06/12 01:07:24 [13345/3] STATS | avg. speed (effective): 106.68 > entries/sec (18.69 ms/entry/thread) > 2019/06/12 01:07:24 [13345/3] STATS | ==== EntryProcessor Pipeline Stats === > 2019/06/12 01:07:24 [13345/3] STATS | Idle threads: 16 > 2019/06/12 01:07:24 [13345/3] STATS | Id constraints count: 0 (hash > min=0/max=0/avg=0.0) > 2019/06/12 01:07:24 [13345/3] STATS | Name constraints count: 0 (hash > min=0/max=0/avg=0.0) > 2019/06/12 01:07:24 [13345/3] STATS | Stage | Wait | Curr | Done > | Total | ms/op | > 2019/06/12 01:07:24 [13345/3] STATS | 0: GET_FID | 0 | 0 | 0 > | 0 | 0.00 | > 2019/06/12 01:07:24 [13345/3] STATS | 1: GET_INFO_DB | 0 | 0 | 0 > | 67331 | 0.36 | > 2019/06/12 01:07:24 [13345/3] STATS | 2: GET_INFO_FS | 0 | 0 | 0 > | 67331 | 0.01 | > 2019/06/12 01:07:24 [13345/3] STATS | 3: PRE_APPLY | 0 | 0 | 0 > | 67331 | 0.00 | > 2019/06/12 01:07:24 [13345/3] STATS | 4: DB_APPLY | 0 | 0 | 0 > | 67331 | 1.29 | 12.56% batched (avg batch size: 2.4) > 2019/06/12 01:07:24 [13345/3] STATS | 5: CHGLOG_CLR | 0 | 0 | 0 > | 0 | 0.00 | > 2019/06/12 01:07:24 [13345/3] STATS | 6: RM_OLD_ENTRIES | 0 | 0 | 0 > | 0 | 0.00 | > 2019/06/12 01:07:24 [13345/3] STATS | DB ops: get=0/ins=3838009/upd=0/rm=0 > ... > 2019/06/12 01:21:21 [13345/4] FS_Scan | Full scan of /lustre/alice completed, > 3948736 entries found (0 errors). Duration = 36837.18s > 2019/06/12 01:21:21 [13345/4] FS_Scan | Flushing pipeline... > 2019/06/12 01:21:26 [13345/4] FS_Scan | File list of /lustre/alice has been > updated > 2019/06/12 01:21:26 [13345/1] Main | FS Scan finished > In general each fullscan of the new Lustre installation is very slow compared > to the old installation. > > We have also done a fullscan of both the new and old Lustre installation with > the same Robinhood client and server setup. > > 2019/06/12 01:21:26 [13345/1] EntryProc | Pipeline successfully flushed > 2019/06/12 01:21:26 [13345/1] STATS | ==== EntryProcessor Pipeline Stats === > 2019/06/12 01:21:26 [13345/1] STATS | Idle threads: 0 > 2019/06/12 01:21:26 [13345/1] STATS | Id constraints count: 0 (hash > min=0/max=0/avg=0.0) > 2019/06/12 01:21:26 [13345/1] STATS | Name constraints count: 0 (hash > min=0/max=0/avg=0.0) > 2019/06/12 01:21:26 [13345/1] STATS | Stage | Wait | Curr | Done > | Total | ms/op | > 2019/06/12 01:21:26 [13345/1] STATS | 0: GET_FID | 0 | 0 | 0 > | 43782 | 0.72 | > 2019/06/12 01:21:26 [13345/1] STATS | 1: GET_INFO_DB | 0 | 0 | 0 > | 110728 | 0.33 | > 2019/06/12 01:21:26 [13345/1] STATS | 2: GET_INFO_FS | 0 | 0 | 0 > | 110728 | 0.23 | > 2019/06/12 01:21:26 [13345/1] STATS | 3: PRE_APPLY | 0 | 0 | 0 > | 110728 | 0.00 | > 2019/06/12 01:21:26 [13345/1] STATS | 4: DB_APPLY | 0 | 0 | 0 > | 110728 | 0.66 | 61.59% batched (avg batch size: 18.9) > 2019/06/12 01:21:26 [13345/1] STATS | 5: CHGLOG_CLR | 0 | 0 | 0 > | 0 | 0.00 | > 2019/06/12 01:21:26 [13345/1] STATS | 6: RM_OLD_ENTRIES | 0 | 0 | 0 > | 1 | 4566.98 | > 2019/06/12 01:21:26 [13345/1] STATS | DB ops: get=0/ins=3948737/upd=0/rm=0 > 2019/06/12 01:21:26 [13345/1] Main | All tasks done! Exiting. > > What do you think? > Is there anything suspicious about the dumped stats that could hint to a very > slow processing? > Is it a problem that the GET_FID stages values are 0 except in the last > dumped stats? > > Let me know if you need more information. > > We would appreciate your help. > > > Best regards > Gabriele Iannetti > > _______________________________________________ > robinhood-support mailing list > robinhood-support@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/robinhood-support > _______________________________________________ robinhood-support mailing list robinhood-support@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/robinhood-support