Hi Marion,

thank you for sharing your experience!

As far as I have instigated the meta data performance on our Lustre MDTs there 
are no issues.

I am not blaming Robinhood for the slowdown, but asking for help to figure out 
what the root causes could be for this.


Best regards,
Gabriele


________________________________________
From: Marion Hakanson <hakan...@ohsu.edu>
Sent: Thursday, June 13, 2019 00:15
To: Iannetti, Gabriele
Cc: robinhood-support@lists.sourceforge.net
Subject: Re: [robinhood-support] Lustre fullscan extremely slow on Lustre 2.10

Our experience has been similar.

Initial scan, CentOS-6.6, Robinhood 2.5, Lustre IEEL-2.3 (2.5-based)
took 6 days for 387M entries.

We recently upgraded all servers and clients to CentOS-7.5, Lustre 2.10.6,
and Robinhood 3.1.4;  Same hardware everywhere.  Initial scan has been
running for 116 days, has only scanned 92M items (out of 183M).

Note that for us, Lustre-2.10.6 has been quite a bit slower for interactive
use (ls in a new directory, etc.) than the previous version.  There just seems
to be a lot more locking overhead.  So we are not at this time blaming
Robinhood for this slowdown.

I'm sorry that I can't answer any of your questions.

Regards,

Marion



> From: "Iannetti, Gabriele" <g.ianne...@gsi.de>
> To: "robinhood-support@lists.sourceforge.net"
>       <robinhood-support@lists.sourceforge.net>
> Date: Wed, 12 Jun 2019 12:52:03 +0000
>
> Hi all,
>
> we are trying to run Robinhood for our next Lustre Installation with 2.10 
> which is extremely slow.
>
> I have also tested a fullscan with the same Robinhood client and server boxes 
> of our old Lustre installation with 2.5 and it performs significantly better
> (on the old Lustre installation the avg. speed was about 2000 entries/sec 
> compared to few hundreds).
>
> Just as a comparison the full scan of the new Lustre installation took 
> relatively long...
>
> > Fullscan of old Lustre installation took 3 days for 594M entries.
> > Fullscan of new Lustre installation took 23 days for 533M entries.
>
> I do not understand the cause of the bad performance of the fullscan on our 
> next Lustre installation
> (changelog processing is also slow).
>
> So far no meta data performance or file server issues were detected.
> The meta data operations were tested with 'lfs path2fid'/'ls' on each single 
> MDS and file server synchronous write and read performance were also tested 
> with few million small files from multiple clients in parallel.
>
> Few numbers about our Lustre installations:
>
> NEW INSTALLATION
>
> Lustre Server Version: 2.10.6
> Number of MDTs: 2
> Number of OSTs: 294
>
> OLD INSTALLATION
>
> Lustre Version: 2.5
> Number of MDTs: 2
> Number of OSTs: 630
>
> LUSTRE CLIENT
> Tested with 2.10 and 2.12
>
> SINGLE FOLDER SCAN EXAMPLE
>
> In the following I post few lines of a Robinhood log of a folder scan on the 
> Lustre new installation,
> where each stats dump looks the same except the last dump stats, where the 
> GET_FID stage has non 0 value:
>
> 2019/06/11 15:07:24 [13345/2] FS_Scan | Starting scan of /lustre/xxx
> 2019/06/11 15:07:24 [13345/2] FS_Scan | Notice: this is the first scan (DB is 
> empty)
> 2019/06/11 15:22:24 [13345/3] STATS | ==================== Dumping stats at 
> 2019/06/11 15:22:24 =====================
> 2019/06/11 15:22:24 [13345/3] STATS | ======== General statistics =========
> 2019/06/11 15:22:24 [13345/3] STATS | Daemon start time: 2019/06/11 15:07:24
> 2019/06/11 15:22:24 [13345/3] STATS | Started modules: scan
> 2019/06/11 15:22:24 [13345/3] STATS | ======== FS scan statistics =========
> 2019/06/11 15:22:24 [13345/3] STATS | current scan interval = 18.5d
> 2019/06/11 15:22:24 [13345/3] STATS | scan is running:
> 2019/06/11 15:22:24 [13345/3] STATS |      started at : 2019/06/11 15:07:24 
> (15.0min ago)
> 2019/06/11 15:22:24 [13345/3] STATS |      last action: 2019/06/11 15:22:24 
> (00s ago)
> 2019/06/11 15:22:24 [13345/3] STATS |      progress   : 64932 entries scanned 
> (0 errors)
> 2019/06/11 15:22:24 [13345/3] STATS |      inst. speed (potential):     35.92 
> entries/sec (55.67 ms/entry/thread)
> 2019/06/11 15:22:24 [13345/3] STATS |      avg. speed  (effective):     72.15 
> entries/sec (27.64 ms/entry/thread)
> 2019/06/11 15:22:24 [13345/3] STATS | ==== EntryProcessor Pipeline Stats ===
> 2019/06/11 15:22:24 [13345/3] STATS | Idle threads: 16
> 2019/06/11 15:22:24 [13345/3] STATS | Id constraints count: 0 (hash 
> min=0/max=0/avg=0.0)
> 2019/06/11 15:22:24 [13345/3] STATS | Name constraints count: 0 (hash 
> min=0/max=0/avg=0.0)
> 2019/06/11 15:22:24 [13345/3] STATS | Stage              | Wait | Curr | Done 
> |     Total | ms/op |
> 2019/06/11 15:22:24 [13345/3] STATS |  0: GET_FID        |    0 |    0 |    0 
> |         0 |  0.00 |
> 2019/06/11 15:22:24 [13345/3] STATS |  1: GET_INFO_DB    |    0 |    0 |    0 
> |     64865 |  0.36 |
> 2019/06/11 15:22:24 [13345/3] STATS |  2: GET_INFO_FS    |    0 |    0 |    0 
> |     64865 |  0.01 |
> 2019/06/11 15:22:24 [13345/3] STATS |  3: PRE_APPLY      |    0 |    0 |    0 
> |     64865 |  0.00 |
> 2019/06/11 15:22:24 [13345/3] STATS |  4: DB_APPLY       |    0 |    0 |    0 
> |     64865 |  1.25 | 15.53% batched (avg batch size: 2.4)
> 2019/06/11 15:22:24 [13345/3] STATS |  5: CHGLOG_CLR     |    0 |    0 |    0 
> |         0 |  0.00 |
> 2019/06/11 15:22:24 [13345/3] STATS |  6: RM_OLD_ENTRIES |    0 |    0 |    0 
> |         0 |  0.00 |
> 2019/06/11 15:22:24 [13345/3] STATS | DB ops: get=0/ins=64865/upd=0/rm=0
> 2019/06/11 15:37:24 [13345/3] STATS | ==================== Dumping stats at 
> 2019/06/11 15:37:24 =====================
> 2019/06/11 15:37:24 [13345/3] STATS | ======== General statistics =========
> 2019/06/11 15:37:24 [13345/3] STATS | Daemon start time: 2019/06/11 15:07:24
> 2019/06/11 15:37:24 [13345/3] STATS | Started modules: scan
> 2019/06/11 15:37:24 [13345/3] STATS | ======== FS scan statistics =========
> 2019/06/11 15:37:24 [13345/3] STATS | current scan interval = 18.5d
> 2019/06/11 15:37:24 [13345/3] STATS | scan is running:
> 2019/06/11 15:37:24 [13345/3] STATS |      started at : 2019/06/11 15:07:24 
> (30.0min ago)
> 2019/06/11 15:37:24 [13345/3] STATS |      last action: 2019/06/11 15:37:24 
> (00s ago)
> 2019/06/11 15:37:24 [13345/3] STATS |      progress   : 135638 entries 
> scanned (0 errors)
> 2019/06/11 15:37:24 [13345/3] STATS |      inst. speed (potential):     48.26 
> entries/sec (41.45 ms/entry/thread)
> 2019/06/11 15:37:24 [13345/3] STATS |      avg. speed  (effective):     75.35 
> entries/sec (26.47 ms/entry/thread)
> 2019/06/11 15:37:24 [13345/3] STATS | ==== EntryProcessor Pipeline Stats ===
> 2019/06/11 15:37:24 [13345/3] STATS | Idle threads: 16
> 2019/06/11 15:37:24 [13345/3] STATS | Id constraints count: 0 (hash 
> min=0/max=0/avg=0.0)
> 2019/06/11 15:37:24 [13345/3] STATS | Name constraints count: 0 (hash 
> min=0/max=0/avg=0.0)
> 2019/06/11 15:37:24 [13345/3] STATS | Stage              | Wait | Curr | Done 
> |     Total | ms/op |
> 2019/06/11 15:37:24 [13345/3] STATS |  0: GET_FID        |    0 |    0 |    0 
> |         0 |  0.00 |
> 2019/06/11 15:37:24 [13345/3] STATS |  1: GET_INFO_DB    |    0 |    0 |    0 
> |     64445 |  0.36 |
> 2019/06/11 15:37:24 [13345/3] STATS |  2: GET_INFO_FS    |    0 |    0 |    0 
> |     64445 |  0.01 |
> 2019/06/11 15:37:24 [13345/3] STATS |  3: PRE_APPLY      |    0 |    0 |    0 
> |     64445 |  0.00 |
> 2019/06/11 15:37:24 [13345/3] STATS |  4: DB_APPLY       |    0 |    0 |    0 
> |     64445 |  1.28 | 13.67% batched (avg batch size: 2.3)
> 2019/06/11 15:37:24 [13345/3] STATS |  5: CHGLOG_CLR     |    0 |    0 |    0 
> |         0 |  0.00 |
> 2019/06/11 15:37:24 [13345/3] STATS |  6: RM_OLD_ENTRIES |    0 |    0 |    0 
> |         0 |  0.00 |
> 2019/06/11 15:37:24 [13345/3] STATS | DB ops: get=0/ins=129310/upd=0/rm=0
> 2019/06/11 15:52:24 [13345/3] STATS | ==================== Dumping stats at 
> 2019/06/11 15:52:24 =====================
> 2019/06/11 15:52:24 [13345/3] STATS | ======== General statistics =========
> 2019/06/11 15:52:24 [13345/3] STATS | Daemon start time: 2019/06/11 15:07:24
> 2019/06/11 15:52:24 [13345/3] STATS | Started modules: scan
> 2019/06/11 15:52:24 [13345/3] STATS | ======== FS scan statistics =========
> 2019/06/11 15:52:24 [13345/3] STATS | current scan interval = 18.5d
> 2019/06/11 15:52:24 [13345/3] STATS | scan is running:
> 2019/06/11 15:52:24 [13345/3] STATS |      started at : 2019/06/11 15:07:24 
> (45.0min ago)
> 2019/06/11 15:52:24 [13345/3] STATS |      last action: 2019/06/11 15:52:24 
> (00s ago)
> 2019/06/11 15:52:24 [13345/3] STATS |      progress   : 202031 entries 
> scanned (0 errors)
> 2019/06/11 15:52:24 [13345/3] STATS |      inst. speed (potential):     38.16 
> entries/sec (52.41 ms/entry/thread)
> 2019/06/11 15:52:24 [13345/3] STATS |      avg. speed  (effective):     74.83 
> entries/sec (26.65 ms/entry/thread)
> 2019/06/11 15:52:24 [13345/3] STATS | ==== EntryProcessor Pipeline Stats ===
> 2019/06/11 15:52:24 [13345/3] STATS | Idle threads: 16
> 2019/06/11 15:52:24 [13345/3] STATS | Id constraints count: 0 (hash 
> min=0/max=0/avg=0.0)
> 2019/06/11 15:52:24 [13345/3] STATS | Name constraints count: 0 (hash 
> min=0/max=0/avg=0.0)
> 2019/06/11 15:52:24 [13345/3] STATS | Stage              | Wait | Curr | Done 
> |     Total | ms/op |
> 2019/06/11 15:52:24 [13345/3] STATS |  0: GET_FID        |    0 |    0 |    0 
> |         0 |  0.00 |
> 2019/06/11 15:52:24 [13345/3] STATS |  1: GET_INFO_DB    |    0 |    0 |    0 
> |     67000 |  0.36 |
> 2019/06/11 15:52:24 [13345/3] STATS |  2: GET_INFO_FS    |    0 |    0 |    0 
> |     67000 |  0.01 |
> 2019/06/11 15:52:24 [13345/3] STATS |  3: PRE_APPLY      |    0 |    0 |    0 
> |     67000 |  0.00 |
> 2019/06/11 15:52:24 [13345/3] STATS |  4: DB_APPLY       |    0 |    0 |    0 
> |     67000 |  1.27 | 12.16% batched (avg batch size: 2.4)
> 2019/06/11 15:52:24 [13345/3] STATS |  5: CHGLOG_CLR     |    0 |    0 |    0 
> |         0 |  0.00 |
> 2019/06/11 15:52:24 [13345/3] STATS |  6: RM_OLD_ENTRIES |    0 |    0 |    0 
> |         0 |  0.00 |
> 2019/06/11 15:52:24 [13345/3] STATS | DB ops: get=0/ins=196310/upd=0/rm=0
> 2019/06/12 01:07:24 [13345/3] STATS | ==================== Dumping stats at 
> 2019/06/12 01:07:24 =====================
> 2019/06/12 01:07:24 [13345/3] STATS | ======== General statistics =========
> 2019/06/12 01:07:24 [13345/3] STATS | Daemon start time: 2019/06/11 15:07:24
> 2019/06/12 01:07:24 [13345/3] STATS | Started modules: scan
> 2019/06/12 01:07:24 [13345/3] STATS | ======== FS scan statistics =========
> 2019/06/12 01:07:24 [13345/3] STATS | current scan interval = 18.5d
> 2019/06/12 01:07:24 [13345/3] STATS | scan is running:
> 2019/06/12 01:07:24 [13345/3] STATS |      started at : 2019/06/11 15:07:24 
> (10.0h ago)
> 2019/06/12 01:07:24 [13345/3] STATS |      last action: 2019/06/12 01:07:24 
> (00s ago)
> 2019/06/12 01:07:24 [13345/3] STATS |      progress   : 3840469 entries 
> scanned (0 errors)
> 2019/06/12 01:07:24 [13345/3] STATS |      inst. speed (potential):     30.58 
> entries/sec (65.39 ms/entry/thread)
> 2019/06/12 01:07:24 [13345/3] STATS |      avg. speed  (effective):    106.68 
> entries/sec (18.69 ms/entry/thread)
> 2019/06/12 01:07:24 [13345/3] STATS | ==== EntryProcessor Pipeline Stats ===
> 2019/06/12 01:07:24 [13345/3] STATS | Idle threads: 16
> 2019/06/12 01:07:24 [13345/3] STATS | Id constraints count: 0 (hash 
> min=0/max=0/avg=0.0)
> 2019/06/12 01:07:24 [13345/3] STATS | Name constraints count: 0 (hash 
> min=0/max=0/avg=0.0)
> 2019/06/12 01:07:24 [13345/3] STATS | Stage              | Wait | Curr | Done 
> |     Total | ms/op |
> 2019/06/12 01:07:24 [13345/3] STATS |  0: GET_FID        |    0 |    0 |    0 
> |         0 |  0.00 |
> 2019/06/12 01:07:24 [13345/3] STATS |  1: GET_INFO_DB    |    0 |    0 |    0 
> |     67331 |  0.36 |
> 2019/06/12 01:07:24 [13345/3] STATS |  2: GET_INFO_FS    |    0 |    0 |    0 
> |     67331 |  0.01 |
> 2019/06/12 01:07:24 [13345/3] STATS |  3: PRE_APPLY      |    0 |    0 |    0 
> |     67331 |  0.00 |
> 2019/06/12 01:07:24 [13345/3] STATS |  4: DB_APPLY       |    0 |    0 |    0 
> |     67331 |  1.29 | 12.56% batched (avg batch size: 2.4)
> 2019/06/12 01:07:24 [13345/3] STATS |  5: CHGLOG_CLR     |    0 |    0 |    0 
> |         0 |  0.00 |
> 2019/06/12 01:07:24 [13345/3] STATS |  6: RM_OLD_ENTRIES |    0 |    0 |    0 
> |         0 |  0.00 |
> 2019/06/12 01:07:24 [13345/3] STATS | DB ops: get=0/ins=3838009/upd=0/rm=0
> ...
> 2019/06/12 01:21:21 [13345/4] FS_Scan | Full scan of /lustre/alice completed, 
> 3948736 entries found (0 errors). Duration = 36837.18s
> 2019/06/12 01:21:21 [13345/4] FS_Scan | Flushing pipeline...
> 2019/06/12 01:21:26 [13345/4] FS_Scan | File list of /lustre/alice has been 
> updated
> 2019/06/12 01:21:26 [13345/1] Main | FS Scan finished
> In general each fullscan of the new Lustre installation is very slow compared 
> to the old installation.
>
> We have also done a fullscan of both the new and old Lustre installation with 
> the same Robinhood client and server setup.
>
> 2019/06/12 01:21:26 [13345/1] EntryProc | Pipeline successfully flushed
> 2019/06/12 01:21:26 [13345/1] STATS | ==== EntryProcessor Pipeline Stats ===
> 2019/06/12 01:21:26 [13345/1] STATS | Idle threads: 0
> 2019/06/12 01:21:26 [13345/1] STATS | Id constraints count: 0 (hash 
> min=0/max=0/avg=0.0)
> 2019/06/12 01:21:26 [13345/1] STATS | Name constraints count: 0 (hash 
> min=0/max=0/avg=0.0)
> 2019/06/12 01:21:26 [13345/1] STATS | Stage              | Wait | Curr | Done 
> |     Total | ms/op |
> 2019/06/12 01:21:26 [13345/1] STATS |  0: GET_FID        |    0 |    0 |    0 
> |     43782 |  0.72 |
> 2019/06/12 01:21:26 [13345/1] STATS |  1: GET_INFO_DB    |    0 |    0 |    0 
> |    110728 |  0.33 |
> 2019/06/12 01:21:26 [13345/1] STATS |  2: GET_INFO_FS    |    0 |    0 |    0 
> |    110728 |  0.23 |
> 2019/06/12 01:21:26 [13345/1] STATS |  3: PRE_APPLY      |    0 |    0 |    0 
> |    110728 |  0.00 |
> 2019/06/12 01:21:26 [13345/1] STATS |  4: DB_APPLY       |    0 |    0 |    0 
> |    110728 |  0.66 | 61.59% batched (avg batch size: 18.9)
> 2019/06/12 01:21:26 [13345/1] STATS |  5: CHGLOG_CLR     |    0 |    0 |    0 
> |         0 |  0.00 |
> 2019/06/12 01:21:26 [13345/1] STATS |  6: RM_OLD_ENTRIES |    0 |    0 |    0 
> |         1 | 4566.98 |
> 2019/06/12 01:21:26 [13345/1] STATS | DB ops: get=0/ins=3948737/upd=0/rm=0
> 2019/06/12 01:21:26 [13345/1] Main | All tasks done! Exiting.
>
> What do you think?
> Is there anything suspicious about the dumped stats that could hint to a very 
> slow processing?
> Is it a problem that the GET_FID stages values are 0 except in the last 
> dumped stats?
>
> Let me know if you need more information.
>
> We would appreciate your help.
>
>
> Best regards
> Gabriele Iannetti
>
> _______________________________________________
> robinhood-support mailing list
> robinhood-support@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/robinhood-support
>



_______________________________________________
robinhood-support mailing list
robinhood-support@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/robinhood-support

Reply via email to