2016/09/09 09:12:46 [21484/2] STATS | 1: GET_INFO_DB | 9990 |  0 | 0 | 1819630 | 0.20 |
2016/09/09 09:12:46 [21484/2] STATS | 2: GET_INFO_FS |    0 | 10 | 0 | 1819613 | 3.75 |

It shows that robinhood threads are busy running FS operations; that's why incoming records are waiting to pass the GET_INFO_DB step, but it doesn't mean there is a problem with DB performance. FS latency is a little high (3.75 ms), but not catastrophic if it runs 10 operations in parallel.

If not already done, you can tune the Lustre osc and mdc max_rpcs_in_flight settings:
/proc/fs/lustre/*/*/max_rpcs_in_flight
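For example, a sketch of how to check and raise these on the client with lctl (32 is just an illustrative value; the right number depends on your workload):

```shell
# Check current values on the Lustre client
lctl get_param osc.*.max_rpcs_in_flight mdc.*.max_rpcs_in_flight

# Increase them, e.g. to 32
lctl set_param osc.*.max_rpcs_in_flight=32
lctl set_param mdc.*.max_rpcs_in_flight=32
```

Note that lctl set_param is not persistent across remounts unless you use the -P option on a recent Lustre version.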

Also consider tuning the peer_credits option of the ko2iblnd module, but note that it must then be tuned on the servers and LNet routers as well.
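A sketch of how to inspect and persist this module option (the value 32 is only an example; it must match across clients, servers, and routers):

```shell
# Check the currently loaded value (requires the o2ib LND to be loaded)
cat /sys/module/ko2iblnd/parameters/peer_credits

# Persist a new value via modprobe options; takes effect after module reload
echo 'options ko2iblnd peer_credits=32' > /etc/modprobe.d/ko2iblnd.conf
```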

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.02    0.00   80.22    0.00    0.00   19.76
This indicates that most of the CPU load is consumed by the system (kernel).
It seems there is a lot of activity in the Lustre client.

A possible explanation is that it is managing a metadata lock cache that is too large.
This can be reduced by tuning lru_size (usually between 100 and 400):
/proc/fs/lustre/ldlm/namespaces/*/lru_size
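For example, a sketch using lctl (256 is an arbitrary value in the suggested 100-400 range):

```shell
# Inspect the current LDLM LRU sizes per namespace
lctl get_param ldlm.namespaces.*.lru_size

# Pin the lock cache to a fixed size, e.g. 256 locks per namespace
lctl set_param ldlm.namespaces.*.lru_size=256
```

Setting lru_size to a positive value fixes the cache size; setting it back to 0 restores dynamic LRU sizing.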

However, I think the situation will improve once robinhood has dequeued all the changelogs accumulated during your initial scan. It is currently processing records emitted about 1.5 days ago, but processes them 4 times faster than their accumulation speed, so it should eventually catch up:
2016/09/09 09:12:46 [21484/2] STATS |    last read record time = 2016/09/07 20:28:30.099225
2016/09/09 09:12:46 [21484/2] STATS |    processing speed ratio   = 4.00
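As a back-of-the-envelope check (a sketch based on the two STATS lines above, assuming the 4.00 ratio stays constant):

```python
# Catch-up estimate from the robinhood STATS output.
lag_days = 1.5        # records being read now were emitted ~1.5 days ago
speed_ratio = 4.0     # "processing speed ratio" reported in the log

# Each day of processing consumes speed_ratio days of changelog while
# one new day accumulates, so the backlog shrinks by (ratio - 1) per day:
catch_up_days = lag_days / (speed_ratio - 1.0)
print(f"estimated catch-up time: {catch_up_days:.2f} days")  # ~0.50 days
```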

Regards,
Thomas

On 09/09/16 09:24, Marcin Stolarek wrote:
OK, I've tried a few times. I did some MySQL tuning, and what I have now is:

2016/09/09 09:12:46 [21484/2] STATS | ==================== Dumping stats at 2016/09/09 09:12:46 =====================
2016/09/09 09:12:46 [21484/2] STATS | ======== General statistics =========
2016/09/09 09:12:46 [21484/2] STATS | Daemon start time: 2016/09/09 08:42:45
2016/09/09 09:12:46 [21484/2] STATS | Started modules: log_reader
2016/09/09 09:12:46 [21484/2] STATS | ChangeLog reader #0:
2016/09/09 09:12:46 [21484/2] STATS |    fs_name    = plkra
2016/09/09 09:12:46 [21484/2] STATS |    mdt_name   = MDT0000
2016/09/09 09:12:46 [21484/2] STATS |    reader_id  = cl1
2016/09/09 09:12:46 [21484/2] STATS |    records read    = 1830902
2016/09/09 09:12:46 [21484/2] STATS |    interesting records = 1830620
2016/09/09 09:12:46 [21484/2] STATS |    suppressed records  = 282
2016/09/09 09:12:46 [21484/2] STATS |    records pending     = 1000
2016/09/09 09:12:46 [21484/2] STATS |    last received            = 2016/09/09 09:04:13
2016/09/09 09:12:46 [21484/2] STATS |    last read record time    = 2016/09/07 20:28:30.099225
2016/09/09 09:12:46 [21484/2] STATS |    last read record id      = 18391116
2016/09/09 09:12:46 [21484/2] STATS |    last pushed record id    = 18390117
2016/09/09 09:12:46 [21484/2] STATS |    last committed record id = 18380083
2016/09/09 09:12:46 [21484/2] STATS |    last cleared record id   = 18379864
2016/09/09 09:12:46 [21484/2] STATS |    read speed               = 677.18 record/sec
2016/09/09 09:12:46 [21484/2] STATS |    processing speed ratio   = 4.00
2016/09/09 09:12:46 [21484/2] STATS |    status         = busy
2016/09/09 09:12:46 [21484/2] STATS |    ChangeLog stats:
2016/09/09 09:12:46 [21484/2] STATS |    MARK: 0, CREAT: 262, MKDIR: 9, HLINK: 0, SLINK: 0, MKNOD: 0, UNLNK: 7, RMDIR: 0, RENME: 0
2016/09/09 09:12:46 [21484/2] STATS |    RNMTO: 0, OPEN: 0, CLOSE: 0, LYOUT: 0, TRUNC: 48, SATTR: 1830576, XATTR: 0, HSM: 0
2016/09/09 09:12:46 [21484/2] STATS |    MTIME: 0, CTIME: 0, ATIME: 0
2016/09/09 09:12:46 [21484/2] STATS | ==== EntryProcessor Pipeline Stats ===
2016/09/09 09:12:46 [21484/2] STATS | Idle threads: 0
2016/09/09 09:12:46 [21484/2] STATS | Id constraints count: 10000 (hash min=0/max=6/avg=0.6)
2016/09/09 09:12:46 [21484/2] STATS | Name constraints count: 18 (hash min=0/max=1/avg=0.0)
2016/09/09 09:12:46 [21484/2] STATS | Stage             |  Wait | Curr | Done |   Total | ms/op |
2016/09/09 09:12:46 [21484/2] STATS | 0: GET_FID        |     0 |    0 |    0 |       0 |  0.00 |
2016/09/09 09:12:46 [21484/2] STATS | 1: GET_INFO_DB    |  9990 |    0 |    0 | 1819630 |  0.20 |
2016/09/09 09:12:46 [21484/2] STATS | 2: GET_INFO_FS    |     0 |   10 |    0 | 1819613 |  3.75 |
2016/09/09 09:12:46 [21484/2] STATS | 3: PRE_APPLY      |     0 |    0 |    0 | 1819606 |  0.00 |
2016/09/09 09:12:46 [21484/2] STATS | 4: DB_APPLY       |     0 |    0 |    0 | 1819606 |  0.17 | 99.76% batched (avg batch size: 17.0)
2016/09/09 09:12:46 [21484/2] STATS | 5: CHGLOG_CLR     |     0 |    0 |    0 | 1819620 |  0.01 |
2016/09/09 09:12:46 [21484/2] STATS | 6: RM_OLD_ENTRIES |     0 |    0 |    0 |       0 |  0.00 |
2016/09/09 09:12:46 [21484/2] STATS | DB ops: get=1819360/ins=246/upd=1819360/rm=0
2016/09/09 09:12:46 [21484/2] STATS | --- Pipeline stage details ---
2016/09/09 09:12:46 [21484/2] STATS | GET_INFO_DB : first: changelog record #18380115, fid=[0x2000291ab:0xdb7b:0x0], status=waiting
2016/09/09 09:12:46 [21484/2] STATS | GET_INFO_DB : last:  changelog record #18390116, fid=[0x200002eae:0x1b148:0x0], status=waiting
2016/09/09 09:12:46 [21484/2] STATS | GET_INFO_FS : first: changelog record #18380085, fid=[0x200025ab0:0x1a350:0x0], status=processing
2016/09/09 09:12:46 [21484/2] STATS | GET_INFO_FS : last:  changelog record #18380112, fid=[0x2000291ab:0xdb7a:0x0], status=processing


What can someone experienced with robinhood read from this? I see that I have a lot of waiting GET_INFO_DB operations; does it mean my database is responding slowly? Should I check for slow queries?

My server responsiveness is limited :)

What I have from iostat is:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.02    0.00   80.22    0.00    0.00   19.76

Device:            tps    MB_read/s    MB_wrtn/s  MB_read    MB_wrtn
sda               1.88         0.00         0.02  0          0


cheers,
Marcin

2016-09-07 13:31 GMT+02:00 LEIBOVICI Thomas <[email protected]>:

    On 09/07/16 13:03, Marcin Stolarek wrote:
    Hi,

    I installed robinhood 3.0-rc1 with lustre 2.5. The initial scan of
    the filesystem took around 2 days. During this time, the changelog
    was enabled.
    Today I started:
    #robinhood --readlog --once
     and after a few hours robinhood process started to use a lot of
    CPU (around 700% - everything available). Is this normal?
    Why not... if it's for doing useful things, such as processing
    bursts of changelogs :)
    But I'd say it may also be a sign of something wrong...
    You can check whether the changelog processing is going well and how
    it has evolved:
    grep STATS /var/log/robinhood.log

    Thomas

    cheers,
    Marcin


    
------------------------------------------------------------------------------

    _______________________________________________
    robinhood-support mailing list
    [email protected]
    <mailto:[email protected]>
    https://lists.sourceforge.net/lists/listinfo/robinhood-support
    <https://lists.sourceforge.net/lists/listinfo/robinhood-support>

