On Sun, Feb 7, 2016 at 3:48 PM, Chris Hunter <[email protected]> wrote:

> The behaviour you describe when doing full scans, where the rate starts
> high and then slows over time, is similar to what I found. This slow-down
> happens when the system cache/buffers fill; the scan rate is then limited
> by the timeout/aging rate of the Lustre LRU locks. If you "dump" the
> system cache via the sysctl param "vm.drop_caches", the rate should speed
> up temporarily (FWIW, this will also dump your SQL DB buffers). You may
> wish to investigate the sysctl param "vfs_cache_pressure" and the Lustre
> ldlm params "lru_size" & "lru_max_age".
>
> regards,
> chris hunter
>
>
Chris,

Yes, I saw your posting on this list a while back, and suggestions from
other helpful folks, so we have already been running with the following...

* sysctl tuning:
    vm.vfs_cache_pressure = 150

* Lustre parameters:
    llite.*.statahead_max=4
    mdc.*.max_rpcs_in_flight=64
    ldlm.namespaces.*osc*.lru_max_age=1200
    ldlm.namespaces.*osc*.lru_size=100

* Dropping inode cache daily, with:
    echo 2 > /proc/sys/vm/drop_caches
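
For reference, this is roughly how we apply the settings above at runtime (values are the ones from this thread, not recommendations; adjust for your site). Note that plain `lctl set_param` values do not survive a remount, and these commands require root on the client/MDS as appropriate:

```shell
# Apply the sysctl and Lustre tunables listed above (illustrative values).
sysctl -w vm.vfs_cache_pressure=150
lctl set_param llite.*.statahead_max=4
lctl set_param mdc.*.max_rpcs_in_flight=64
lctl set_param ldlm.namespaces.*osc*.lru_max_age=1200
lctl set_param ldlm.namespaces.*osc*.lru_size=100

# Daily cache drop: "2" frees reclaimable slab objects
# (dentries and inodes) but leaves the page cache alone.
echo 2 > /proc/sys/vm/drop_caches
```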

So either there is something else that you are doing that we are not, or we
are hitting a different issue entirely.

Also, please note that what I'm concerned with is NOT the rate of full
scans.  I don't (much) care how long those take.  Rather, it is the
changelog processing rate which is too slow.  My hunch is that it is in the
Lustre changelog handling itself, but I don't (yet) have a good way to test
changelog performance without Robinhood.
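One crude way to separate the two, sketched below: register a second changelog reader on the MDS and time a raw record dump with `lfs changelog`, bypassing Robinhood entirely. The MDT name and reader IDs here are illustrative; the backlog arithmetic at the bottom uses the sample counters quoted later in this thread and runs anywhere:

```shell
# On the MDS, register an extra reader so the test doesn't consume
# Robinhood's records (hypothetical; prints an ID such as cl2):
#   lctl --device lfs3-MDT0000 changelog_register
#
# On a client, time a raw stream of records through the new reader:
#   time lfs changelog lfs3-MDT0000 | head -n 100000 > /dev/null
#
# The backlog for a reader is "current index" minus that reader's index.
# Sample changelog_users output (on a live MDS, read
# /proc/fs/lustre/mdd/*MDT*/changelog_users instead):
users='current index: 591196749
ID    index
cl1   591164109'

current=$(echo "$users" | awk '/current index/ {print $3}')
reader=$(echo "$users" | awk '$1 == "cl1" {print $2}')
echo "backlog for cl1: $((current - reader)) records"
```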

Thanks,
Nathan



> > Date: Sun, 7 Feb 2016 12:00:52 -0700
> > From: Nathan Dauchy - NOAA Affiliate <[email protected]>
> > Subject: Re: [robinhood-support] How to improve changelog processing
> >       rates
> >
> > Colin,
> >
> > Thanks for your reply.  I look forward to hearing what "best practices"
> > you come up with for improving performance.  We have already used the
> > http://mysqltuner.pl/ script, and matched the server memory to the DB
> > size, but not done much beyond that.
> >
> > That said, I don't think the problem with the system is the database.
> > When running a full scan, we get "average speed:   3018.87 entries/sec",
> > which I would have thought is more stressful on the DB than changelog
> > processing, and yet it runs several times faster.
> >
> > I also just tried another test: since our changelog backlog was so bad
> > and getting worse, I figured we could pick up the missed bits with a
> > full scan later.  I went ahead and cleared the changelogs.  That seemed
> > to let Robinhood progress much faster, at least for a little while.
> >
> > [root@sherwood ~]# lfs changelog_clear lfs3-MDT0000 cl1 0
> >
> > Before the clear:
> >
> > 2016/02/04 19:49:21 [38629/1] STATS |    read speed = 484.42 record/sec
> > 2016/02/04 20:04:21 [38629/1] STATS |    read speed = 448.04 record/sec
> > 2016/02/04 20:19:21 [38629/1] STATS |    read speed = 458.21 record/sec
> > 2016/02/04 20:34:21 [38629/1] STATS |    read speed = 464.67 record/sec
> > 2016/02/04 20:49:21 [38629/1] STATS |    read speed = 460.15 record/sec
> > 2016/02/04 21:04:21 [38629/1] STATS |    read speed = 473.90 record/sec
> >
> > After the clear:
> >
> > 2016/02/04 21:19:21 [38629/1] STATS |    read speed = 30047.39 record/sec (14756.61 incl. idle time)
> > 2016/02/04 21:34:22 [38629/1] STATS |    read speed = 16627.84 record/sec (942.24 incl. idle time)
> > 2016/02/04 21:49:22 [38629/1] STATS |    read speed = 1552.43 record/sec (1216.07 incl. idle time)
> > 2016/02/04 22:04:22 [38629/1] STATS |    read speed = 2077.57 record/sec (1041.09 incl. idle time)
> > 2016/02/04 22:19:22 [38629/1] STATS |    read speed = 3987.93 record/sec (908.36 incl. idle time)
> > 2016/02/04 22:34:22 [38629/1] STATS |    read speed = 2482.72 record/sec (918.61 incl. idle time)
> > 2016/02/04 22:49:22 [38629/1] STATS |    read speed = 1997.80 record/sec (854.61 incl. idle time)
> > 2016/02/04 23:04:22 [38629/1] STATS |    read speed = 2259.46 record/sec (1031.82 incl. idle time)
> > 2016/02/04 23:19:22 [38629/1] STATS |    read speed = 1521.73 record/sec (953.62 incl. idle time)
> > 2016/02/04 23:34:22 [38629/1] STATS |    read speed = 1321.58 record/sec (879.58 incl. idle time)
> >
> > Current counters:
> >
> > [root@lfs-mds-3-1 ~]# cat /proc/fs/lustre/mdd/*MDT*/changelog_users
> > current index: 591196749
> > ID    index
> > cl1   591164109
> >
> > So, a backlog of only 32640 after running for 2.5 hours, which is WAY
> > better than it was before.
> >
> > That sounds a lot like https://jira.hpdd.intel.com/browse/LU-5405.  So,
> > just to be sure, I went back and verified our Lustre server source...
> >
> > [root@lfs-mds-3-1 lustre-2.5.37.ddn1]# grep -n -m 1 -C 2 cfs_list_add
> > lustre/obdclass/llog_cat.c
> > 202-
> > 203-    down_write(&cathandle->lgh_lock);
> > 204:    cfs_list_add_tail(&loghandle->u.phd.phd_entry,
> > &cathandle->u.chd.chd_head);
> > 205-    up_write(&cathandle->lgh_lock);
> > 206-
> >
> > ...and that looks like we DO have the fix.
> >
> >
> > So, is there perhaps another bug here in changelog processing?
> >
> > Other tuning that I'm missing?
> >
> > Thanks,
> > Nathan
> >
> >
> > On Thu, Feb 4, 2016 at 1:57 PM, Colin Faber <[email protected]> wrote:
> >
> >> Hi Nathan,
> >>
> >> I'm actively working on similar benchmarking activities.  From what
> >> I've found so far, most of my slowdowns are within the database
> >> processing itself.  This can be observed by utilizing DB benchmarking
> >> tools (sysbench with OLTP testing).
> >>
> >> From what I've found, Robinhood usually sees double the best-case
> >> performance (transactions per second) on a complex, read-only workload
> >> over a million-record set.
> >>
> >> These rates are pretty bad, even on beefy hardware.  Once you've solved
> >> your database performance problems, you get to move on to issues with
> >> FID lookup.  There are a few tickets (which I don't have off hand) to
> >> improve this (multiple FID lookups per request, etc.), but it's still
> >> slow.
> >>
> >> In general, though, changelog reading shouldn't be a bottleneck; even
> >> on my single-processor E5-2609 based system I'm able to read almost
> >> 100k records/second.
> >>
> >> -cf
> >>
> >>
> >> On Thu, Feb 4, 2016 at 1:40 PM, Nathan Dauchy - NOAA Affiliate <
> >> [email protected]> wrote:
> >>
> >>> Greetings Robinhood Developers and Users,
> >>>
> >>> It seems as though Robinhood changelog processing is having trouble
> >>> keeping up on our system.  From reading mailing list threads, I
> >>> believe we should be expecting higher processing rates.  However,
> >>> trials of various suggestions have not yielded much better
> >>> performance.  Does anyone have guidance on how to identify exactly
> >>> where the bottleneck is, or suggestions on how to speed things up?
> >>>
_______________________________________________
robinhood-support mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/robinhood-support
