On Wed, Nov 26, 2014 at 12:48 PM, Thomas LEIBOVICI <[email protected]>
wrote:

>  Le 26/11/2014 18:19, Craig Tierney - NOAA Affiliate a écrit :
>
> Thomas,
>
>  We backported the patch.  It was just a one-liner to put changelog
> entries at the tail, versus the head, of the list.  After the last catch-up
> of the changelogs completed, I created a bunch of new files while robinhood
> was not running.  The processing rate is still about 400 entries per
> second.   In particular, it looked like it was processing about 1024
> records every 2.5 seconds.
>
>  So I looked in the configuration and saw that I had:
>
>    # clear changelog every 1024 records:
>     batch_ack_count = 1024 ;
>
> Craig,
>
> This is strange. The behavior you describe sounds exactly like the problem
> that the patch is meant to fix:
> every changelog_clear() call to the MDS stalls changelog delivery for a
> while.
>
> Are there a lot of stacked records? As far as I can remember, you can see
> this on the MDS in /proc/fs/lustre/*mdd*/changelog_users or something like
> that: it gives you the last record id and the last cleared record.
>
>
What I have been doing to determine the changelog processing rate is to use
the changelog_users information.   For example:


[root@lfs-mds-2-1 ~]# !cat
cat /proc/fs/lustre/mdd/lfs2-MDT0000/changelog_users ; sleep 30 ; cat
/proc/fs/lustre/mdd/lfs2-MDT0000/changelog_users
current index: 265951473
ID    index
cl1   265796018
current index: 265951473
ID    index
cl1   265816018
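As a side note, the two samples above can be turned into a processing rate
with plain shell arithmetic (index values hardcoded here from this session;
on a live system they would come from changelog_users):

```shell
# Reader (cl1) index sampled 30 s apart, plus the producer's current index
idx1=265796018
idx2=265816018
current=265951473
interval=30

# shell arithmetic is integer-only, which is close enough for a rate
echo "consumer advanced $(( idx2 - idx1 )) records (~$(( (idx2 - idx1) / interval )) records/sec)"
echo "remaining backlog: $(( current - idx2 )) records"
```

This works out to roughly 666 records/sec, close to the read speed robinhood
itself reports in its STATS lines.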

Even though the backlog is ~155k changelogs, and the batch_ack_count is
10000, the current index did not move over those 30 seconds, and the
consumer (cl1) advanced only 20,000 records.   In the stats, I see:

2014/11/26 20:59:19 [14613/2] STATS | ChangeLog reader #0:
2014/11/26 20:59:19 [14613/2] STATS |    fs_name    =   lfs2
2014/11/26 20:59:19 [14613/2] STATS |    mdt_name   =   MDT0000
2014/11/26 20:59:19 [14613/2] STATS |    reader_id  =   cl1
2014/11/26 20:59:19 [14613/2] STATS |    records read        = 117135
2014/11/26 20:59:19 [14613/2] STATS |    interesting records = 117135
2014/11/26 20:59:19 [14613/2] STATS |    suppressed records  = 0
2014/11/26 20:59:19 [14613/2] STATS |    records pending     = 0
2014/11/26 20:59:19 [14613/2] STATS |    last received            = 2014/11/26 20:59:17
2014/11/26 20:59:19 [14613/2] STATS |    last read record time    = 2014/11/26 20:55:06.685525
2014/11/26 20:59:19 [14613/2] STATS |    last read record id      = 265806017
2014/11/26 20:59:19 [14613/2] STATS |    last pushed record id    = 265806017
2014/11/26 20:59:19 [14613/2] STATS |    last committed record id = 265796017
2014/11/26 20:59:19 [14613/2] STATS |    last cleared record id   = 265796017
2014/11/26 20:59:19 [14613/2] STATS |    read speed               = 672.94 record/sec (247.03 incl. idle time)
2014/11/26 20:59:19 [14613/2] STATS |    processing speed ratio   = 20.92
2014/11/26 20:59:19 [14613/2] STATS |    status                   = terminating
2014/11/26 20:59:19 [14613/2] STATS |    ChangeLog stats:
2014/11/26 20:59:19 [14613/2] STATS |    MARK: 0, CREAT: 44, MKDIR: 0, HLINK: 0, SLINK: 0, MKNOD: 0, UNLNK: 110499, RMDIR: 6592
2014/11/26 20:59:19 [14613/2] STATS |    RENME: 0, RNMTO: 0, OPEN: 0, CLOSE: 0, LYOUT: 0, TRUNC: 0, SATTR: 0, XATTR: 0, HSM: 0
2014/11/26 20:59:19 [14613/2] STATS |    MTIME: 0, CTIME: 0, ATIME: 0

But right now it seems to be stuck.
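As an aside, the pattern I saw earlier (roughly 1024 records every 2.5
seconds) is exactly what one changelog_clear per batch would look like; a
quick sanity check with the numbers from above (awk, since shell arithmetic
is integer-only):

```shell
# seconds between clears = batch_ack_count / observed processing rate
awk -v batch=1024 -v rate=400 \
    'BEGIN { printf "one changelog_clear every %.2f s\n", batch / rate }'
```

That gives about 2.56 s per clear, matching the observed cadence.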

Craig

>  I don't know why this would slow things down; I thought it was just an
> update optimization.  I ran some tests with a different changelog user, and
> it seems dumping the changelogs and updating the position should never be
> the limitation, as I was able to grab over 100,000 entries and reset the
> count in a few seconds.
>
> OK.
>
>
>  So I updated batch_ack_count to 10,000.  Now the changelog processing
> rate seems to have gone up to 1666 records/second (over 30 seconds).   This is
> better.  If the rate is limited by the database performance, then there
> probably isn't much more I can do (comparing to scan rates).
>
> A "grep STAT" in the robinhood log would help to identify the limitation
> you hit.
> If you want to sample stats over a shorter period than the default (which
> is 15 or 20 minutes), you can change the "stats_interval" in the config.
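For anyone searching later, here is a sketch of where these two knobs live
in the robinhood configuration (block and parameter names as I recall them
from the 2.5 config templates; verify against the sample configs shipped
with your version):

```
ChangeLog {
    MDT {
        mdt_name  = "MDT0000";
        reader_id = "cl1";
    }
    # acknowledge (clear) records in batches of this size
    batch_ack_count = 10000;
}

Log {
    # report STATS lines more often than the default
    stats_interval = 1min;
}
```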
>
>
>  What do people use for a value of batch_ack_count on large, PB sized,
> filesystems?
>
> I think a good value is a few seconds' worth of changelog processing, so
> 10k is a good value in your case.
>
>
> Regards
>
>
>  Thanks,
> Craig
>
>
> On Tue, Nov 18, 2014 at 3:00 AM, LEIBOVICI Thomas <[email protected]
> > wrote:
>
>>  Hi Craig,
>>
>> No, it is not expected to get such a slow processing speed.
>> Given the Lustre version you run, this slow processing may be due to the
>> following Lustre bug:
>>
>> https://jira.hpdd.intel.com/browse/LU-5405
>>
>> It is an MDS fix. For now the fix has only landed in Lustre 2.5.4. I don't
>> know if it can be backported to Lustre 2.4...
>>
>> Regards,
>> Thomas
>>
>>
>> On 11/17/14 21:11, Craig Tierney - NOAA Affiliate wrote:
>>
>>  Hi,
>>
>>  I have just installed Robinhood 2.5.3 to monitor a Lustre 2.4.3
>> system.  The client on the server is running the 2.5.3 version.  When I did
>> an initial scan of another test system, I saw scan rates of about 1000-2000
>> entries per second.  While I had configured robinhood to monitor this new
>> system, the Robinhood server was not running when we started to copy data
>> to the new filesystem.  From the changelog statistics, I am about 144M
>> events behind.  Processing the changelogs seems to be going at only 375
>> entries per second.
>>
>>  Is this typical?  I would have expected the processing of changelog
>> events to be much faster than this or at least as fast as a normal file
>> scan.
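For scale, at 375 entries per second a 144M-record backlog would take days
to drain; a back-of-the-envelope check:

```shell
# ~144 million pending changelog records at 375 records/sec
awk 'BEGIN { secs = 144e6 / 375; printf "%.0f s (~%.1f days) to drain\n", secs, secs / 86400 }'
```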
>>
>>  Thanks,
>> Craig
>
>
_______________________________________________
robinhood-support mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/robinhood-support
