Sounds to me like maintaining the metadata cache is important - and tuning the
filesystem to do so would be more beneficial than caching writes, especially
on a backup target, where data, once written, will likely never be read
again (and it isn't a big deal if it is, since so few files change compared to
the total # of inodes to scan).

Your report that the re-sync takes only minutes shows an unthrashed cache is
highly valuable. So all we need to do is tune the backup target (and even the
servers themselves) to maintain more metadata. I don't know how much RAM is
used per inode, but I'd throw another 4-8GB per box at metadata caching, or
even more, if it sped up scanning.

(Really, one only needs it on the backup target - if there are N servers to
back up and you can run all the backups in parallel, they can all run at
speed, as long as scanning metadata on the backup target is fast enough to
keep up with them all. My total data written is only 20-30GB, for example,
which even at a slow 20-30MB/s is only about 15 minutes of writing. Even ten
times that much changed data would be about 150 minutes at that rate, and the
rate could easily be 4x that.)
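Running them all at once is trivial with xargs - a minimal sketch, where
servers.txt, the hosts, and the paths are all placeholders:

    # one rsync per source server, 8 at a time, all landing on one target
    $ xargs -a servers.txt -P 8 -I{} rsync -a {}:/data/ /backup/{}/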

So, tuning caches to prefer metadata seems to be key. How?
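Two knobs I know of - hedged, and worth benchmarking before trusting:

    # Linux: prefer keeping dentry/inode cache over page cache
    # (default is 100; lower means reclaim metadata less aggressively)
    $ sudo sysctl vm.vfs_cache_pressure=50

    # ZFS: tell the ARC to cache only metadata for the backup dataset
    # ("tank/backup" is a placeholder)
    $ sudo zfs set primarycache=metadata tank/backup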

As we've discussed before, letting the filesystem rescan everything throws
away precious metadata cache, so tracking your own changes (since the backup
system won't be used for anything else, right? :) would be beneficial. Of
course the danger is someone using the backup system for something else and
changing the target data - inconsistencies would crop up and make the backup
worthless very quickly.
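Something like inotify makes the tracking concrete - a sketch, assuming
inotify-tools, with host and paths hypothetical (deletes, reboots, and
watcher downtime all need separate handling):

    # 1) append every touched path under /data to a running log
    $ inotifywait -m -r -e modify,create,move --format '%w%f' /data \
        >> /var/log/changed-paths.log &

    # 2) at backup time, dedupe the log and skip the full scan entirely
    $ sort -u /var/log/changed-paths.log > /tmp/changed.list
    $ rsync -a --files-from=/tmp/changed.list / backuphost:/backup/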


On Fri, Jul 17, 2015 at 03:18:02PM +0000, Schweiss, Chip said:
  >Modern file systems have many internal queues, and service many clients
  >simultaneously.  They arrange their work to maximize throughput in both
  >read and write operations.  This is the norm on any enterprise file
  >system, be it Hitachi, Oracle, Dell, HP, Isilon, etc.  You will get
  >significantly higher throughput if you hit it with multiple threads.
  >These systems have elaborate predictive read-ahead caches and perform
  >best when multiple threads hit them.
  >
  >Using the test case of a single server with a simple file system such as
  >ext3/4 or xfs, no gains will be seen in multithreading rsync.  Use an
  >enterprise file system with 100's of TBs and the more threads you use the
  >faster you will go.  Metadata and data on these systems end up across
  >100's of disks.  Single threads end up severely bound by latency.  This
  >is why multi-threading should be optional.  It doesn't help everyone.
  >
  >For example, one of my rsync jobs moving from a ZFS system in St. Louis,
  >Missouri to a Hitachi HNAS in Minneapolis, Minnesota has over 100 million
  >files.  Each day 50 to 100 thousand files get added or updated.  A single
  >rsync job would take weeks to parse this job and send the changes.  I
  >split it into 120 jobs and it typically completes in 2 hours when no
  >humans are using the systems.  A re-sync immediately afterwards, again
  >with 120 jobs, scans both ends in minutes.
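
(For the archives, a minimal sketch of that kind of split - fan the
top-level directories out with xargs; hosts and paths here are hypothetical,
and it assumes no spaces in directory names and ignores files sitting
directly in the top directory:

    $ ls /export/data | xargs -P 12 -I{} \
        rsync -a /export/data/{}/ hnas:/backup/data/{}/

Real splits probably need finer granularity to keep 120 jobs evenly
sized.)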
  >-----Original Message-----
  >From: rsync [] On Behalf Of Ken Chase
  >Sent: Friday, July 17, 2015 9:51 AM
  >Subject: Re: [Bug 3099] Please parallelize filesystem scan
  >I don't understand - scanning metadata is sped up by thrashing the head
  >all over the disk instead of mostly-sequentially scanning through?
  >How does that work out?
  >On Fri, Jul 17, 2015 at 02:37:21PM +0000, said:
  >  >
  >  >
  >  >--- Comment #8 from Chip Schweiss <> ---
  >  >I would argue that optionally all directory scanning should be made
  >  >parallel.  Modern file systems perform best when request queues are
  >  >kept full.  The current mode of rsync scanning directories does nothing
  >  >to take advantage of this.
  >  >
  >  >I currently use scripts to split a couple dozen or so rsync jobs into
  >  >literally 100's of jobs.  This reduces execution time from what would
  >  >be days to a couple hours every night.  There are lots of scripts like
  >  >this available on the net because the current state of rsync is
  >  >inadequate.
  >  >
  >  >This ticket could reasonably be combined with bug 5124.
  >  >

Ken Chase - skype:kenchase23 +1 416 897 6284 Toronto 
Heavy Computing - Clued bandwidth, colocation and managed linux VPS @151 Front 
St. W.

