Re: Huge directory tree: Get files to sync via tools like sysdig

2017-02-09 Thread Ben RUBSON
> On 10 Feb 2017, at 01:21, Karl O. Pinc  wrote:
> 
> On Fri, 10 Feb 2017 12:38:32 +1300
> Henri Shustak  wrote:
> 
>> As Ben mentioned, ZFS snapshots are one possible approach. Another
>> approach is to have a faster storage system. I have seen considerable
>> speed improvements with rsync on similar data sets by, say, upgrading
>> the storage subsystem.
> 
> Another possibility could be to use LVM and lvmcache to put an SSD in
> front of the spinning disks.  This would only improve things if
> you didn't otherwise fill up the cache with data -- you want
> the cache to contain inodes.  So this might work only if your
> SSD cache was larger than whatever amount of data you typically
> write between rsync runs, plus enough to hold all the inodes
> in your rsync-ed fs.
> 
> I've not tried this.  I'm not even certain it's a good idea.  It's
> just a thought.

It's also possible to have an SSD cache with ZFS (called the L2ARC).
You can even ask this cache to store only your metadata.

Some (perhaps the same) changes may also be needed on the receiver/server
side (depending on its current setup) to see a performance improvement.

Ben

-- 
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html


Re: Huge directory tree: Get files to sync via tools like sysdig

2017-02-09 Thread Henri Shustak
That sounds like it certainly would not hurt!


This email is protected by LBackup, an open source backup solution
http://www.lbackup.org




Re: Huge directory tree: Get files to sync via tools like sysdig

2017-02-09 Thread Henri Shustak
As Ben mentioned, ZFS snapshots are one possible approach. Another approach is
to have a faster storage system. I have seen considerable speed improvements
with rsync on similar data sets by, say, upgrading the storage subsystem.







Re: Huge directory tree: Get files to sync via tools like sysdig

2017-02-09 Thread Ben RUBSON

> On 09 Feb 2017, at 16:10, Thomas Güttler  wrote:
> 
> On 09.02.2017 at 11:05, Ben RUBSON wrote:
>>> On 09 Feb 2017, at 10:05, Thomas Güttler  
>>> wrote:
>>> 
>>> Hi,
>>> 
>>> we have a huge directory tree.
>>> 
>>> 
>>> * 17M files (number of files)
>>> * 2.2TBytes of data.
>>> * Only 0.1% changes per day
>>> 
>>> Current pain: rsync's directory tree traversal takes too long to discover the
>>> changed files.
>> 
>> Hi,
>> 
>> On which type of FS is this directory ?
> 
> ext4

Is there any way to use snapshots in your backup strategy?
Or to use a ZFS-ready OS to benefit from an SSD cache (which would store your
metadata)?


Re: Huge directory tree: Get files to sync via tools like sysdig

2017-02-09 Thread Axel Kittenberger
Directory creation is not a race condition when done properly.

The application (like Lsyncd) gets a directory creation event, creates a
watch for the directory, and scans the new directory for files or
subdirectories in it; subdirectories are handled recursively.

This way nothing can be missed.
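
A minimal sketch of that watch-then-scan order, in Python. This is not
Lsyncd's code: add_watch and on_found are hypothetical callbacks standing in
for the real inotify watch registration and for whatever collects the paths
to sync.

```python
import os


def handle_dir_created(path, add_watch, on_found):
    """Handle a 'directory created' event for `path`.

    The watch is registered before the directory is scanned. Anything
    created after the watch exists raises its own event; anything created
    before it is picked up by the scan.
    """
    add_watch(path)  # hypothetical callback; a real tool would add an inotify watch here
    try:
        with os.scandir(path) as iterator:
            entries = list(iterator)
    except FileNotFoundError:
        return  # the directory was removed again before it could be scanned
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            # A subdirectory found by the scan is treated like a new creation event.
            handle_dir_created(entry.path, add_watch, on_found)
        else:
            on_found(entry.path)  # hypothetical callback collecting paths to sync
```

A file can be reported twice with this order (once by the scan and once by an
event), which is harmless when the result is only a list of paths to hand to
rsync.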

The general warning of "bugs may be possible" is a no-brainer. Yes, they
are always possible, everywhere.

As said, there are some issues with the "move" (aka rename) event being
detected as such; sometimes it may be detected as a create / delete without
properly acknowledging the move within the watched tree. And events may not
arrive in the same order as they happened, due to the multi-core nature of
modern systems. But other than that, I'm convinced it is fine. And none
of this is a real issue for event-based filter list creation to minimize
rsync's work.
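
As an illustration of handing an event-derived list to rsync, here is a
minimal Python sketch. It uses rsync's --files-from option rather than the
filter mask Lsyncd builds, and it assumes the changed paths were already
collected by some monitor. build_rsync_command and its parameters are
hypothetical names, paths are assumed to contain no newlines, and deleted
files are not handled.

```python
import os
import tempfile


def build_rsync_command(changed_paths, source_root, destination):
    """Write the changed paths to a list file and build an rsync command.

    rsync reads the names in a --files-from file relative to the source
    directory, so the paths are made relative to `source_root` first.
    Returns (command, list_file_path); the caller runs the command and
    removes the list file afterwards.
    """
    relative = sorted({os.path.relpath(p, source_root) for p in changed_paths})
    fd, list_path = tempfile.mkstemp(prefix="changed-", suffix=".list")
    with os.fdopen(fd, "w") as handle:
        for name in relative:
            handle.write(name + "\n")
    command = [
        "rsync",
        "-a",
        "--files-from=" + list_path,
        source_root,
        destination,
    ]
    return command, list_path
```

The returned list could be passed to subprocess.run(command, check=True).
Duplicate events for the same file collapse into one entry, so a noisy event
stream does not make the rsync run any larger.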

The only other issue I know of is hard links. Create a hard link outside
the watched directory to a file within the watched directory tree, and
altering the file through that link will not create an event. In that case
you simply must not use such links. This has hardly been an issue in most
use cases, though.

Re: Huge directory tree: Get files to sync via tools like sysdig

2017-02-09 Thread Karl O. Pinc
On Thu, 9 Feb 2017 14:43:57 +0100
Axel Kittenberger  wrote:

> >
> > Not only that, but inotify is not guaranteed.  (At least not on
> > 3.16.0.  Can't speak for later versions.)  So you might miss some
> > changes.
> >  
> 
> Got any info on that?
> 
> I noted that IN_MOVED_FROM and IN_MOVED_TO events are not guaranteed to
> arrive in order, or even the file descriptor might briefly close with "no
> more events" in between them, but I have never heard of anybody
> encountering an issue of an event in a watched directory not being
> correctly reported without getting notice of the overflow via an
> IN_Q_OVERFLOW event, which in the case of Lsyncd results in a full
> rescan of everything.

Not much.  inotify(7) on my system says:

   With careful programming, an application can use inotify to
   efficiently monitor and cache the state of a set of filesystem
   objects.  However, robust applications should allow for the
   fact that bugs in the monitoring logic or races of the kind
   described below may leave the cache inconsistent with the
   filesystem state.  It is probably wise to do some consistency
   checking, and rebuild the cache when inconsistencies are
   detected.

I think one of the pretty much unavoidable race conditions is
sub-directory creation; the sub-directory can have files added
to it before the monitoring process is able to set a watch
on it.  Of course this is an application level race.

I've had incron (which uses inotify) regularly fail to
catch all monitored fs changes on a busy system.  And
the monitored system does not involve creating sub-directories --
and I don't think I'm exceeding the system's inotify event limit
either.  But I could be wrong about either of these.

So perhaps the take-away is that inotify is "hard", or even
"impossible" to rely on as the sole method for change monitoring.
It may not be right to say it's "unreliable" as I did above.
I'm not the expert here.  But I can say that my limited
experience with it makes me want to look very closely
before relying on it.
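
The consistency checking suggested by the man page could look something like
the following Python sketch. It is only an assumption about how such a check
might be done: it compares a cached map of file sizes and modification times
against a fresh walk of the tree, which is exactly the slow traversal the
event monitor is meant to avoid, so it would run rarely. snapshot and
find_inconsistencies are hypothetical names.

```python
import os


def snapshot(root):
    """Map each file path under `root` to its (size, mtime_ns)."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except FileNotFoundError:
                continue  # removed while the walk was in progress
            state[path] = (st.st_size, st.st_mtime_ns)
    return state


def find_inconsistencies(cached, root):
    """Compare the cached state with the filesystem.

    Returns (missed, stale): paths that are new or changed on disk but
    not reflected in `cached`, and paths in `cached` that no longer exist.
    """
    current = snapshot(root)
    missed = sorted(p for p, meta in current.items() if cached.get(p) != meta)
    stale = sorted(p for p in cached if p not in current)
    return missed, stale
```

If either list is non-empty, the monitor has missed something and the cache
should be rebuilt from the fresh snapshot.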

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein



Re: Huge directory tree: Get files to sync via tools like sysdig

2017-02-09 Thread Karl O. Pinc
On Thu, 9 Feb 2017 10:55:51 +0100
Axel Kittenberger  wrote:

> > Does anyone have experience with collecting the changed files
> > with a third party tool which detects which files were changed?  
> 
> I don't know sysdig, but I am the developer of Lsyncd, which does
> exactly that: it collects file changes via the inotify event mechanism
> and then calls rsync with a matching filter mask.
> 
> However, since you say your directory tree is huge, the main issue
> is that for every directory an inotify watch must be created, taking
> about 1KB of kernel memory per watch.

Not only that, but inotify is not guaranteed.  (At least not on
3.16.0.  Can't speak for later versions.)  So you might miss some
changes.


Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein



Re: Huge directory tree: Get files to sync via tools like sysdig

2017-02-09 Thread Axel Kittenberger
>
> Not only that, but inotify is not guaranteed.  (At least not on
> 3.16.0.  Can't speak for later versions.)  So you might miss some
> changes.
>

Got any info on that?

I noted that IN_MOVED_FROM and IN_MOVED_TO events are not guaranteed to
arrive in order, or even the file descriptor might briefly close with "no
more events" in between them, but I have never heard of anybody encountering
an issue of an event in a watched directory not being correctly reported
without getting notice of the overflow via an IN_Q_OVERFLOW event, which in
the case of Lsyncd results in a full rescan of everything.

Re: Huge directory tree: Get files to sync via tools like sysdig

2017-02-09 Thread Ben RUBSON
> On 09 Feb 2017, at 10:05, Thomas Güttler  wrote:
> 
> Hi,
> 
> we have a huge directory tree.
> 
> 
> * 17M files (number of files)
> * 2.2TBytes of data.
> * Only 0.1% changes per day
> 
> Current pain: rsync's directory tree traversal takes too long to discover the
> changed files.

Hi,

On which type of FS is this directory ?

Ben


Re: Huge directory tree: Get files to sync via tools like sysdig

2017-02-09 Thread Axel Kittenberger
> Does anyone have experience with collecting the changed files
> with a third party tool which detects which files were changed?

I don't know sysdig, but I am the developer of Lsyncd, which does exactly
that: it collects file changes via the inotify event mechanism and then
calls rsync with a matching filter mask.

However, since you say your directory tree is huge, the main issue is that
for every directory an inotify watch must be created, taking about 1KB of
kernel memory per watch. If you have a million directories, this is a GB of
unswappable memory.
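
To put a number on this for a given tree, a rough Python sketch follows. The
1KB per watch is only the approximate figure quoted above, not a kernel
constant, and count_directories and estimate_watch_memory are hypothetical
helper names.

```python
import os

# Rough per-watch cost quoted in this thread; an estimate, not a kernel constant.
BYTES_PER_WATCH = 1024


def count_directories(root):
    """Count `root` itself plus every directory below it (symlinks are not followed)."""
    return sum(1 for _ in os.walk(root))


def estimate_watch_memory(directory_count, bytes_per_watch=BYTES_PER_WATCH):
    """Estimate kernel memory in bytes for one inotify watch per directory."""
    return directory_count * bytes_per_watch
```

For one million directories, estimate_watch_memory(1_000_000) gives
1,024,000,000 bytes, a little under 1 GiB. The per-user limit in
/proc/sys/fs/inotify/max_user_watches would also have to be at least as
large as the directory count.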

Unfortunately the Linux kernel doesn't provide a better way yet, and I
suppose other tools like sysdig suffer from the same issue. There is
fanotify, but that doesn't report move events and thus is not usable for
this task.

Kind regards, Axel

On Thu, Feb 9, 2017 at 10:05 AM, Thomas Güttler <
guettl...@thomas-guettler.de> wrote:

> Hi,
>
> we have a huge directory tree.
>
>
>  * 17M files (number of files)
>  * 2.2TBytes of data.
>  * Only 0.1% changes per day
>
> Current pain: rsync's directory tree traversal takes too long to discover
> the changed files. Only a few files change.
>
> I discovered the tool sysdig which could be used to monitor the files
> which were changed.
>
> Then we could feed the list of changed files to rsync and avoid the long
> directory traversal of rsync.
>
> Does anyone have experience with collecting the changed files with a third
> party tool which detects which files were changed?
>
> Regards,
>  Thomas Güttler
>
>
>
> --
> Thomas Guettler http://www.thomas-guettler.de/