Re: Detecting copies/moved files

Philipp Marek Sun, 26 Aug 2007 04:56:23 -0700

Hello Gunter!


On Sunday 26 August 2007 Gunter Ohrner wrote:
> Just a few questions to make sure I understood your ideas correctly, and
> one or another thoughts of mine...
>
> On Sunday, 26 August 2007, Philipp Marek wrote:
> > - There'll be "fsvs copy"/"fsvs move" commands, which (when given some
> >   parameter) will call "cp -a"/"mv" with the arguments, for manual
> > copy/move.
> This will copy / move the file and additionally track this change for
> later submission to the repository server?
Right.

> > - Likewise some "fsvs copied-from", to tell what already
> > *has* been done.
> Is this meant to tag historical data which was already copied / moved and
> committed some time ago, before fsvs was ready to track this?
No, that was meant to be in case some other process did this, and you'd like 
to tell FSVS that it should use this information - just like above, but 
without the actual copy/rename.

> Or is it merely for tracking copy / move-operations performed using
> standard command line or graphical tools before committing the changes?
Yes.

> > - On commit itself normally no such things
> > would happen; although there'll probably be some option to re-enable
> > that.
> Question: Would there be any drawbacks of automagic copy / move detection?
> I currently do not see why "detect-copies" should not be performed (and
> the resulting information used) on any commit.
False positives and negatives?

> > distinct "rename" operation, there's a problem: If file A is missing,
> > but there are B and C with the same data -- is one renamed, and the
> > other copied?
> My first thought was: Both are copies and the original is deleted
> afterwards.
Yes, that's what I tried to say in the indented paragraph.
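
A rough sketch of that rule, in Python rather than anything FSVS actually
contains (detect_copies, the dict layout and the tuple format are made up for
illustration): every new entry whose MD5 matches a known entry is reported as
a copy, and a source that has vanished is reported as deleted exactly once.

```python
import hashlib
from collections import defaultdict


def digest(data: bytes) -> str:
    """MD5 of the file content, as a hex string."""
    return hashlib.md5(data).hexdigest()


def detect_copies(old: dict, new: dict) -> list:
    """old/new map path -> content (bytes) at the last commit / right now.

    Hypothetical helper: returns ("copy", source, target) and
    ("delete", path, None) tuples; a rename is never reported as such.
    """
    sources = defaultdict(list)
    for path, data in old.items():
        sources[digest(data)].append(path)

    result = []
    for path in sorted(new):
        if path in old:
            continue  # already known, so not a copy candidate
        # Every known entry with the same content is a possible source.
        for src in sources.get(digest(new[path]), []):
            result.append(("copy", src, path))

    # A missing source is deleted once, however often it was copied.
    for path in sorted(old):
        if path not in new:
            result.append(("delete", path, None))
    return result
```

With old = {"A": b"x"} and new = {"B": b"x", "C": b"x"} this gives two copies
from A plus one delete of A. If two known entries have the same content, a new
file gets two candidate sources, which is one way to end up with false
positives.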

> > - What about small files, which share the same MD5 because they have
> > the same data, but are "different" in the meaning of "independent"?
> > (Eg. the default config data in users' home directories).
> Mh, what about these? As long as these files are identical, it's fine,
> and as soon as one copy changes it will deviate from the other copies,
> just as it happens on the local disc. At least that's the way it's done
> in a standard subversion working copy - will fsvs handle it differently?
The problem I see is that with normal subversion "copy" semantics some kind of 
relationship between /home/a/.kde/XXX and /home/b/.kde/XXX would be drawn.
Of course, if you have /etc/skel/ versioned too, then that could be used as 
source of the various /home/*/ directories ... and all would be fine.

I mostly fear confusion along the lines of "~a/.bashrc is copied from
~b/.bashrc, and ~a/.kde is copied from ~c/.kde" ... which would not be good.

> > - For big files that share some data, we can use the
> > pre-existing manber-hashes ... that's what they are there for.
> How do you want to share common parts within big files? Is the subversion
> repository able to handle something like that? Is it useful at all?
No, I don't think such things can (currently) be shared. The only use for such 
things is "I rename my MP3 file, and change that ID tag" ... then they 
wouldn't be identical, but have a common history.

> If the large files have a single source and still are similar but also
> already have been committed, it's too late as the storage space within the
> repository is occupied twice already.
That's right. All copy-from information must be known before commit.

> If a file is duplicated using a copy 
and FSVS uses the copy-from information,
> and both copies deviate, a future 
> fsvs will track this and the subversion repository will only record the
> changes using its xdelta algorithm anyway.
Why would tracking be needed there? Of course, FSVS should happily use the
delta information (where applicable, instead of full-text) -- but tracking?
Subversion will just record the changes.

Another case - B is copied from A, and committed with this information.
Later A is changed (A'), and B copied again (B') ... should now B' have 
history to B, or to A'?

> So, to be honest, I see no point in doing anything special to big files -
> probably I've not yet understood what you actually want to achieve by
> this... ;)
>
> Maybe you could elaborate?
Big files are special in only one way: FSVS already computes the
manber-hashes for them (although very coarse ones), to speed up checking for
changes.
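
To illustrate the MP3 example from above, here is a minimal sketch of
content-defined block hashing in the spirit of Manber's technique. It is not
the FSVS implementation; WINDOW, the boundary rule and both function names are
assumptions chosen to keep the example short (a real implementation would use
a rolling hash instead of re-hashing every window).

```python
import hashlib

WINDOW = 16  # bytes looked at to decide on a block boundary (assumed value)


def chunks(data: bytes) -> list:
    """Cut data wherever the MD5 of the last WINDOW bytes starts with 0x00.

    The boundaries depend only on nearby content, so a change at the start
    of the file does not shift the later boundaries.
    """
    out, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if hashlib.md5(data[end - WINDOW:end]).digest()[0] == 0:
            out.append(data[start:end])
            start = end
    if start < len(data):
        out.append(data[start:])  # trailing block without a boundary
    return out


def shared_fraction(a: bytes, b: bytes) -> float:
    """Fraction of distinct block hashes that both inputs have in common."""
    ha = {hashlib.md5(c).digest() for c in chunks(a)}
    hb = {hashlib.md5(c).digest() for c in chunks(b)}
    return len(ha & hb) / max(1, len(ha | hb))
```

Two files that differ only in a tag at the front then share nearly all block
hashes, while their whole-file MD5s differ; that is the "common history" hint
meant above.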

> > - Could we use the inode number for detecting moved files? Only on
> No, an fsvs managed directory tree may reside on multiple local
> filesystems, or isn't that supported? (I think I didn't try it yet, but
> it would certainly be the case for me if I managed my whole system
> installation using fsvs - /, /usr, /var all are on different disks or at
> least disk partitions.)
Yes, but if you do "mv /usr/bin/a /usr/bin/c" the file keeps its inode
number, as both names are on the same filesystem.

> MD5 is more robust against stuff like this.
Yes, but much more computationally intensive.
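
As a sketch of the cheap variant (Python, with made-up names; nothing here is
taken from FSVS): remember the (device, inode) pair per entry at commit time,
and treat a new path with a known pair as a rename if the old path is gone.
Pairing the inode with the device number covers the multiple-filesystems
objection, since inode numbers are only unique per filesystem.

```python
import os
import tempfile


def file_id(path: str) -> tuple:
    """(device, inode) pair identifying a file on this machine."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def find_renames(known: dict, current_paths: list) -> list:
    """known maps path -> (st_dev, st_ino) as recorded at the last commit.

    Hypothetical helper: returns (old_path, new_path) pairs.
    """
    by_id = {fid: path for path, fid in known.items()}
    renames = []
    for path in current_paths:
        old = by_id.get(file_id(path))
        # Same file under a new name, and the old name no longer exists.
        if old is not None and old != path and not os.path.exists(old):
            renames.append((old, path))
    return renames


def demo() -> list:
    """Rename a file inside a scratch directory and detect it."""
    with tempfile.TemporaryDirectory() as d:
        a, c = os.path.join(d, "a"), os.path.join(d, "c")
        with open(a, "w") as f:
            f.write("data")
        known = {a: file_id(a)}
        os.rename(a, c)
        found = find_renames(known, [c])
        return [(os.path.basename(o), os.path.basename(n)) for o, n in found]
```

This needs no reading of file data at all. It misses moves across filesystems
(the file gets a new inode there), and an inode number can be reused after a
delete, so a match is only a hint.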

> However, using the inodes would help - no, would be required - to allow
> hardlink tracking, so once hardlink support is added to fsvs, MD5 sums
> might be used as well as the inode number / device node.
Hardlink tracking in FSVS would mean some way to store this information in 
subversion, too ...

> > - For detecting copied/moved directories FSVS would see that
> > there is a new directory, and check its files and subdirectories ... if
> > there are entries that relate to some other directory (deleted or not)
> > we could draw some conclusions. Possibly use some percentage?
> Sounds smart, but does also sound "fragile" - my experience with "smart"
> software is that there are always more or less frequent "corner cases" in
> which the software just does NOT do what you expected, and in the most
> annoying cases there's not even a way to force a manual override. :-(
That's why "autodetect" will only output a list - if you trust FSVS, you can
do
        fsvs autodetect --find | fsvs autodetect load
or some such.
Otherwise, use $EDITOR in between.
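
The percentage idea for directories could look roughly like this (a Python
sketch under my own assumptions: best_source, the 0.5 default and the dict
layout are invented for illustration, and the result is only a suggestion for
the list mentioned above):

```python
def best_source(new_dir: dict, known_dirs: dict, threshold: float = 0.5):
    """Suggest which known directory a new directory was copied from.

    new_dir maps entry name -> content hash; known_dirs maps a directory
    path -> such a dict. Returns (path, ratio), with path None when no
    candidate reaches the threshold.
    """
    best = (None, 0.0)
    for path, entries in known_dirs.items():
        if not new_dir:
            break  # an empty directory matches nothing
        # Count entries with the same name and the same content.
        matching = sum(1 for name, h in new_dir.items()
                       if entries.get(name) == h)
        ratio = matching / len(new_dir)
        if ratio > best[1]:
            best = (path, ratio)
    return best if best[1] >= threshold else (None, best[1])
```

For a new home directory with four entries, three of them identical to
/etc/skel, this returns ("/etc/skel", 0.75); raising the threshold to 0.9
gives (None, 0.75), so no copy would be suggested.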

> > (http://svn.haxx.se/dev/archive-2001-11/0498.shtml) but for linked
> > entries that propagate their changes ... I don't think that's what's
> > needed here,
> Mh, but that's how a hardlink in the file system works, right?
Yes, of course ... but I'm not really sure where we should draw the boundary 
for FSVS.
- Does FSVS just do "snapshots"? Then we should record that they had some
  relationship, and restore it.
- Is the hardlink just used to save space, with no other real meaning?
  Then they should be treated as independent.
- Does the link express some kind of relationship? Then other thoughts apply.

> What would happen if we manage a single file twice in the same fsvs
> managed directory tree, using hardlinks? I've never tried it so far, but
> what would happen if the file changes in the repository and fsvs starts
> updating "both" local files which actually are the same file?
They'd end up *without* being hardlinked.
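
The mechanism can be shown with a small Python sketch. It assumes that an
update writes the new content to a temporary file and renames it over the
target; that is an assumption for the demonstration, not a description of the
FSVS code, and update_by_replace is a made-up name.

```python
import os
import tempfile


def update_by_replace(path: str, data: str) -> None:
    """Write new content to a temporary file, then rename it over path."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(data)
    os.replace(tmp, path)  # path now refers to a brand-new inode


def demo() -> tuple:
    """Return (hardlinked before the update, hardlinked after it)."""
    with tempfile.TemporaryDirectory() as d:
        one, two = os.path.join(d, "one"), os.path.join(d, "two")
        with open(one, "w") as f:
            f.write("r1")
        os.link(one, two)  # two names, one inode
        linked_before = os.path.samefile(one, two)
        update_by_replace(one, "r2")
        update_by_replace(two, "r2")
        linked_after = os.path.samefile(one, two)
        return linked_before, linked_after
```

Both names end up with the same data, but as two separate files.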

> > I'd lean towards simply using some property on the file "UUID: had
>
> (... Conceptual hardlink problems cut out ...)
>
> > revisions ... If i commit /bin as r4, and /sbin as r5, there might be
>
> Properly supporting hardlinks will probably open a huge can of
> worms... :-/
That's what I fear, too.

> Especially as subversion itself has no concept of hardlinks so far, if I'm
> not mistaken, and thus trying to emulate it would probably cause
> unexpected behaviour in several not-too-uncommon cases...
Yes.

> (A changed file will look as "changed" through all of its links. Thus,
> fsvs would need to simulate this behaviour, maybe using a dummy-entry
> with some magic SVN properties for all but the "primary" hardlink, so
> that changes would be detected and restored properly. Now eg. what
> happens if this "primary" hardlink gets deleted?)
I already thought that only a "primary" entry gets the data, and all others 
have some kind of "externals" link - with revision number, so even if the 
entry gets deleted later the correct data can be restored.
If you commit all 5 names of a hardlink again, one gets the data, the others 
just the "it's over there" information.

But I'm not sure whether that's good ... it would be a major difference to svn, 
and then svn could no longer be used to get the data back. (At least not as 
easily as now.)


Thank you for your effort and comments.


Regards,

Phil



-- 
Versioning your /etc, /home or even your whole installation?
             Try fsvs (fsvs.tigris.org)!
