Hi!

Just a few questions to make sure I understood your ideas correctly, plus 
a few thoughts of my own...

On Sunday, 26 August 2007, Philipp Marek wrote:
> - There'll be "fsvs copy"/"fsvs move" commands, which (when given some
>   parameter) will call "cp -a"/"mv" with the arguments, for manual
> copy/move.

This will copy / move the file and additionally track this change for 
later submission to the repository server?

> - Likewise some "fsvs copied-from", to tell what already 
> *has* been done.

Is this meant to tag historical data which was already copied / moved and 
committed some time ago, before fsvs was ready to track this?

Or is it merely for tracking copy / move operations performed with 
standard command-line or graphical tools before committing the changes?

> - Then I'll do some "fsvs detect-copies", which will 
> output some kind of list to STDOUT, for manual checking. This list can
> be re-imported and used.

That's for the same purpose as "copied-from", but automagically? Sounds 
good.

> - On commit itself normally no such things 
> would happen; although there'll probably be some option to re-enable
> that.

Question: would there be any drawbacks to automagic copy / move detection? 
I currently do not see why "detect-copies" should not be performed (and 
the resulting information used) on every commit.

> - If two files have the same MD5, they'll be found. If the original
(...)
> distinct "rename" operation, there's a problem: If file A is missing,
> but there are B and C with the same data -- is one renamed, and the
> other copied?

My first thought was: both are copies, and the original is deleted 
afterwards.
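
Just to make the ambiguity concrete, here's how I picture such a detection 
working - a rough Python sketch, all names mine and purely illustrative, 
not anything fsvs actually does:

    import hashlib

    def md5_of(path):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(65536), b""):
                h.update(block)
        return h.hexdigest()

    def copy_candidates(deleted_md5s, added_paths):
        # deleted_md5s: {path: MD5 recorded at the last commit}
        by_md5 = {}
        for path, digest in deleted_md5s.items():
            by_md5.setdefault(digest, []).append(path)
        for path in added_paths:
            sources = by_md5.get(md5_of(path))
            if sources:
                yield path, sources
        # If A was deleted and both B and C match its MD5, both show up
        # here with A as their source -- "two copies plus a delete" is
        # all the data can tell us.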

> - What about small files, which share the same MD5 because they have
> the same data, but are "different" in the meaning of "independent"?
> (Eg. the default config data in users' home directories).

Mh, what about these? As long as these files are identical, it's fine, 
and as soon as one copy changes it will deviate from the other copies, 
just as it happens on the local disk. At least that's the way a standard 
Subversion working copy handles it - will fsvs handle it differently?

> - For big files that share some data, we can use the
> pre-existing manber-hashes ... that's what they are there for.

How do you want to share common parts within big files? Is the Subversion 
repository able to handle something like that? Is it useful at all?

If the large files have a single source and are still similar, but have 
already been committed, it's too late: the storage space in the 
repository is already occupied twice.

If a file is duplicated by a copy and the two copies then deviate, a 
future fsvs will track this, and the Subversion repository will only 
record the changes via its xdelta algorithm anyway.

So, to be honest, I see no point in doing anything special for big files - 
probably I've not yet understood what you actually want to achieve with 
this... ;)

Maybe you could elaborate?
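
For my own understanding, this is roughly what I picture behind "manber 
hashes": cut a file at content-defined positions, so that identical 
regions in two big files produce identical chunks that can then be 
compared by checksum. A deliberately naive Python toy, not how fsvs 
actually computes them:

    import hashlib

    def chunk_md5s(data, window=64, mask=0x0FFF):
        # Cut wherever a checksum over the last `window` bytes hits a
        # magic value; identical regions in two files then yield
        # identical chunks.
        start, out = 0, []
        for i in range(window, len(data)):
            # Toy rolling hash; real code would update it incrementally.
            if (sum(data[i - window:i]) & mask) == mask:
                out.append(hashlib.md5(data[start:i]).hexdigest())
                start = i
        out.append(hashlib.md5(data[start:]).hexdigest())
        return out

    # Two big files that share regions will share chunk MD5s:
    # common = set(chunk_md5s(a)) & set(chunk_md5s(b))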

> - Could we use the inode number for detecting moved files? Only on

No - an fsvs-managed directory tree may reside on multiple local 
filesystems, or isn't that supported? (I don't think I've tried it yet, 
but it would certainly be the case for me if I managed my whole system 
installation with fsvs: /, /usr and /var are all on different disks, or 
at least on different disk partitions.)

MD5 is more robust against stuff like this.

However, using the inodes would help - no, would be required - to allow 
hardlink tracking, so once hardlink support is added to fsvs, the device / 
inode numbers could be used alongside the MD5 sums.
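
To spell out what I mean (a small Python sketch based on plain stat() 
data, function name mine): the pair (st_dev, st_ino) identifies a file 
uniquely, while the inode number alone only means something within a 
single filesystem:

    import os
    from collections import defaultdict

    def hardlink_groups(paths):
        # Paths that share (st_dev, st_ino) are the very same file on disk.
        groups = defaultdict(list)
        for p in paths:
            st = os.lstat(p)
            if st.st_nlink > 1:
                groups[(st.st_dev, st.st_ino)].append(p)
        return [g for g in groups.values() if len(g) > 1]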

> - For detecting copied/moved directories FSVS would see that
> there is a new directory, and check its files and subdirectories ... if
> there are entries that relate to some other directory (deleted or not)
> we could draw some conclusions. Possibly use some percentage?

Sounds smart, but also sounds "fragile" - my experience with "smart" 
software is that there are always more or less frequent corner cases in 
which the software just does NOT do what you expected, and in the most 
annoying cases there's not even a way to force a manual override. :-(
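
Just so I'm sure I understand the idea - is the check roughly this 
(illustrative Python, names mine): take the fraction of entries in the 
new directory whose MD5 also appears under a candidate source directory, 
and call it a copy above some threshold?

    def match_fraction(new_dir_md5s, old_dir_md5s):
        # Fraction of the new directory's entries whose MD5 also appears
        # in the candidate source directory.
        if not new_dir_md5s:
            return 0.0
        known = set(old_dir_md5s)
        return sum(1 for d in new_dir_md5s if d in known) / len(new_dir_md5s)

    # e.g. treat it as a copied directory if match_fraction(...) >= 0.9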

> (http://svn.haxx.se/dev/archive-2001-11/0498.shtml) but for linked
> entries that propagate their changes ... I don't think that's whats
> needed here,

Mh, but that's how a hardlink in the file system works, right?

What would happen if we manage a single file twice in the same fsvs 
managed directory tree, using hardlinks? I've never tried it so far, but 
what would happen if the file changes in the repository and fsvs starts 
updating "both" local files which actually are the same file?

> I'd lean towards simply using some property on the file "UUID: had

(... Conceptual hardlink problems cut out ...)

> revisions ... If i commit /bin as r4, and /sbin as r5, there might be

Properly supporting hardlinks will probably open a huge can of 
worms... :-/

Especially since Subversion itself, if I'm not mistaken, has no concept of 
hardlinks so far, trying to emulate them would probably cause unexpected 
behaviour in several not-too-uncommon cases...

(A changed file will look "changed" through all of its links. Thus, 
fsvs would need to simulate this behaviour, maybe using a dummy entry 
with some magic SVN properties for all but the "primary" hardlink, so 
that changes would be detected and restored properly. But then, what 
happens if this "primary" hardlink gets deleted?)

Greetings,

  Gunter
