> > md5sum [...] seems surplus.
Daniel Kahn Gillmor wrote:
> Right, but it would seem to fail for hardlinked files or deduped files,
> because it would weight one of the files in different places than the
Oh. I misunderstood the md5sum part again. It is indeed
for the content - as I thought at first.
(That pipe is too long for my old brain.)
OK. Now we are balancing on the edge of reproducibility.
All files with the same content - hardlink or not - will have
the same sorting key value. You will have to make sure that
the items enter the sort always in the same starting order,
because the outcome is not defined with same-key items.
find . -type f -print0 | sort -z | xargs -0 md5sum | sort | ...
Let's hope for a sort algorithm that is deterministic with
same-key items. :))
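A minimal sketch of that hope, assuming GNU coreutils (the -z, -s
and -k options are GNU extensions): fix the starting order by sorting
the path list first, then do a *stable* sort keyed on the digest field
only, so that same-content files keep their path order:

```shell
# Sketch (assumes GNU sort/xargs/md5sum):
# 1. sort the NUL-delimited path list -> fixed starting order,
# 2. compute the content digests,
# 3. stable sort (-s) on the digest field only (-k1,1),
#    so files with identical content stay in path order.
find . -type f -print0 | sort -z | xargs -0 md5sum | sort -s -k1,1
```

With -s and -k1,1 the outcome for same-key items is defined: ties are
broken by the order in which they entered the sort.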
This still does not give hardlink siblings the same weight,
because they are counted individually in the sorted list.
But the process of deciding which weight will apply to
the shared IsoFileSrc will be deterministic in libisofs.
> (and a pony! :) )
xorriso-1.4.2 will hopefully be released with pony.
> Also, the above doesn't do anything for non-directory, non-regular files
> (sockets, fifos, device nodes, etc) -- do those even make sense in
> ISO-9660? Do we need to worry about how/where they sort?
They are represented by empty regular ISO 9660 files with
Rock Ridge entries which identify them as their POSIX file
type. Unix sockets make little sense there, though.
Device nodes are a security risk unless the mounter
applies -o nodev (and best "noexec,nosuid" too).
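For instance, a sketch of an /etc/fstab entry for optical media that
applies all three options (the device name and mount point are only
illustrative):

```
# hypothetical fstab line; /dev/sr0 and /mnt/cdrom are examples
/dev/sr0  /mnt/cdrom  iso9660  ro,nodev,nosuid,noexec  0  0
```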
> Do we need to worry about how/where they sort?
They all end up at the block that is reserved for empty
files. (We had nasty pseudo-hardlink problems when they
pointed to the start of the next non-empty file.)
> > Extent location of regular files:
> This is a triply-nice result, esp. because
Because it makes the pony affordable.
> What if you kept the red/black tree implementation, but keyed it by file
> content digest (md5, sha1, sha256, whatever) instead of by dev/inode
I could. But that would make all files with identical
content look like hardlinks.
Further, it would cause an extra read pass over the input files.
Not a problem for a small CD image. But there are BD XL media out there.
> * Maybe there is some context where deduplicated files could be
It depends on the reader software. (Who woulda thunk that an
empty file can spoil interpretation of a non-empty file
with the same extent start.)
One would at least have to think more.
And after all, the pony will make us happy ... i hope.
Have a nice day :)
Reproducible-builds mailing list