Re: naming files based on their hash

George Clemmer Mon, 07 May 2018 14:34:29 -0700

Hi Ciprian,

On 05/07/2018 at 05:55 Ciprian Dorin Craciun writes:


> On Mon, May 7, 2018 at 3:05 AM George Clemmer <myg...@gmail.com> wrote:
>> I just "resynced" my local maildir from scratch. I expected all the files to
>> be renamed but I figured it would be no biggie to Git. I was a little
>> surprised when my Git repo grew from 2.5G to 4.5G :-O

My details: 16736 Maildir files occupying 3.1G backed up by a 2.5G git
repo. After the "hard" resync the repo grew to 4.5G. I worked around
this by backing out the "hard" resync commit to "shrink the baby" back
to its former size ;-)

> I used a similar setup with Git as backup, and I too have experienced a
> "hard" resync that blew up my Git repository size.
>
> However, it wasn't Git's fault: had it encountered only file renames, it
> would have happily reused the existing data and not needed more storage
> than before.
>
> The "culprit" here is a header called `X-TUID` which seems to be added by
> `isync` for internal purposes.  (Therefore searching the mailing list
> archive for `X-TUID` will lead you to other people that stumbled into this
> issue.)

This is news to me. Thanks for pointing it out.

> Fortunately you can "convince" Git to repack the repository and check for
> file rewrites, which will save you some space.  But depending on how large
> the repository is (i.e. how many files), this could take some time and
> memory.  (Look in `man git-config` and play with the settings that pertain
> to rename detection.)

I looked at this a bit. It hurt my head. So I didn't try it ;-)

>> So ... this led me to wonder ... Would using a "stable" name based on a
>> checksum be a useful improvement? Naturally, since I am a Git addict, I
>> am thinking of 'git hash-object' ;-)
>
>
> I think this would be a lovely idea; however, due to the `X-TUID`
> header it would be pointless...

I haven't had a chance to look at `X-TUID` but I wonder if a hashed file
name couldn't replace it.
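
To make the idea concrete, here is a minimal sketch in Python of a "stable"
name. It reproduces the blob hash that `git hash-object` computes, but drops
the `X-TUID` header first. It assumes, as described above, that this header is
the only thing that differs between two syncs of the same message; I have not
verified that against `isync` itself. The names `strip_header` and
`stable_name` are made up for illustration.

```python
import hashlib


def git_blob_hash(data: bytes) -> str:
    """SHA-1 that `git hash-object` reports for a blob with this content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def strip_header(message: bytes, name: bytes = b"X-TUID") -> bytes:
    """Drop `name:` lines (and their folded continuations) from the header block.

    Lines after the first blank line belong to the body and are left alone.
    """
    prefix = name.lower() + b":"
    out = []
    in_headers = True
    skipping = False
    for line in message.splitlines(keepends=True):
        if in_headers:
            if line in (b"\n", b"\r\n"):
                # First blank line: the header block ends here.
                in_headers = False
                skipping = False
            elif skipping and line[:1] in (b" ", b"\t"):
                # Continuation of a header we are dropping.
                continue
            else:
                skipping = line.lower().startswith(prefix)
                if skipping:
                    continue
        out.append(line)
    return b"".join(out)


def stable_name(message: bytes) -> str:
    """Hypothetical file name: Git blob hash of the message minus X-TUID."""
    return git_blob_hash(strip_header(message))
```

With this, two copies of a message that differ only in their `X-TUID` value
get the same name, while a message without that header gets exactly the name
`git hash-object` would print for the file.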

> However a better discussion would be the following:  how to use `isync` for
> archival purposes, including for "de-duping" mail accounts.  The "archival"
> part is pretty simple:  no matter how many times you re-sync your inbox from
> scratch, the file names should be consistent -- through hashing.  The
> "de-duping" is a little more complicated:  say you have multiple accounts
> (personal and for "business") and you forward mail from one to another
> (for accessibility);  however, you don't want to delete the forwarded emails.
> Now if you sync all these accounts you'll get the same email multiple
> times, and because of the different "routing" headers the copies won't
> match.
>
> However if you "split" the message apart -- headers and body -- the
> headers might have changed but the body will be identical.  Now if we can
> devise a way to store the two things separately, we'll end up with a better
> archival solution.  Unfortunately this would no longer be a proper standard
> "maildir";  but fortunately with some FUSE one could re-present this
> "archive" as proper maildirs.  Bonus points if one also splits the email
> body into multiple MIME-parts, and de-dups those also (just think of a
> thread that re-sends the same attachment over and over...)
>
> But perhaps this "archival" use-case is far out of scope of `isync` and a
> tool written from scratch with exactly this purpose in mind would be
> better.
>
> Ciprian.

Probably. Or, for de-duplication one could use a standard tool.  I tried
https://github.com/borgbackup/borg on my 16736 Maildir files and it
produced a 1.9G archive.  For me, the 0.6G saved relative to the 2.5G Git
repo does not justify a switch to borg ;-)
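
For reference, the header/body split discussed above could look roughly like
the Python sketch below. It is only an illustration of the idea, not anything
`isync` or borg does: bodies are stored once, keyed by their SHA-256, and each
message keeps its own headers plus a pointer to the body. The function names
and the in-memory layout are assumptions of mine, and MIME-part splitting is
left out.

```python
import hashlib


def split_message(raw: bytes):
    """Split at the first blank line; the separator stays with the headers."""
    hits = [(raw.find(sep), len(sep)) for sep in (b"\r\n\r\n", b"\n\n")]
    hits = [(pos, size) for pos, size in hits if pos != -1]
    if not hits:
        # No blank line: treat the whole thing as headers.
        return raw, b""
    pos, size = min(hits)
    return raw[:pos + size], raw[pos + size:]


def archive(messages):
    """Return (bodies, entries): bodies stored once, one entry per message."""
    bodies = {}   # body hash -> body bytes
    entries = []  # (headers, body hash), in input order
    for raw in messages:
        headers, body = split_message(raw)
        digest = hashlib.sha256(body).hexdigest()
        bodies.setdefault(digest, body)
        entries.append((headers, digest))
    return bodies, entries


def restore(bodies, entry) -> bytes:
    """Rebuild the original message bytes from an archive entry."""
    headers, digest = entry
    return headers + bodies[digest]
```

Two copies of a mail that took different routes end up as two small header
entries pointing at one stored body, and `restore` gives back the original
bytes, which is what a FUSE layer would need to present proper maildirs again.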

- George

_______________________________________________
isync-devel mailing list
isync-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/isync-devel
