Hi Ciprian,

On 05/07/2018 at 05:55 Ciprian Dorin Craciun writes:
> On Mon, May 7, 2018 at 3:05 AM George Clemmer <myg...@gmail.com> wrote:
>> I just "resynced" my local maildir from scratch. I expected all the
>> files to be renamed but I figured it would be no biggie to Git. I was
>> a little surprised when my Git repo grew from 2.5G to 4.5G :-O

My details: 16736 Maildir files occupying 3.1G, backed up by a 2.5G Git
repo. After the "hard" resync the repo grew to 4.5G. I worked around
this by backing out the "hard" resync commit to "shrink the baby" back
to its former size ;-)

> I used a similar setup with Git as backup, and I too have experienced a
> "hard" resync which blew up my Git repository size.
>
> However it wasn't Git's fault: had it encountered just file renames, it
> would have been happy to re-use the "existing" data and not incur more
> storage than before.
>
> The "culprit" here is a header called `X-TUID` which seems to be added by
> `isync` for internal purposes. (Therefore searching the mailing list
> archive for `X-TUID` will lead you to other people that stumbled into
> this issue.)

This is news to me. Thanks for pointing it out.

> Fortunately you can "convince" Git to repack the repository and check for
> file rewrites, which will save you some space. But depending on how large
> the repository is (i.e. how many files), it could take some time and
> memory. (Look in `man git-config` and play with the settings that pertain
> to rename detection.)

I looked at this a bit. It hurt my head. So I didn't try it ;-)

>> So ... this led me to wonder ... Would using a "stable" name based on a
>> checksum be a useful improvement? Naturally, since I am a Git addict, I
>> am thinking of 'git hash-object' ;-)
>
> I think this would be a lovely idea; however, due to the `X-TUID`
> header it would be pointless...

I haven't had a chance to look at `X-TUID`, but I wonder if a hashed file
name couldn't replace it.
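
To make the "stable name" idea concrete, here is a minimal Python sketch
(not anything isync does today, as far as I know). It assumes, going by
Ciprian's description, that the `X-TUID` header is the only part of a
message file that changes between resyncs: it drops that header and then
computes the same ID that 'git hash-object' prints for the remaining
bytes. The function names `strip_header`, `git_blob_id` and `stable_name`
are made up for illustration.

```python
import hashlib


def strip_header(raw: bytes, name: bytes = b"X-TUID") -> bytes:
    """Remove every `name` header (and its folded continuation lines)
    from the header block; the body is left untouched."""
    prefix = name.lower() + b":"
    out = []
    in_headers = True
    skipping = False
    for line in raw.splitlines(keepends=True):
        if in_headers:
            if line in (b"\n", b"\r\n"):
                # First blank line: end of the header block.
                in_headers = False
                skipping = False
            elif skipping and line[:1] in (b" ", b"\t"):
                # Folded continuation of the header being dropped.
                continue
            elif line.lower().startswith(prefix):
                skipping = True
                continue
            else:
                skipping = False
        out.append(line)
    return b"".join(out)


def git_blob_id(data: bytes) -> str:
    """SHA-1 over 'blob <size>\\0' + data, i.e. Git's blob object ID."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def stable_name(raw: bytes) -> str:
    """Hypothetical file name: blob ID of the message minus X-TUID."""
    return git_blob_id(strip_header(raw))
```

With something like this, `stable_name(open(path, "rb").read())` would
give the same name for two copies of a message that differ only in their
`X-TUID` line. Whether other headers also change on a resync is something
I have not checked.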

> However a better discussion would be the following: how to use `isync`
> for archival purposes, including for "de-duping" mail accounts. The
> "archival" is pretty simple: no matter how many times you re-sync your
> inbox from scratch the file names should be consistent -- through
> hashing. The "de-duping" is a little more complicated: say you have
> multiple accounts (personal and for "business") and you forward some of
> them from one another (for accessibility); however you don't want to
> delete forwarded emails; now if you sync all these accounts you'll get
> the same email multiple times, and because of different "routing"
> headers the copies won't match.
>
> However if you "split" the message apart -- headers and body -- the
> headers might have changed but the body will be identical. Now if we can
> devise a way to write the two things apart, we'll end up with a better
> archival solution. Unfortunately this will no longer be a proper
> standard "maildir"; but fortunately with some FUSE one could re-present
> this "archive" as proper maildirs. Bonus points if one also splits the
> email body into multiple MIME parts, and de-dups those also (just think
> of a thread that re-sends the same attachment over and over...)
>
> But perhaps this "archival" use-case is far out of scope of `isync` and a
> tool written from scratch with exactly this purpose in mind would be
> better.
>
> Ciprian.

Probably. Or, for de-duplication one could use a standard tool. I tried
https://github.com/borgbackup/borg on my 16736 Maildir files and it
produced a 1.9G archive. For me, saving 0.6G (relative to the 2.5G Git
repo) does not justify a switch to borg ;-)

- George
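
P.S. The header/body split discussed above can be sketched in a few
lines of Python. This is only an in-memory illustration under stated
assumptions: `Archive`, `split_message`, `add` and `get` are hypothetical
names, SHA-256 is an arbitrary choice of checksum, and the forwarding
server is assumed to leave the body bytes untouched (which is not
guaranteed in practice).

```python
import hashlib
from typing import Dict, List, Tuple


def split_message(raw: bytes) -> Tuple[bytes, bytes]:
    """Split at the first blank line; the separator stays with the headers."""
    cuts = [(raw.find(sep), len(sep)) for sep in (b"\r\n\r\n", b"\n\n")]
    cuts = [(pos, n) for pos, n in cuts if pos != -1]
    if not cuts:
        return raw, b""  # no body at all
    pos, n = min(cuts)   # earliest separator wins
    return raw[:pos + n], raw[pos + n:]


class Archive:
    """Toy store: each distinct body is kept once, headers per message."""

    def __init__(self) -> None:
        self.bodies: Dict[str, bytes] = {}            # digest -> body
        self.messages: List[Tuple[bytes, str]] = []   # (headers, digest)

    def add(self, raw: bytes) -> int:
        headers, body = split_message(raw)
        digest = hashlib.sha256(body).hexdigest()
        self.bodies.setdefault(digest, body)  # stored only the first time
        self.messages.append((headers, digest))
        return len(self.messages) - 1

    def get(self, index: int) -> bytes:
        """Reassemble the original message bytes."""
        headers, digest = self.messages[index]
        return headers + self.bodies[digest]
```

Adding the same mail twice with different "routing" headers leaves one
entry in `bodies` and two in `messages`, and `get` returns each copy
byte for byte. A real tool would still need on-disk storage and
something (FUSE or otherwise) to present the result as a maildir again.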
_______________________________________________
isync-devel mailing list
isync-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/isync-devel