On Thu, Feb 28, 2013 at 5:04 PM, Magnus Thor Torfason <zulutime....@gmail.com> wrote:
> Hey all,

Sorry that I have to disagree with what most people said. I guess Mark got closest to what the current intent is.

> I've been following the discussion about FSFS format7, and had a question:
>
> Is there any chance that the format would improve storage efficiency for
> documents that are stored as compressed (zipped) bundles of XML files and
> other resource files (read: MS Office documents, but OpenOffice is similar)?

Yes, exactly that: There is a *chance* that those will be stored more efficiently.

The thing about these formats is that they are ZIP-compressed file trees, with each file being something like an embedded picture, the main text body, the template etc. ZIP, in contrast to .tar.gz, compresses each of these files individually and then essentially concatenates them into the result file. As long as you don't change the template or any of the existing pictures, for instance, large parts of the file should remain unchanged. PowerPoint presentations are probably the ones that benefit most from this scheme.

Format7 will (hopefully) be able to deal with a few hundred kB of inserted / removed data and still find all matching regions. This is exactly what we expect from office files: changes should affect some of the opaque data blocks but leave the other ones alone.

> I'm finding that making very small changes in big documents (with embedded
> images) results in rapid growth of the repository, since the binary diff
> algorithm seems to not be able to figure out efficient deltas for this type
> of documents, even though analysis of the contents shows that they are
> almost unchanged.

In line with what others already said about this: there will be no format-specific delta algorithms. They would make SVN susceptible to attacks by manipulated user data (think of all the security issues that stem from invalid pictures or zip files).
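The ZIP behaviour described above (each member compressed on its own) can be checked with a small experiment. The following is a minimal Python sketch, not Subversion code. The member names (`template.xml`, `media/image1.bin`, `body.xml`) and the helpers `build_zip`, `compressed_members` and `unchanged_members` are hypothetical and exist only for this illustration. It also assumes that the tool writing the archive re-emits unchanged members byte-for-byte, which a real office application may or may not do on save.

```python
import io
import struct
import zipfile


def build_zip(members: dict) -> bytes:
    """Build an in-memory ZIP with a fixed timestamp so output is reproducible."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in members.items():
            info = zipfile.ZipInfo(name, date_time=(2013, 2, 28, 17, 4, 0))
            info.compress_type = zipfile.ZIP_DEFLATED
            zf.writestr(info, data)
    return buf.getvalue()


def compressed_members(archive: bytes) -> dict:
    """Map each member name to its raw (still compressed) bytes in the archive."""
    out = {}
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for info in zf.infolist():
            # Local file header: 30 fixed bytes; the name and extra-field
            # lengths are two little-endian 16-bit values at offset 26.
            name_len, extra_len = struct.unpack_from(
                "<HH", archive, info.header_offset + 26
            )
            start = info.header_offset + 30 + name_len + extra_len
            out[info.filename] = archive[start:start + info.compress_size]
    return out


def unchanged_members(old: bytes, new: bytes) -> list:
    """Names of members whose compressed bytes are identical in both archives."""
    a, b = compressed_members(old), compressed_members(new)
    return sorted(name for name in a if name in b and a[name] == b[name])


if __name__ == "__main__":
    base = {
        "template.xml": b"<style>corporate</style>" * 300,
        "media/image1.bin": bytes(range(256)) * 40,
        "body.xml": b"<p>Hello</p>" * 200,
    }
    # Only the text body changes between the two "revisions".
    edited = {**base, "body.xml": b"<p>Hello, world</p>" * 200}
    print(unchanged_members(build_zip(base), build_zip(edited)))
```

Under these assumptions, only the compressed bytes of the edited member differ. The others form long identical runs that a generic binary delta can match, provided it can resynchronise after the changed region.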
The furthest that we might go (not planned, though) is to have a set of alternative generic compression strategies plus an equally generic way to choose the most suitable one among them. Again, that is not planned for format7.

> This may be outside the scope of format7, but I thought I'd ask the
> question nevertheless.

No, it's spot on. But there will only be general algorithmic improvements that "happen" to help in your case.

There is another idea that I had concerning efficient storage of office files: Templates and corporate ID data should result in long, identical sub-sections that can be found in many files. We might be able to identify these common blocks and store them only once. So far, I haven't tagged this idea with a target version.

-- Stefan^2.

--
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download
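P.S. The "store common blocks only once" idea above can be illustrated in a few lines. This is only a sketch under stated assumptions, not a design for FSFS: it uses content-defined chunking with a Gear-style rolling hash (a generic technique chosen here for illustration, not something the mail specifies), and `chunks`, `BlockStore`, the gear table and the tiny average chunk size are all hypothetical. A real implementation would use far larger chunks plus minimum and maximum chunk sizes.

```python
import hashlib
import random

# Fixed table of per-byte random values for the rolling hash (illustrative).
_table_rng = random.Random(7)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]
TOP_BITS = 6  # cut when the top 6 hash bits are zero: ~64-byte average chunks


def chunks(data: bytes):
    """Split data at boundaries that depend only on the preceding 32 bytes."""
    h, start = 0, 0
    for i, byte in enumerate(data):
        # Each step shifts older bytes out, so h covers a 32-byte window.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h >> (32 - TOP_BITS) == 0:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]


class BlockStore:
    """Stores each distinct chunk once, keyed by its SHA-256 digest."""

    def __init__(self):
        self.blocks = {}

    def put(self, data: bytes) -> list:
        """Store data and return the list of chunk keys needed to rebuild it."""
        recipe = []
        for chunk in chunks(data):
            key = hashlib.sha256(chunk).digest()
            self.blocks.setdefault(key, chunk)
            recipe.append(key)
        return recipe

    def get(self, recipe: list) -> bytes:
        return b"".join(self.blocks[key] for key in recipe)

    def stored_bytes(self) -> int:
        return sum(len(chunk) for chunk in self.blocks.values())
```

Because the cut points depend on content rather than on file offsets, a template block that sits at different positions in two files still breaks into mostly the same chunks, so those chunks are stored only once.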