Re: [fossil-users] Question about the file formats.
On Dec 20, 2016 10:59 PM, "John Found" wrote:

> Well, the compression is the last thing I am talking about. It is important, but not essential. I am talking about several people working on one file and then fossil merging the changes automatically (of course, if there are no conflicts in the edits).

I think the answer to your question is that merging depends on knowledge of the structure of the data in order to detect where conflicts do or do not exist.

The structure of text files is "an ordered sequence of variable length records", and the merge algorithm treats non-overlapping changes as independent. This is not always true, but it works often enough to be useful. Because it is not always true, it is important to test after merging and before committing.

The merge algorithm could be modified to work with other data structures, but it would still require the property that non-overlapping changes be independent (have no impact on previous or future data).

With a "binary" format there are many other things that could go wrong. Fixed length records, specific requirements for alignment, and embedded non-symbolic references to other parts of the file are the first few that come to mind.

Without specific knowledge of the structure of the data, merge can't work. Even with knowledge of the structure of text files, it can still get things wrong.

___
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
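To make the "embedded non-symbolic references" point concrete, here is a minimal Python sketch. The container layout (a 4-byte little-endian offset, then a label, then a payload) and the names build_container and read_payload are invented for illustration; no real file format is implied.

```python
import struct

# Hypothetical header: one unsigned 32-bit absolute offset of the payload.
HEADER = struct.Struct("<I")


def build_container(label: bytes, payload: bytes) -> bytes:
    """Pack header + label + payload, recording where the payload starts."""
    payload_offset = HEADER.size + len(label)
    return HEADER.pack(payload_offset) + label + payload


def read_payload(blob: bytes) -> bytes:
    """Follow the stored offset, the way a format-aware reader would."""
    (payload_offset,) = HEADER.unpack_from(blob, 0)
    return blob[payload_offset:]


original = build_container(b"name", b"PAYLOAD")

# A structure-blind edit: splice two bytes into the label, leave the header alone.
edited = original[:HEADER.size] + b"my" + original[HEADER.size:]

print(read_payload(original))
print(read_payload(edited))
```

The insertion does not overlap the header at all, yet it invalidates the offset stored there, so the reader no longer finds the payload where it expects it. A merge that assumes non-overlapping changes are independent would accept such an edit without complaint.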
Re: [fossil-users] Question about the file formats.
On Tue, Dec 20, 2016 at 08:48:27PM +0200, John Found wrote:
> What makes the binary files different from the text files? The presence or absence of 0 bytes does not seem to make serious difference for processing by the same algorithms.

Many text formats allow merging changes from one version to another with minimal context. E.g. let's say you start from a C file and modify a line in the middle in your checkout and then update your tree. Someone else added a new function at the beginning of the file. This creates a conflict, and Fossil will try to resolve it by finding the context of the line you modified in a similar place and then re-applying that change. While this doesn't work all the time for text files, it has a high chance of working. Even if it doesn't work, e.g. because the changes overlap, it provides enough information that a user can typically do the same by hand.

The same kind of tooling could be provided for binary formats, but it rarely exists.

Joerg
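The scenario above (one side edits a line in the middle, the other adds lines at the top) can be sketched as a toy line-based 3-way merge in Python. This is only an illustration built on difflib; it is not Fossil's merge code, and the names changes and merge3 are made up for this sketch.

```python
from difflib import SequenceMatcher


def changes(base, other):
    """Return edits as (base_start, base_end, replacement_lines)."""
    matcher = SequenceMatcher(None, base, other, autojunk=False)
    return [(i1, i2, other[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def merge3(base, ours, theirs):
    """Apply both sides' edits to base; refuse if they touch or overlap."""
    edits = sorted(changes(base, ours) + changes(base, theirs),
                   key=lambda edit: (edit[0], edit[1]))
    merged, pos, prev_end = [], 0, -1
    for start, end, replacement in edits:
        if start <= prev_end:
            # Deliberately conservative: even identical edits on both
            # sides are reported as a conflict in this toy version.
            raise ValueError("conflict near base line %d" % start)
        merged.extend(base[pos:start])   # untouched context
        merged.extend(replacement)       # one side's change
        pos = prev_end = end
    merged.extend(base[pos:])
    return merged


base = ["int a;", "int b;", "int c;", "int d;"]
ours = ["int a;", "int b;", "long c;", "int d;"]        # edit in the middle
theirs = ["void f(void);"] + base                       # addition at the top

print(merge3(base, ours, theirs))
```

The merge succeeds only because the two edits are separated by unchanged base lines. If both sides rewrite the same line, merge3 raises instead of guessing, which corresponds to the case where the user has to resolve the conflict by hand.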
Re: [fossil-users] Question about the file formats.
On Dec 21, 2016 10:57 AM, "Warren Young" wrote:
> That is exactly what I’m talking about in my BMP vs PNG examples. If you wish to discuss a different file type than bitmap graphics, give your own example. Until then, mine is the only concrete example we have available to discuss.

Zip files and similar archives apply here as well, I think (that includes modern office suite formats, many of which are zip files). Without knowing how to dissect them and diff the individual components, Fossil can only perform generic binary delta compression.

I opine, without any proof to back it up, that the compression results would not be appreciably better were Fossil to "know" about such content (for most common file formats), while performance, complexity, and memory use would be negatively impacted.

- stephan

Sent from a mobile device, possibly from bed. Please excuse brevity, typos, and top-posting.
Re: [fossil-users] Question about the file formats.
Well, the compression is the last thing I am talking about. It is important, but not essential. I am talking about several people working on one file and then fossil merging the changes automatically (of course, if there are no conflicts in the edits).

On Tue, 20 Dec 2016 16:58:18 -0700 Warren Young wrote:
> I decided to take up my own challenge. Consider:
> [...]

--
http://fresh.flatassembler.net
http://asm32.info
John Found
Re: [fossil-users] Question about the file formats.
On Dec 20, 2016, at 3:57 PM, Warren Young wrote:
>
>> What if I design some text file format (containing only ascii characters) and it can't be properly processed by fossil?
>
> Then you should post it as a replicable test case for our study.

I decided to take up my own challenge. Consider:

## Create new repo; note initial size
$ f init ../x.fossil
$ ls -lh ../x.fossil
-rw-r--r-- 1 me group 212K Dec 20 16:13 ../x.fossil

## Go grab a free PNG file, and re-save it with Photoshop’s
## Save for Web function to reduce unnecessary differences
$ wget https://upload.wikimedia.org/wikipedia/commons/thumb/3/31/Topographic_Map_of_Bulgaria_Bulgarian.png/120px-Topographic_Map_of_Bulgaria_Bulgarian.png
$ open -a 'Adobe Photoshop CC 2017' 120px-Topographic_Map_of_Bulgaria_Bulgarian.png
$ ls -lh 120px-Topographic_Map_of_Bulgaria_Bulgarian.png
-rw-r--r-- 1 me group 23K Dec 20 16:12 120px-Topographic_Map_of_Bulgaria_Bulgarian.png

## Add it to repo; notice that repo size goes up by 20 kB,
## showing that Fossil’s internal compression managed to
## squeeze an additional 3 kB over what Photoshop gives,
## probably due to metadata compression
$ f add 120px-Topographic_Map_of_Bulgaria_Bulgarian.png
$ f ci -m initial
$ f rebuild --compress --vacuum
$ ls -lh ../x.fossil
-rw-r--r-- 1 me group 232K Dec 20 16:13 ../x.fossil

## Change upper left corner pixel, amounting to only several
## bits of difference in the raw data
$ open -a 'Adobe Photoshop CC 2017' 120px-Topographic_Map_of_Bulgaria_Bulgarian.png
$ ls -lh 120px-Topographic_Map_of_Bulgaria_Bulgarian.png
-rw-r--r-- 1 me group 23K Dec 20 16:14 120px-Topographic_Map_of_Bulgaria_Bulgarian.png

## Check change in; notice that roughly a dozen bits of change in
## the raw data became 28 kB of difference in the repo size!
$ f ci -m '1 px change'
$ f rebuild --compress --vacuum
$ ls -lh ../x.fossil
-rw-r--r-- 1 me group 260K Dec 20 16:14 ../x.fossil

Repeating that test with TIFF and PSD files didn’t give as small a difference in the resulting Fossil repo size between checkins as I’d expected, but on investigating I found that Photoshop writes a bunch of stuff into the metadata that changes on every save (e.g. timestamps, UUIDs…), which balloons the diffs.

Switching to Windows BMP fixes this: a 1px change results in a negligible change in the repo size, because only about a dozen bits change in the raw data. (Windows BMP has very little metadata.)
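The PNG-versus-BMP effect can be approximated without Photoshop. PNG stores its pixel data DEFLATE-compressed, so the sketch below uses Python's zlib as a stand-in: it flips one bit in a made-up block of "raw pixel" bytes and compares how many bytes differ before and after compression. The data is synthetic and count_differing_bytes is a helper invented for this example; the exact numbers depend on the data and the zlib build, so this is not a reproduction of the repo sizes above.

```python
import zlib


def count_differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions that differ, plus any difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


# Stand-in for uncompressed pixel data (synthetic, not a real image).
raw_before = bytes((x * y) % 251 for y in range(64) for x in range(64))
raw_after = bytes([raw_before[0] ^ 0x01]) + raw_before[1:]  # flip one bit

packed_before = zlib.compress(raw_before, 9)
packed_after = zlib.compress(raw_after, 9)

print("raw bytes differing:       ",
      count_differing_bytes(raw_before, raw_after))
print("compressed bytes differing:",
      count_differing_bytes(packed_before, packed_after))
```

In the raw form exactly one byte differs, which is the situation a delta encoder handles well. The compressed streams differ at least in the encoded data and in the trailing checksum, and any byte that shifts position in the stream counts as changed from that point on.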
Re: [fossil-users] Question about the file formats.
On Dec 20, 2016, at 12:35 PM, John Found wrote:
>
> Under "fossil algorithms" I mean two (in my understanding most important in what is called "version control"): diff algorithm and 3-way merge algorithm.

When I said that Fossil can’t diff two binary files, I meant that it couldn’t display a sensible difference to the terminal when you give the “fossil diff” command.

However, Fossil *can store* the difference between any two files, regardless of binary vs. text, as I suggested with my uncompressed TIFF example. Fossil will even do so for files like PNGs where the worst case is that a single bit change in the original file could potentially change every byte in the output file, making the internal diffs Fossil stores very large, possibly to the point that there’s no value in delta compression at all, so that Fossil must simply store both versions in toto. But Fossil will store those versions, and retrieve them.

As for merging, as long as the two versions Fossil is trying to merge have sufficient context between the changes to safely do the merge automatically, Fossil will do so. Just as with diffing, if you use compression or encryption or otherwise cause the merged parts to overlap, Fossil won’t be able to do the merge automatically.

This is no different from what happens with what we choose to call “text” files: if two users make a change to the same area of a single file, chances are high that Fossil will refuse to attempt an automatic merge, since there isn’t enough context between the changes for Fossil to be sure it isn’t creating a mess in the merge area.

> Or what makes the 3-way merge algorithm not working on binary files.

Except for whole-file compression and similar cases (e.g. pre-checkin encryption) I don’t think you can create a replicable test that shows that it doesn’t work.

> What if I design some text file format (containing only ascii characters) and it can't be properly processed by fossil?

Then you should post it as a replicable test case for our study. Until you can do both things (i.e. cause a problem and create a replicable test case for it) you’re just speculating.

> Another example: Every binary file can be BASE64 encoded and it will be turned into a valid text file. Fossil will not detect it as a binary. But whether this file will be processed properly on diffs and merges? Probably not. But why?

I don’t believe such an encoding will have a meaningful effect on any test, except that it effectively adds newlines every 70-some characters, where the original binary data might not have them, so “binary” data would now be detected as “text” data.

But, if the problem is that delta compression is inefficient with a given binary file because nearly every byte changes when you change just one small bit of the input file, then the same will be true of the Base64-encoded version.
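As a small illustration of how Base64 text behaves under line-oriented comparison, the Python sketch below encodes a made-up 1024-byte "binary" file with base64.encodebytes (which emits 76-character lines) and counts how many lines differ after an in-place change and after a one-byte insertion. The data and the helper differing_lines are invented for this example.

```python
import base64


def differing_lines(a: bytes, b: bytes) -> int:
    """Count line positions at which the Base64 texts of a and b differ."""
    lines_a = base64.encodebytes(a).splitlines()
    lines_b = base64.encodebytes(b).splitlines()
    paired = sum(x != y for x, y in zip(lines_a, lines_b))
    return paired + abs(len(lines_a) - len(lines_b))


data = bytes(range(256)) * 4          # synthetic 1024-byte "binary" file
overwritten = b"\xff" + data[1:]      # change one byte in place
inserted = b"\xff" + data             # insert one byte at the front

total_lines = len(base64.encodebytes(data).splitlines())
print("lines in encoded file:     ", total_lines)
print("lines changed by overwrite:", differing_lines(data, overwritten))
print("lines changed by insertion:", differing_lines(data, inserted))
```

With this input the in-place change alters a single line, while the insertion shifts the 3-byte grouping and changes every line. The encoding makes the file look like text, but a change that disturbs the underlying bytes broadly disturbs the encoded lines just as broadly.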
Re: [fossil-users] Question about the file formats.
Thus said John Found on Tue, 20 Dec 2016 21:35:44 +0200:

> For example, I can't see what is the problem to make diff of binary
> files. As a result one will have the bytes that have to be
> inserted/deleted from the first file in order to turn it into the
> second. (Or I am wrong and that is why I ask such vague questions).

It certainly is possible, though not currently implemented as far as I know. The binary diff can be described simply as the deltas required to get from A to B. You might experiment with the following test commands:

test-delta
test-delta-analyze
test-delta-apply
test-delta-create

It would seem that what you're asking for is a binary patch that perhaps takes advantage of the delta encoding algorithm? This would certainly require a special binary that understands the format of the data, but I don't see why this shouldn't be possible.

I believe Fossil stores the baseline and deltas going back in time, so to open the most recent version of a file, it just gets the latest artifact, but to get older versions, it has to apply deltas.

Perhaps the following will address some of your questions:

http://www.fossil-scm.org/index.html/doc/trunk/www/delta_format.wiki
http://www.fossil-scm.org/index.html/doc/trunk/www/delta_encoder_algorithm.wiki
http://www.fossil-scm.org/index.html/doc/trunk/www/concepts.wiki

> Or what makes the 3-way merge algorithm not working on binary files.
> The line organization of the text files? Something else?

I'm not sure what it uses for binary files, but there is definitely a delta component being generated and stored. As a test, I added a large binary AVI to a Fossil repository.

Before:

SIZE     DATE         FILE
217088   Dec 20 14:47 new.fossil
15155678 Dec 20 14:48 MVI_7509.AVI

After:

15171584 Dec 20 14:48 new.fossil

I then used vi to update a few bytes in the file and committed:

30117888 Dec 20 14:49 new.fossil

It did double in size (not sure why, but I suspect it has something to do with establishing a baseline for the delta).

But I repeated the edit with vi and changed some other bytes, and this time it didn't grow very much at all:

30121984 Dec 20 14:50 new.fossil

And again:

30130176 Dec 20 14:54 new.fossil

So it is clearly efficiently storing them. Experimenting with the test-delta-create command, I get:

15155678 Dec 20 15:12 MVI_7509.AVI.first
15155695 Dec 20 15:13 MVI_7509.AVI.second

$ fossil test-delta-create MVI_7509.AVI.first MVI_7509.AVI.second MVI_7509.AVI.delta

66 Dec 20 15:14 MVI_7509.AVI.delta

So the delta is only 66 bytes:

$ cat MVI_7509.AVI.delta
up7k 3sQ@0,2:ab4yT@3sQ,4:zzjkf2@8pr,5:djfjkufdb@9Ut,5: fff 37s_Sf;

Could I share this with someone? Sure. They could then use ``fossil test-delta-apply'' to use the ``patch.''

> What if I design some text file format (containing only ascii
> characters) and it can't be properly processed by fossil?

If it contains only ASCII characters then Fossil will have no problems handling it as a text file. It won't matter what arrangement of characters you place in such a file because they will be only ASCII.

> Another example: Every binary file can be BASE64 encoded and it will
> be turned into a valid text file. Fossil will not detect it as a
> binary. But whether this file will be processed properly on diffs and
> merges?

Sure, you'll get an ASCII diff of the file, but it won't really mean much to describe the BASE64 diff between two files. However, if that's what you want, commit all binaries as BASE64 encoded files. Then you'll have to BASE64 decode the file as part of your ``build'' process, and you can even send patches/diffs of them.

Andy
--
TAI64 timestamp: 40005859adef
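The idea of a delta as "copy these ranges from the old file, insert these literal bytes" can be sketched in a few lines of Python. This is a generic copy/insert scheme built on difflib, written only to illustrate the concept; it does not produce or read Fossil's actual delta format (see the delta_format.wiki link above for that), and make_delta and apply_delta are names invented here.

```python
from difflib import SequenceMatcher


def make_delta(source: bytes, target: bytes) -> list:
    """Describe target as copies from source plus literal inserts."""
    ops = []
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # offset, length in source
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))   # literal bytes
        # "delete" needs no op: those source bytes are simply never copied
    return ops


def apply_delta(source: bytes, ops: list) -> bytes:
    """Rebuild the target from the source and a list of copy/insert ops."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += source[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


first = b"header" + bytes(range(200)) + b"trailer"
second = first[:50] + b"zz" + first[50:120] + b"XY" + first[122:]

delta = make_delta(first, second)
print(delta)
print(apply_delta(first, delta) == second)
```

Only the inserted literals need to be stored in full; everything else is a reference into the older version. That is why a few edited bytes in a 15 MB file can be described by a delta of a few dozen bytes.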
Re: [fossil-users] Question about the file formats.
I am not talking about the fossil heuristics in detection of what file is binary and what file is text. Imagine all detection is switched off.

Under "fossil algorithms" I mean two (in my understanding most important in what is called "version control"): diff algorithm and 3-way merge algorithm.

For example, I can't see what is the problem to make diff of binary files. As a result one will have the bytes that have to be inserted/deleted from the first file in order to turn it into the second. (Or I am wrong and that is why I ask such vague questions).

Or what makes the 3-way merge algorithm not working on binary files. The line organization of the text files? Something else?

What if I design some text file format (containing only ascii characters) and it can't be properly processed by fossil?

Another example: Every binary file can be BASE64 encoded and it will be turned into a valid text file. Fossil will not detect it as a binary. But whether this file will be processed properly on diffs and merges? Probably not. But why?

On Tue, 20 Dec 2016 12:13:43 -0700 Warren Young wrote:
> On Dec 20, 2016, at 11:48 AM, John Found wrote:
> >
> > I know that fossil (and most other version control systems) can handle properly only text source files.
>
> Says who?
> [...]

--
http://fresh.flatassembler.net
http://asm32.info
John Found
Re: [fossil-users] Question about the file formats.
On Dec 20, 2016, at 11:48 AM, John Found wrote:
>
> I know that fossil (and most other version control systems) can handle properly only text source files.

Says who?

There are some features of Fossil that simply don’t work when given a binary file, like “fossil diff,” but if you think this is a missing feature (or even a bug!) I’d have to ask how you think it should work.

Consider the case of a PNG. How would you expect “fossil diff” to show the difference between two PNGs?

Now multiply by the number of other binary file formats.

It is also the case that checking in compressed binary files is generally a mistake, since that will largely defeat the built-in diffing and compression mechanisms in Fossil, bloating the repository on every checkin.

(For some use cases, you can now avoid this problem with the new unversioned files feature.)

Both of those classes of problem aside, Fossil will certainly accept “binary” files.

> What makes the binary files different from the text files? The presence or absence of 0 bytes does not seem to make serious difference for processing by the same algorithms.

Fossil uses a heuristic to decide if a given file is “binary” or not, and it has more to do with the chance that it will display properly when served to a web browser than anything else.

Because it is a heuristic, it is possible to trick it. For example, a file with very long text lines may be misdetected as “binary” because the detector runs out of buffer space looking for the first line terminator.

> What properties a file format needs in order to be processed properly by fossil?

Give a specific use case. The answer differs depending on what Fossil commands you want to be able to use on the files you check in.

I gave the “diff” case above, but that is not the only command that changes behavior depending on whether the binary file heuristic decides that the file is “binary.”

I’m putting “binary” in quotes because it is not a clear-cut distinction. For Fossil’s purposes, an uncompressed TIFF is “less binary” than a PNG file, because it is possible to do useful levels of delta compression on the TIFF but not on the PNG.

> Is it enough for a file to contain only utf-8 characters or some other properties are mandatory as well?

If you want to know the heuristic’s current implementation details, study looks_like_utf8() in src/lookslike.c.

(There is also a UTF-16 version of that function, typically needed on Windows.)

> Is it possible to define such binary file format that to be properly processed by fossil (of course, after removing the explicit binary file checks)?
>
> Or the opposite question: Is it possible to compose such text file that to not be processed properly by fossil algorithms?

Both questions should be answered by a study of that heuristic function.

If you have further questions, make your questions more specific. Your current questions are so vague that I can give the answer “Yes” to both, and be correct. Not useful, I realize, but correct. :)
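To show what "a heuristic that can be tricked" looks like in practice, here is a toy detector in Python. It is emphatically not the logic of looks_like_utf8() in src/lookslike.c; the two rules (a NUL byte, or a line longer than some limit) and the 8192-byte limit are assumptions chosen only to mirror the two cases mentioned in this thread.

```python
def looks_binary(data: bytes, max_line_length: int = 8192) -> bool:
    """Toy heuristic: NUL bytes or an over-long line count as 'binary'.

    Both rules and the default limit are hypothetical, for illustration only.
    """
    if b"\x00" in data:
        return True
    return any(len(line) > max_line_length for line in data.split(b"\n"))


samples = {
    "ordinary source text": b"int main(void) {\n    return 0;\n}\n",
    "PNG-like header": b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR",
    "one very long text line": b"x" * 10000,
    "same text, wrapped": b"\n".join([b"x" * 80] * 125),
}

for name, data in samples.items():
    print(name, "->", "binary" if looks_binary(data) else "text")
```

The third sample is pure ASCII and still gets classified as binary, while wrapping the same characters into 80-column lines flips the verdict. The classification says something about how the bytes are laid out, not about whether diff or merge would produce meaningful results on them.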