Re: [fossil-users] Question about the file formats.

2016-12-21 Thread Scott Robison
On Dec 20, 2016 10:59 PM, "John Found" wrote:

> Well, the compression is the last thing I am talking about. It is
> important, but not essential.
>
> I am talking about several people working on one file and then fossil
> merging the changes automatically (of course, if there are no conflicts
> in the edits).


I think the answer to your question is that merging depends on knowledge
of the structure of the data in order to detect where conflicts do or do not
exist. The structure of text files is "an ordered sequence of variable-length
records", and the merge algorithm treats non-overlapping changes as
independent. That assumption is not always true, but it holds often enough to
be useful. Because it is not always true, it is important to test after
merging and before committing.

The merge algorithm could be modified to work with other data structures,
but it would still require the property that non-overlapping changes be
independent (have no impact on earlier or later data). With a "binary"
format there are many other things that could go wrong: fixed-length
records, specific alignment requirements, and embedded non-symbolic
references to other parts of the file (for example, raw byte offsets) are
the first few that come to mind.

Without specific knowledge of the structure of the data, merge can't work.
Even with knowledge of the structure of text files, it can still get things
wrong.
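To make the "non-overlapping changes are independent" assumption concrete,
here is a minimal Python sketch of a line-based three-way merge. It is not
Fossil's merge code: the helper names line_edits and merge3 are made up for
this illustration, and it leans on Python's difflib to find what each side
changed relative to the common ancestor. Edits whose base ranges overlap or
touch are reported as a conflict; everything else is applied blindly.

```python
import difflib


def line_edits(base, other):
    """Return the edits turning base into other as (start, end, new_lines)."""
    matcher = difflib.SequenceMatcher(None, base, other, autojunk=False)
    return [(i1, i2, tuple(other[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


def merge3(base, ours, theirs):
    """Merge two descendants of base; raise ValueError on a conflict."""
    a, b = line_edits(base, ours), line_edits(base, theirs)
    for x in a:
        for y in b:
            # Ranges that overlap or touch count as a conflict, unless both
            # sides made exactly the same edit.
            if x != y and x[0] <= y[1] and y[0] <= x[1]:
                raise ValueError(f"conflict near base line {max(x[0], y[0]) + 1}")
    merged, pos = [], 0
    for start, end, new in sorted(set(a + b)):
        merged.extend(base[pos:start])   # untouched records before the edit
        merged.extend(new)               # the edit itself
        pos = end
    merged.extend(base[pos:])
    return merged


base = ["int a = 1;", "int b = 2;", "int c = 3;", "int d = 4;", "int e = 5;"]
ours = ["int a = 10;"] + base[1:]        # we changed the first line
theirs = base[:4] + ["int e = 50;"]      # they changed the last line
print(merge3(base, ours, theirs))
```

A clean result here only means the edits did not touch the same records. If
one side renames a function and the other adds a call to the old name
somewhere else, the sketch merges without complaint, which is why testing
after the merge still matters.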
___
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users


Re: [fossil-users] Question about the file formats.

2016-12-21 Thread Joerg Sonnenberger
On Tue, Dec 20, 2016 at 08:48:27PM +0200, John Found wrote:
> What makes the binary files different from the text files? The presence or
> absence of 0 bytes does not seem to make a serious difference for processing
> by the same algorithms.

Many text formats allow merging changes from one version to another with
minimal context. E.g. let's say you start from a C file and modify a line
in the middle in your checkout and then update your tree. Someone else
added a new function at the beginning of the file. This creates a
conflict, and Fossil will try to resolve it by finding the context of the
line you modified in a similar place and then re-applying that change. While
this doesn't work all the time for text files, it has a high chance of
working. Even if it doesn't work, e.g. because the changes overlap, it
provides enough information that a user can typically do the same by hand.
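The idea of re-applying a change by finding its context can be sketched in a
few lines of Python. This is closer to how a patch tool relocates a hunk than
to Fossil's actual merge implementation, and reapply_change and its
parameters are hypothetical names used only for this illustration. The change
is applied only if the surrounding context is found exactly once.

```python
def reapply_change(lines, before, old, new, after):
    """Replace old with new where it sits between the given context lines."""
    pattern = before + old + after
    width = len(pattern)
    hits = [i for i in range(len(lines) - width + 1)
            if lines[i:i + width] == pattern]
    if len(hits) != 1:
        # Context missing or ambiguous: leave it to a human.
        raise LookupError("context not found exactly once; merge by hand")
    start = hits[0] + len(before)
    return lines[:start] + new + lines[start + len(old):]


# Someone else added a function at the top, so the edited line moved down.
upstream = [
    "int helper(void) {", "  return 0;", "}",
    "int main(void) {", "  int x = 1;", "  return x;", "}",
]
merged = reapply_change(
    upstream,
    before=["int main(void) {"],
    old=["  int x = 1;"],
    new=["  int x = 2;"],
    after=["  return x;"],
)
print("\n".join(merged))
```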

The same kind of tooling could be provided for binary formats, but it
rarely exists.

Joerg


Re: [fossil-users] Question about the file formats.

2016-12-21 Thread Stephan Beal
On Dec 21, 2016 10:57 AM, "Warren Young"  wrote:


> That is exactly what I’m talking about in my BMP vs PNG examples.
>
> If you wish to discuss a different file type than bitmap graphics,
> give your own example.  Until then, mine is the only concrete example we
> have available to discuss.


Zip files and similar archives apply here as well, I think (that includes
modern office suite formats, many of which are zip files). Without
knowing how to dissect them and diff the individual components, Fossil can
only perform generic binary delta compression. I opine, without any proof to
back it up, that the compression results would not be appreciably better
were Fossil to "know" about such content (for most common file formats),
while performance, complexity, and memory costs would be negatively
impacted.

- stephan


Re: [fossil-users] Question about the file formats.

2016-12-20 Thread John Found
Well, the compression is the last thing I am talking about. It is important,
but not essential.

I am talking about several people working on one file and then fossil merging
the changes automatically (of course, if there are no conflicts in the edits).



-- 
http://fresh.flatassembler.net
http://asm32.info
John Found 


Re: [fossil-users] Question about the file formats.

2016-12-20 Thread Warren Young
On Dec 20, 2016, at 3:57 PM, Warren Young  wrote:
> 
>> What if I design some text file format (containing only ascii characters) and
>> it can't be properly processed by fossil?
> 
> Then you should post it as a replicable test case for our study.

I decided to take up my own challenge.  Consider:

## Create new repo; note initial size
$ f init ../x.fossil
$ ls -lh ../x.fossil 
-rw-r--r--  1 me   group   212K Dec 20 16:13 ../x.fossil

## Go grab a free PNG file, and re-save it with Photoshop’s
## Save for Web function to reduce unnecessary differences
$ wget https://upload.wikimedia.org/wikipedia/commons/thumb/3/31/Topographic_Map_of_Bulgaria_Bulgarian.png/120px-Topographic_Map_of_Bulgaria_Bulgarian.png
$ open -a 'Adobe Photoshop CC 2017' 120px-Topographic_Map_of_Bulgaria_Bulgarian.png
$ ls -lh 120px-Topographic_Map_of_Bulgaria_Bulgarian.png
-rw-r--r--  1 me   group    23K Dec 20 16:12 120px-Topographic_Map_of_Bulgaria_Bulgarian.png

## Add it to repo; notice that repo size goes up by 20 kB,
## showing that Fossil’s internal compression managed to
## squeeze an additional 3 kB over what Photoshop gives,
## probably due to metadata compression
$ f add 120px-Topographic_Map_of_Bulgaria_Bulgarian.png 
$ f ci -m initial
$ f rebuild --compress --vacuum
$ ls -lh ../x.fossil 
-rw-r--r--  1 me   group   232K Dec 20 16:13 ../x.fossil

## Change upper left corner pixel, amounting to only several
## bits of difference in the raw data
$ open -a 'Adobe Photoshop CC 2017' 120px-Topographic_Map_of_Bulgaria_Bulgarian.png
$ ls -lh 120px-Topographic_Map_of_Bulgaria_Bulgarian.png
-rw-r--r--  1 me   group    23K Dec 20 16:14 120px-Topographic_Map_of_Bulgaria_Bulgarian.png

## Check change in; notice that roughly a dozen bits of change in
## the raw data became 28 kB of difference in the repo size!
$ f ci -m '1 px change'
$ f rebuild --compress --vacuum
$ ls -lh ../x.fossil 
-rw-r--r--  1 me   group   260K Dec 20 16:14 ../x.fossil


Repeating that test with TIFF and PSD files didn’t give as small a difference
in the resulting Fossil repo size between checkins as I’d expected, but on
investigating I found that Photoshop writes a bunch of metadata that changes
on every save (e.g. timestamps, UUIDs…), which balloons the diffs.

Switching to Windows BMP fixes this: a 1px change results in a negligible 
change in the repo size, because only about a dozen bits change in the raw 
data.  (Windows BMP has very little metadata.)


Re: [fossil-users] Question about the file formats.

2016-12-20 Thread Warren Young
On Dec 20, 2016, at 12:35 PM, John Found  wrote:
> 
> By "fossil algorithms" I mean the two that are, in my understanding, the most
> important in what is called "version control": the diff algorithm and the
> 3-way merge algorithm.

When I said that Fossil can’t diff two binary files, I meant that it couldn’t 
display a sensible difference to the terminal when you give the “fossil diff” 
command.  However, Fossil *can store* the difference between any two files, 
regardless of binary vs. text, as I suggested with my uncompressed TIFF example.

Fossil will even do so for files like PNGs where the worst case is that a 
single bit change in the original file could potentially change every byte in 
the output file, making the internal diffs Fossil stores very large, possibly 
to the point that there’s no value in delta compression at all, so that Fossil 
must simply store both versions in toto.  But Fossil will store those versions, 
and retrieve them.
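The PNG worst case is easy to reproduce in miniature. PNG image data is
DEFLATE-compressed, so the following Python sketch uses zlib as a stand-in:
it flips one bit in an otherwise identical input and counts how many byte
positions differ before and after compression. The sample data and the helper
differing_bytes are made up for the illustration, and the exact counts will
depend on the data and on the zlib build, so none are quoted here.

```python
import zlib


def differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions where a and b differ, plus any difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


# Deterministic sample data standing in for uncompressed pixel data.
raw1 = bytes((i * i + i // 7) % 251 for i in range(20000))
raw2 = bytes([raw1[0] ^ 0x01]) + raw1[1:]   # flip one bit in the first byte

packed1 = zlib.compress(raw1)
packed2 = zlib.compress(raw2)

print("raw bytes differing:       ", differing_bytes(raw1, raw2))
print("compressed bytes differing:", differing_bytes(packed1, packed2))
```

A byte-oriented delta of the two raw inputs only has to describe one changed
byte; a delta of the two compressed outputs has to describe every position
the second count reports.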

As for merging, as long as the two versions Fossil is trying to merge have 
sufficient context between the changes to safely do the merge automatically, 
Fossil will do so.

Just as with diffing, if you use compression or encryption or otherwise cause 
the merged parts to overlap, Fossil won’t be able to do the merge automatically.

This is no different from what we choose to call “text” files, where if two
users make a change to the same area of a single file, chances are high that
Fossil will refuse to attempt an automatic merge, since there isn’t enough
context between the changes for Fossil to be sure it isn’t creating a mess in
the merge area.

> Or what makes the 3-way merge algorithm not working on binary files.

Except for whole-file compression and similar cases (e.g. pre-checkin 
encryption) I don’t think you can create a replicable test that shows that it 
doesn’t work.

> What if I design some text file format (containing only ascii characters) and
> it can't be properly processed by fossil?

Then you should post it as a replicable test case for our study.  Until you can 
do both things — i.e. cause a problem and create a replicable test case for it 
— you’re just speculating.

> Another example: Every binary file can be BASE64 encoded and it will be
> turned into a valid text file. Fossil will not detect it as a binary. But
> will this file be processed properly on diffs and merges? Probably not.
> But why?

I don’t believe such an encoding will have a meaningful effect on any test,
except that it effectively adds newlines every 70-some characters, where the
original binary data might not have any, so “binary” data would now be
detected as “text” data.

But, if the problem is that delta compression is inefficient with a given 
binary file because nearly every byte changes when you change just one small 
bit of the input file, then the same will be true of the Base64-encoded version.


Re: [fossil-users] Question about the file formats.

2016-12-20 Thread Andy Bradford
Thus said John Found on Tue, 20 Dec 2016 21:35:44 +0200:

> For example, I  can't see what is  the problem to make  diff of binary
> files.  As  a  result  one  will  have  the  bytes  that  have  to  be
> inserted/deleted from  the first  file in  order to  turn it  into the
> second. (Or I am wrong and that is why I ask such vague questions).

It certainly is  possible, though not currently implemented as  far as I
know. The binary diff can be  described simply as the deltas required to
get from A to B. You might experiment with the following test commands:

test-delta
test-delta-analyze
test-delta-apply
test-delta-create

It would seem that what you're asking for is a binary patch that perhaps
takes advantage  of the delta  encoding algorithm? This  would certainly
require a special binary that understands  the format of the data, but I
don't see why this shouldn't be possible.
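As a rough picture of what "the deltas required to get from A to B" means,
here is a toy copy/insert delta in Python. It is a sketch under stated
assumptions, not Fossil's delta format or encoder: make_delta and apply_delta
are invented names, the in-memory list of operations is made up for the
example, and difflib is used for matching even though it would be far too
slow for a 15 MB file.

```python
import difflib


def make_delta(src: bytes, dst: bytes):
    """Describe dst as copies out of src plus literal inserts (toy format)."""
    ops = []
    matcher = difflib.SequenceMatcher(None, src, dst, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # offset and length in src
        elif j2 > j1:
            ops.append(("insert", dst[j1:j2]))      # new bytes carried literally
        # Pure deletions need no entry: those source bytes are never copied.
    return ops


def apply_delta(src: bytes, ops) -> bytes:
    """Rebuild the target from the source and a delta made by make_delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += src[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


first = bytes((i * 7 + 3) % 256 for i in range(1000))
second = first[:400] + b"PATCH" + first[410:]       # overwrite a few bytes

delta = make_delta(first, second)
literal = sum(len(op[1]) for op in delta if op[0] == "insert")
print(len(delta), "operations,", literal, "literal bytes")
assert apply_delta(first, delta) == second
```

Anyone holding the first file and the delta can rebuild the second file,
which is the sense in which such a delta works as a binary patch.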

I believe Fossil  stores the baseline and deltas going  back in time, so
to open  the most  recent version  of a  file, it  just gets  the latest
artifact, but to get older versions, it has to apply deltas.

Perhaps the following will address some of your questions:

http://www.fossil-scm.org/index.html/doc/trunk/www/delta_format.wiki
http://www.fossil-scm.org/index.html/doc/trunk/www/delta_encoder_algorithm.wiki
http://www.fossil-scm.org/index.html/doc/trunk/www/concepts.wiki


> Or what makes  the 3-way merge algorithm not working  on binary files.
> The line organization of the text files? Something else?

I'm not sure what it uses for binary files, but there is definitely a
delta component being generated and stored. As a test, I added a large
binary AVI to a Fossil repository.

Before:

SIZE     DATE         FILE
217088   Dec 20 14:47 new.fossil
15155678 Dec 20 14:48 MVI_7509.AVI

After:

15171584 Dec 20 14:48 new.fossil

I then used vi to update a few bytes in the file and committed:

30117888 Dec 20 14:49 new.fossil

It did double in size (not sure why, but I suspect it has something to
do with establishing a baseline for the delta). I then repeated the edit
with vi, changing some other bytes, and this time it didn't grow very
much at all:

30121984 Dec 20 14:50 new.fossil

And again:

30130176 Dec 20 14:54 new.fossil

So it is clearly efficiently storing them.

Experimenting with the test-delta-create command, I get:

15155678 Dec 20 15:12 MVI_7509.AVI.first
15155695 Dec 20 15:13 MVI_7509.AVI.second

$ fossil test-delta-create MVI_7509.AVI.first MVI_7509.AVI.second MVI_7509.AVI.delta

66   Dec 20 15:14 MVI_7509.AVI.delta

So the delta is only 66 bytes:

$ cat MVI_7509.AVI.delta 
up7k
3sQ@0,2:ab4yT@3sQ,4:zzjkf2@8pr,5:djfjkufdb@9Ut,5:
fff
37s_Sf;

Could I  share this  with someone?  Sure. They  could then  use ``fossil
test-delta-apply'' to use the ``patch.''


> What  if  I  design  some  text file  format  (containing  only  ascii
> characters) and it can't be properly processed by fossil?

If it contains  only ASCII characters then Fossil will  have no problems
handling  it  as a  text  file.  It  won't  matter what  arrangement  of
characters you place in such a file because they will be only ASCII.


> Another example: Every  binary file can be BASE64 encoded  and it will
> be turned  into a  valid text  file. Fossil  will not  detect it  as a
> binary. But whether this file will  be processed properly on diffs and
> merges?

Sure, you'll get an ASCII diff of the file, but a BASE64 diff between
two files won't really mean much. However, if that's what you want,
commit all binaries as BASE64-encoded files. Then you'll have to
BASE64-decode the files as part of your ``build'' process, and you can
even send patches/diffs of them.
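One reason a BASE64 diff tends not to mean much: BASE64 works on 3-byte
groups and, with MIME-style wrapping, on fixed-size lines, so inserting a
single byte shifts every later group. The Python sketch below, on made-up
sample data, compares overwriting one byte in place with inserting one byte
and counts how many lines of the encoded text change. changed_lines is a
helper invented for this example.

```python
import base64


def changed_lines(a: bytes, b: bytes) -> int:
    """Count line positions at which two Base64 texts differ."""
    la, lb = a.splitlines(), b.splitlines()
    return sum(x != y for x, y in zip(la, lb)) + abs(len(la) - len(lb))


# 570 bytes of sample data: encodebytes() wraps every 57 input bytes,
# so this encodes to 10 lines of 76 characters.
data = bytes((i * 37 + 11) % 256 for i in range(570))

patched = data[:100] + b"\x00" + data[101:]   # overwrite one byte in place
shifted = data[:100] + b"\x00" + data[100:]   # insert one byte

original_text = base64.encodebytes(data)
print("lines changed, overwrite:", changed_lines(original_text, base64.encodebytes(patched)))
print("lines changed, insertion:", changed_lines(original_text, base64.encodebytes(shifted)))
```

With an in-place overwrite the line-based diff stays small; with an
insertion, every line from the insertion point onward shows up as changed,
even though only one byte of the underlying binary is new.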

Andy
-- 
TAI64 timestamp: 40005859adef




Re: [fossil-users] Question about the file formats.

2016-12-20 Thread John Found
I am not talking about the fossil heuristics for detecting which file is
binary and which file is text. Imagine all detection is switched off.

By "fossil algorithms" I mean the two that are, in my understanding, the most
important in what is called "version control": the diff algorithm and the
3-way merge algorithm.

For example, I can't see what the problem is in making a diff of binary files.
As a result, one will have the bytes that have to be inserted into or deleted
from the first file in order to turn it into the second. (Or I am wrong, and
that is why I ask such vague questions.)

Or what keeps the 3-way merge algorithm from working on binary files? The line
organization of the text files? Something else?

What if I design some text file format (containing only ASCII characters) and
it can't be properly processed by fossil?

Another example: Every binary file can be BASE64 encoded, and it will be turned
into a valid text file. Fossil will not detect it as binary. But will this file
be processed properly on diffs and merges? Probably not. But why?



-- 
http://fresh.flatassembler.net
http://asm32.info
John Found 


Re: [fossil-users] Question about the file formats.

2016-12-20 Thread Warren Young
On Dec 20, 2016, at 11:48 AM, John Found  wrote:
> 
> I know that fossil (and most other version control systems) can properly
> handle only text source files.

Says who?

There are some features of Fossil that simply don’t work when given a binary 
file, like “fossil diff,” but if you think this is a missing feature (or even a 
bug!) I’d have to ask how you think it should work?

Consider the case of a PNG.  How would you expect “fossil diff” to show the 
difference between two PNGs?

Now multiply by the number of other binary file formats.

It is also the case that checking in compressed binary files is generally a 
mistake, since that will largely defeat the built-in diffing and compression 
mechanisms in Fossil, bloating the repository on every checkin.

(For some use cases, you can now avoid this problem with the new unversioned 
files feature.)

Both of those classes of problem aside, Fossil will certainly accept “binary” 
files. 

> What makes the binary files different from the text files? The presence or
> absence of 0 bytes does not seem to make a serious difference for processing
> by the same algorithms.

Fossil uses a heuristic to decide if a given file is “binary” or not, and it 
has more to do with the chance that it will display properly when served to a 
web browser than anything else.

Because it is a heuristic, it is possible to trick it.  For example, a file
with very long text lines may be misdetected as “binary”, because the
heuristic runs out of buffer space looking for the first line terminator.
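For a feel of what such a heuristic looks like, here is a deliberately
simplified Python sketch. It is not Fossil's implementation; the function
name looks_binary and the 8192-byte line limit are assumptions chosen only to
show how a long line can push a plain-text file into the "binary" bucket.

```python
def looks_binary(data: bytes, max_line_len: int = 8192) -> bool:
    """Guess whether data is 'binary'. Illustrative only, not Fossil's code."""
    if b"\x00" in data:
        return True                       # NUL bytes rarely occur in text
    # A line longer than the (assumed) limit is treated as binary too,
    # even if every byte in it is printable ASCII.
    return any(len(line) > max_line_len for line in data.split(b"\n"))


print(looks_binary(b"hello\nworld\n"))          # ordinary text
print(looks_binary(b"\x89PNG\r\n\x1a\n\x00"))   # contains a NUL byte
print(looks_binary(b"x" * 10000))               # one very long "text" line
```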

> What properties does a file format need in order to be processed properly by
> fossil?

Give a specific use case.  The answer differs depending on what Fossil commands 
you want to be able to use on the files you check in.

I gave the “diff” case above, but that is not the only command that changes 
behavior depending on whether the binary file heuristic decides that the file 
is “binary.”

I’m putting “binary” in quotes because it is not a clear-cut distinction.  For 
Fossil’s purposes, an uncompressed TIFF is “less binary” than a PNG file, 
because it is possible to do useful levels of delta compression on the TIFF but 
not on the PNG.

> Is it enough for a file to contain only utf-8 characters, or are some other
> properties mandatory as well?

If you want to know the heuristic’s current implementation details, study 
looks_like_utf8() in src/lookslike.c.

(There is also a UTF-16 version of that function, typically needed on Windows.)

> Is it possible to define a binary file format that would be properly
> processed by fossil (of course, after removing the explicit binary file
> checks)?
>
> Or the opposite question: Is it possible to compose a text file that would
> not be processed properly by fossil algorithms?

Both questions should be answered by a study of that heuristic function.

If you have further questions, make your questions more specific.  Your current 
questions are so vague that I can give the answer “Yes” to both, and be 
correct.  Not useful, I realize, but correct. :)