On 8. 6. 25 20:55, Nathan Hartman wrote:
On Tue, May 20, 2025 at 2:07 PM Branko Čibej <br...@apache.org> wrote:

    On 18. 5. 25 21:48, Branko Čibej wrote:
    XML has the unenviable distinction of being *both* almost
    unreadable for humans *and* very finicky to parse for machines.

    There's one other nasty problem with XML: it can't represent every
    character. There's a test for that, xml_unsafe_author2() in
    prop_tests.py and discussion at

    https://issues.apache.org/jira/browse/SVN-4415

    but the really painful part is that our command-line client is quite
    happy to produce invalid XML. Yeah, the /expected output/ in that
    test case is invalid XML, heh. I've been thinking about how to
    solve this; we can't use &#/xx/; character entities, we can't use
    <![CDATA[...]]> sections – both are transparent to invalid XML
    chars. Of course I'm talking about our XML output here; we could
    base64- or quoted-printable-encode values that are not valid XML,
    and we wouldn't be breaking any existing use cases.

    Well, that's for command-line output. An XML patch format has
    similar issues. Any patch format does, but XML is especially nasty
    in that respect.

    I created SVN-4919 to track this in the client and to annotate the
    test.

    -- Brane
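
The base64 fallback suggested in the quoted message above can be sketched in Python as follows. This is only an illustration under stated assumptions, not what the Subversion client does: the function name and the "none"/"base64" labels are hypothetical, and the character class is the Char production from XML 1.0.

```python
import base64
import re

# Anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NOT_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def encode_xml_value(value):
    """Return a (label, text) pair for a property value (hypothetical helper).

    Values made only of valid XML characters pass through unchanged
    (they still need the usual &, < escaping when written out).
    Anything else is base64-encoded so the output stays well-formed.
    """
    if _NOT_XML_CHAR.search(value) is None:
        return ("none", value)
    raw = value.encode("utf-8", "surrogatepass")
    return ("base64", base64.b64encode(raw).decode("ascii"))


if __name__ == "__main__":
    print(encode_xml_value("jrandom"))
    print(encode_xml_value("foo\x01bar"))  # control char U+0001 is not valid XML
```

A consumer would have to check the label before decoding, which is why such a change only works if it is added alongside the existing output rather than replacing it.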



I know we've been discussing an XML-based format for xpatch, including the pros & cons of being XML-based...

And then I came across this:

[1] https://diffx.org/

This page proposes enhancing the unidiff format in a backwards- and forwards-compatible way while remaining human readable; it proposes calling the format Extensible Diff, or DiffX.

I have only skimmed the site and have not done a thorough analysis, but I think this is interesting enough to at least look through and consider.

I'll give it a more careful reading a bit later and will organize my thoughts about it; for now, I just wanted to point out that this exists.

Thoughts/feedback?


Looks good at first glance but I detect a certain failure of imagination from the authors. Because if the format is extensible, but the extensions aren't standardised and codified, then we're back to where we are now: with 17 different, almost-but-not-quite compatible diff formats. For example, they carry on about character encodings, but spend not one word on newlines. Or normalization forms. Or any of the other 100 ways the "same" character encoding may send you gibbering over a cliff.

Yeah, the .diff extension, when the standard for at least 40 years has been .patch. Guess what? These people don't have a clue. No, really, I mean it.

Mutability. Sooooo ... unidiffs aren't mutable? That's a selling point?

Their example about the "encoding" attribute is wrong. It says:

#..preamble:  encoding=utf-32, length=217

and then goes on to say:

length (integer – required):

    The length of the section’s content in bytes.


Please show me a valid utf-32 string that's 217 bytes long.
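
To spell out the arithmetic: UTF-32 uses exactly 4 bytes per code point, so any valid UTF-32 byte length is a multiple of 4, and 217 = 54 * 4 + 1. A minimal Python check, assuming "length" counts only the encoded bytes of the section content (the helper names are mine, not from the DiffX spec):

```python
def utf32_byte_length(text):
    """Byte length of text in UTF-32 (little-endian, no BOM)."""
    return len(text.encode("utf-32-le"))


def is_possible_utf32_length(length):
    """A UTF-32 payload is always a whole number of 4-byte code units."""
    return length % 4 == 0


if __name__ == "__main__":
    for sample in ("", "a", "diff", "Čibej", "\U0001F600"):
        assert is_possible_utf32_length(utf32_byte_length(sample))
    print(217 % 4)                        # remainder 1
    print(is_possible_utf32_length(217))  # False
```

Adding a byte-order mark does not help, since the BOM is itself 4 bytes.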


Line endings ... oh, yes, they're mentioned in the spec. Except that there's no provision for mixed line endings, which we have to deal with far too often.
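
For illustration, this is roughly what "mixed line endings" means in practice: a single file containing more than one terminator style, so that no single per-file value can describe it. A small Python sketch (the function names are hypothetical, and this is not Subversion's detection code):

```python
import re

# Try CRLF first so that "\r\n" is not counted as a CR plus an LF.
_EOL = re.compile(rb"\r\n|\n|\r")
_NAMES = {b"\r\n": "CRLF", b"\n": "LF", b"\r": "CR"}


def count_line_endings(data):
    """Count each line-terminator style found in a bytes object."""
    counts = {"CRLF": 0, "LF": 0, "CR": 0}
    for match in _EOL.finditer(data):
        counts[_NAMES[match.group()]] += 1
    return counts


def has_mixed_line_endings(data):
    """True if more than one terminator style appears in the data."""
    return sum(1 for n in count_line_endings(data).values() if n) > 1


if __name__ == "__main__":
    sample = b"first\r\nsecond\nthird\r\n"
    print(count_line_endings(sample))
    print(has_mixed_line_endings(sample))
```

A patch format that wants to round-trip such a file has to preserve the terminator of every line, not just record one style for the whole file.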


DiffX files have no default encoding.

Oh cool. But your spec assumes the encoding is a superset of ASCII. The spec doesn't support EBCDIC or other different encodings. I guess, these days, that's sort of manageable. But they don't even mention anything that's not compatible with ASCII, and call it "universal".
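
A quick way to see the ASCII assumption: a parser that scans for the header bytes as ASCII only finds them if the file's encoding maps those characters to the same bytes. A Python sketch using the "#..preamble:" marker from the example above; cp037 is one of Python's standard EBCDIC codecs, and the helper name is hypothetical:

```python
def is_ascii_compatible(codec, text="#..preamble:"):
    """True if codec encodes the (ASCII-only) text to the same bytes as ASCII."""
    return text.encode(codec) == text.encode("ascii")


if __name__ == "__main__":
    for codec in ("ascii", "utf-8", "latin-1", "cp037", "utf-16-le"):
        print(codec, is_ascii_compatible(codec))
```

UTF-8 and Latin-1 pass; EBCDIC (cp037) and UTF-16 do not, so a header written in those encodings would be invisible to a byte-oriented ASCII scanner.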


I'm rambling. But, basically, this proposal is as much of a mess as any other. They don't even give a formal syntax that parsers could follow, just a bunch of examples and hand-waving. Yet another wannabe spec that doesn't start with a testable theory of changes -- a diff algebra if you like, with all the various mutations and edge cases -- and dives straight into "let's take unidiff and tweak it a bit". I guess the other way is a lot of work and sounds too much like maths. They don't even consider how to represent something that can be 3-way merged, let alone 4-way. Tree mutations? What are those? Etc. ad nauseam.

TL;DR: It's well-meaning crap, which is the worst kind.

-- Brane
