On 8. 6. 25 20:55, Nathan Hartman wrote:
On Tue, May 20, 2025 at 2:07 PM Branko Čibej <br...@apache.org> wrote:
On 18. 5. 25 21:48, Branko Čibej wrote:
XML has the unenviable distinction of being *both* almost
unreadable for humans *and* very finicky to parse for machines.
There's one other nasty problem with XML: it can't represent every
character. There's a test for that, xml_unsafe_author2() in
prop_tests.py and discussion at
https://issues.apache.org/jira/browse/SVN-4415
but the really painful part is that our command-line client is quite
happy to produce invalid XML. Yeah, the /expected output/ in that
test case is invalid XML, heh. I've been thinking about how to
solve this; we can't use &#/xx/; character entities, we can't use
<![CDATA[...]]> sections – both are transparent to invalid XML
chars. Of course I'm talking about our XML output here; we could
base64- or quoted-printable-encode values that are not valid XML,
and we wouldn't be breaking any existing use cases.
Well, that's for command-line output. An XML patch format has
similar issues. Any patch format does, but XML is especially nasty
in that respect.
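
To make the base64 idea above concrete, here is a minimal sketch in Python. It assumes the XML 1.0 "Char" production as the definition of "valid", and encode_for_xml() is a hypothetical helper, not anything in our client. How the encoding would be flagged in the output (an attribute, say) is left open.

```python
import base64
from xml.sax.saxutils import escape


def is_xml10_char(ch):
    """True if ch is allowed by the XML 1.0 'Char' production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def encode_for_xml(value):
    """Return (encoding, text) for a value (hypothetical helper).

    'plain' values only need the usual markup escaping; anything that
    contains a character XML cannot represent is base64-encoded from
    its UTF-8 bytes.
    """
    if all(is_xml10_char(ch) for ch in value):
        return ("plain", escape(value))
    raw = value.encode("utf-8", "surrogatepass")
    return ("base64", base64.b64encode(raw).decode("ascii"))
```

Values that are already valid XML come out unchanged apart from escaping, which is why existing consumers would not break; only values that currently yield invalid XML would change form.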
I created SVN-4919 to track this in the client and to annotate the
test.
-- Brane
I know we've been discussing an XML-based format for xpatch, including
the pros & cons of being XML-based...
And then I came across this:
[1] https://diffx.org/
This is a page that proposes enhancing the unidiff format in a
backwards- and forwards-compatible way while remaining human readable;
it proposes calling the format Extensible Diff or DiffX.
I have only skimmed the site and have not done a thorough analysis,
but I think this is interesting enough to at least look through and
consider.
I'll give it a more careful reading a bit later and will organize my
thoughts about it; for now, I just wanted to point out that this exists.
Thoughts/feedback?
Looks good at first glance but I detect a certain failure of imagination
from the authors. Because if the format is extensible, but the
extensions aren't standardised and codified, then we're back to where we
are now: with 17 different, almost-but-not-quite compatible diff
formats. For example, they carry on about character encodings, but spend
not one word on newlines. Or normalization forms. Or any of the other
100 ways the "same" character encoding may send you gibbering over a cliff.
Yeah, the .diff extension, when the standard for at least 40 years has
been .patch. Guess what? These people don't have a clue. No, really, I
mean it.
Mutability. Sooooo ... unidiffs aren't mutable? That's a selling point?
Their example about the "encoding" attribute is wrong. It says:
#..preamble: encoding=utf-32, length=217
and then goes on to say:
|length| (integer – /required/):
The length of the section's content in bytes.
Please show me a valid utf-32 string that's 217 bytes long.
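
The arithmetic, as a small Python sketch: UTF-32 uses exactly four bytes per code point (and four more for a BOM, if present), so any valid encoded length is a multiple of 4. The helper names are made up for illustration.

```python
def utf32_byte_length(text, with_bom=False):
    """Length in bytes of text encoded as UTF-32 (illustrative helper)."""
    # Python's "utf-32" codec prepends a 4-byte BOM; "utf-32-le" does not.
    codec = "utf-32" if with_bom else "utf-32-le"
    return len(text.encode(codec))


def could_be_utf32_length(n):
    """A UTF-32 byte count is always a multiple of 4, BOM or no BOM."""
    return n % 4 == 0
```

Since 217 = 4 * 54 + 1, could_be_utf32_length(217) is False.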
Line endings ... oh, yes, they're mentioned in the spec. Except that
there's no provision for mixed line endings, which we have to deal with
far too often.
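
For reference, spotting mixed line endings is not hard; a rough Python sketch follows. It assumes an ASCII-compatible byte stream (it would miscount UTF-16 or UTF-32 data), and eol_styles() is just an illustrative name.

```python
import re
from collections import Counter

# CRLF must come first in the alternation so it is not split into CR + LF.
_EOL = re.compile(rb"\r\n|\r|\n")
_NAMES = {b"\r\n": "CRLF", b"\n": "LF", b"\r": "CR"}


def eol_styles(data):
    """Count each line-ending style found in a bytes object."""
    return Counter(_NAMES[m.group()] for m in _EOL.finditer(data))


def has_mixed_eols(data):
    """True if more than one line-ending style occurs in data."""
    return len(eol_styles(data)) > 1
```

A format that declares a single line-ending style per file or section has nowhere to record what eol_styles() finds in such a file.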
DiffX files have no default encoding.
Oh cool. But your spec assumes the encoding is a superset of ASCII. The
spec doesn't support EBCDIC or other different encodings. I guess, these
days, that's sort of manageable. But they don't even mention anything
that's not compatible with ASCII, and call it "universal".
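
A quick Python illustration of that assumption. It supposes a parser finds section headers by looking for their ASCII bytes, which is my reading of the examples and not something the spec states formally; cp037 is one of the EBCDIC code pages Python ships with.

```python
MARKER = "#..preamble:"  # header text taken from the DiffX example above


def starts_with_ascii_marker(encoded):
    """True if the byte stream begins with the marker's ASCII bytes."""
    return encoded.startswith(MARKER.encode("ascii"))


def marker_survives(codec):
    """Encode the marker with codec and check a byte-level match."""
    return starts_with_ascii_marker(MARKER.encode(codec))
```

marker_survives("utf-8") and marker_survives("latin-1") are True, while marker_survives("cp037") and marker_survives("utf-16-le") are False, so a byte-oriented parser would not even find the header in those encodings.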
I'm rambling. But, basically, this proposal is as much of a mess as any
other. They don't even give a formal syntax that parsers could follow,
just a bunch of examples and hand-waving. Yet another wannabe spec that
doesn't start with a testable theory of changes -- a diff algebra if you
like, with all the various mutations and edge cases -- and dives
straight into "let's take unidiff and tweak it a bit". I guess the other
way is a lot of work and sounds too much like maths. They don't even
consider how to represent something that can be 3-way merged, let alone
4-way. Tree mutations? What are those? Etc. ad nauseam.
TL;DR: It's well-meaning crap, which is the worst kind.
-- Brane