I noticed that your list doesn't include "add a DOM equality operator". That seems potentially simpler to implement than canonical XML serialization, and like a useful thing to have in any case. Would it make sense as an option?
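To make the idea concrete, here is a rough sketch of what such an equality check might look like for xml.etree.ElementTree. The names `elements_equal` and `_norm` are hypothetical, not an existing API, and treating whitespace-only text as insignificant is an assumption that a real design would need to settle. It also makes no attempt to define equality for a full DOM such as xml.dom.minidom.

```python
import xml.etree.ElementTree as ET


def _norm(text):
    # Assumption: missing text and whitespace-only text compare equal;
    # any other text must match exactly.
    return text if text and text.strip() else ""


def elements_equal(a, b):
    """Hypothetical helper: structural equality for two ElementTree elements."""
    if a.tag != b.tag:
        return False
    # attrib is a dict, so this comparison ignores attribute order.
    if a.attrib != b.attrib:
        return False
    if _norm(a.text) != _norm(b.text) or _norm(a.tail) != _norm(b.tail):
        return False
    if len(a) != len(b):
        return False
    # Child order is significant in XML, so compare children pairwise.
    return all(elements_equal(x, y) for x, y in zip(a, b))


first = ET.fromstring('<root b="2" a="1"><child>text</child></root>')
second = ET.fromstring('<root a="1" b="2"><child>text</child></root>')
third = ET.fromstring('<root a="1" b="3"><child>text</child></root>')

# Same content, different attribute order in the source text.
same = elements_equal(first, second)
# One attribute value differs.
different = elements_equal(first, third)
# Roundtrip in the style of pickle.loads(pickle.dumps(data)) == data.
roundtrip = elements_equal(ET.fromstring(ET.tostring(first)), first)
```

With something like this, a test could assert on a parse/serialize roundtrip instead of comparing serialized bytes, so attribute order in the output would stop mattering to the test.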
On Mon, Mar 18, 2019, 15:46 Raymond Hettinger <raymond.hettin...@gmail.com> wrote:

> We're having a super interesting discussion on
> https://bugs.python.org/issue34160 . It is now marked as a release
> blocker and warrants a broader discussion.
>
> Our problem is that at least two distinct and important users have written
> tests that depend on exact byte-by-byte comparisons of the final
> serialization. So any changes to the XML modules will break those tests
> (not the applications themselves, just the test cases that assume the
> output will be forever, byte-by-byte identical).
>
> In theory, the tests are incorrectly designed and should not treat the
> module output as a canonical normal form. In practice, doing an equality
> test on the output is the simplest, most obvious approach, and likely is
> being done in other packages we don't know about yet.
>
> With pickle, json, and __repr__, the usual way to write a test is to
> verify a roundtrip: assert pickle.loads(pickle.dumps(data)) == data. With
> XML, the problem is that the DOM doesn't have an equality operator. The
> user is left with either testing specific fragments with
> element.find(xpath) or with using a standards-compliant canonicalization
> package (not available from us). Neither option is pleasant.
>
> The code in the current 3.8 alpha differs from 3.7 in that it removes
> attribute sorting and instead preserves the order the user specified when
> creating an element. As far as I can tell, there is no objection to this
> as a feature. The problem is what to do about the existing tests in
> third-party code, what guarantees we want to make going forward, and what
> we recommend as a best practice for testing XML generation.
>
> Things we can do:
>
> 1) Revert back to the 3.7 behavior. This, of course, makes all the tests
> pass :-) The downside is that it perpetuates the practice of bytewise
> equality tests and locks in all implementation quirks forever. I don't
> know of anyone advocating this option, but it is the simplest thing to do.
>
> 2) Go into every XML module and add attribute sorting options to each
> function that generates XML. This gives users a way to make their tests
> pass for now. There are several downsides. a) It grows the API in a way
> that is inconsistent with all the other XML packages I've seen. b) We'll
> have to test, maintain, and document the API forever -- the API is already
> large and time-consuming to teach. c) It perpetuates the notion that
> bytewise equality tests are the right thing to do, so we'll have this
> problem again if we substitute in another code generator or alter any of
> the other implementation quirks (e.g. how CDATA sections are serialized).
>
> 3) Add a standards-compliant canonicalization tool (see
> https://en.wikipedia.org/wiki/Canonical_XML ). This is likely to be the
> right-way-to-do-it but takes time and energy.
>
> 4) Fix the tests in the third-party modules to be more focused on their
> actual test objectives, the semantics of the generated XML rather than the
> exact serialization. This option would seem like the right-thing-to-do but
> it isn't trivial because the entire premise of the existing test is
> invalid. For every case, we'll actually have to think through what the
> test objective really is.
>
> Of these, option 2 is my least preferred. Ideally, we don't guarantee
> bytewise identical output across releases, and ideally we don't grow a new
> API that perpetuates the issue. That said, I'm not wedded to any of these
> options and just want us to do what is best for the users in the long run.
>
> Regardless of option chosen, we should make explicit whether or not the
> Python standard library modules guarantee cross-release bytewise identical
> output for XML. That is really the core issue here. Had we had an explicit
> notice one way or the other, there wouldn't be an issue now.
>
> Any thoughts?
>
>
>
> Raymond Hettinger
>
>
> P.S. Stefan Behnel is planning to remove attribute sorting from lxml.
> On the bug tracker, he has clearly articulated his reasons.
>
>
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/njs%40pobox.com