Proposal for future development of Xerces-C 4.0.0

Roger Leigh Tue, 16 Jun 2020 11:21:39 -0700

Dear all,

To follow up to the suggestion Boris made in the discussion on https://issues.apache.org/jira/browse/XERCESC-2204, I would like to outline a proposal for the development and release of a version 4.0.0 of Xerces-C. Being a new major version, this would allow co-installation with the 3.x library.

Some of the suggested changes already have issues as part of the 3.3.0 release (https://issues.apache.org/jira/projects/XERCESC/versions/12346666).

In using Xerces-C++ for the past 10 years, I have encountered quite a few compatibility, usability and performance problems. Most of these can be worked around to some degree, but improving the situation would be desirable. One of the issues I encountered was difficulty in building on modern platforms, Windows in particular, which was the impetus for developing the CMake build now incorporated officially in the Xerces-C 3.2.x releases.

One problem is the use of UTF-16 character strings in our APIs. They are deeply unfriendly to use with typical C++ code, and they also impose a performance penalty when the input to Xerces API calls needs transcoding. These are too deeply entrenched to consider removing at this point, but C++11 brings native char16_t and char32_t character types which could alleviate some of the problems. Depending upon the platform and compiler, Xerces-C currently supports various types as XMLCh, including unsigned short, uint16_t, wchar_t and char16_t. With C++11, it is possible to use UTF-16 string literals directly with u"string", and use these transparently with the Xerces-C API. While this is possible with the 3.2.x releases if you build with char16_t support, an application developer cannot rely on char16_t support being available in any given Xerces-C build. By making XMLCh char16_t, we become directly interoperable with the current C++ language types, library features and so on. It means any application developer can use UTF-16 character literals and string literals freely.
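As a rough illustration of what this would look like from application code, here is a minimal sketch. It assumes XMLCh is unconditionally char16_t; the alias is declared locally, and countUnits and hasPrefix are hypothetical helpers standing in for any API that takes XMLCh strings. None of this is taken from the Xerces-C headers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption for this sketch: XMLCh is always char16_t, as proposed.
// The alias is local to the example, not copied from Xerces-C.
using XMLCh = char16_t;

// Hypothetical stand-in for an API taking a null-terminated XMLCh string:
// counts UTF-16 code units up to (not including) the terminator.
std::size_t countUnits(const XMLCh* s) {
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}

// UTF-16 string and character literals can be used directly,
// with no transcoding call at the call site.
static const XMLCh kElementName[] = u"chapter";
static const XMLCh kColon = u':';

// std::u16string shares the same character type, so standard
// library string operations work on XMLCh data unchanged.
bool hasPrefix(const std::u16string& qname) {
    return qname.find(kColon) != std::u16string::npos;
}

// Small usage example.
void usageExample() {
    assert(countUnits(kElementName) == 7);
    assert(hasPrefix(u"xs:element"));
}
```

The point of the sketch is only that literals, std::u16string and XMLCh pointers all share one character type, so no casts or conversions appear anywhere.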

An additional benefit of C++11 is the guaranteed availability of standard-sized integer types.

The change in PR#21 (https://github.com/apache/xerces-c/pull/21/files#diff-6fc894653e06e51bde4bbba985b1b340R126) makes the basic primitive types use C++11 integer types and character types. It remains API compatible with previous Xerces-C releases.
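To give a feel for the kind of definitions this implies, here is a small sketch. The alias names follow the style of the existing Xerces-C primitive typedefs, but the list is illustrative only and is not a copy of what PR#21 contains. isSupplementary is a hypothetical helper added purely to show the types in use.

```cpp
#include <cstdint>

// Illustrative aliases in the style of the Xerces-C primitive types.
// Assumption: this is a sketch, not the actual content of PR#21.
using XMLByte   = std::uint8_t;
using XMLCh     = char16_t;
using XMLInt16  = std::int16_t;
using XMLUInt16 = std::uint16_t;
using XMLInt32  = std::int32_t;
using XMLUInt32 = std::uint32_t;
using XMLInt64  = std::int64_t;
using XMLUInt64 = std::uint64_t;

// With <cstdint> the sizes are guaranteed by the standard rather than
// discovered by configure-time checks; these asserts document that.
static_assert(sizeof(XMLUInt16) == 2, "XMLUInt16 must be 16 bits");
static_assert(sizeof(XMLInt32) == 4, "XMLInt32 must be 32 bits");
static_assert(sizeof(XMLUInt64) == 8, "XMLUInt64 must be 64 bits");
static_assert(sizeof(XMLCh) == sizeof(XMLUInt16), "XMLCh holds one UTF-16 code unit");

// Hypothetical helper: true if the code point lies outside the Basic
// Multilingual Plane and therefore needs a surrogate pair in UTF-16.
constexpr bool isSupplementary(XMLUInt32 codePoint) {
    return codePoint > 0xFFFF && codePoint <= 0x10FFFF;
}
```

Since the aliases keep the existing names, code written against the old typedefs would continue to compile, which is what keeping API compatibility amounts to here.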

With a switch of XMLCh to char16_t, this would enable direct use of Unicode string literals. I have developed a complete switchover here (https://github.com/rleigh-codelibre/xerces-c/compare/xerces-XERCESC-2208_Use_cstdint...rleigh-codelibre:XERCESC-2206_unicode_literals?expand=1#diff-5feef1625e289192e9a2d9b2c1a1308bR147) but am still proofreading and reviewing the result before I submit it. This is a complete replacement of most uses of XMLUniDefs constants in the source tree with C++11 literals. You can see the result is vastly more readable and maintainable, and you'll also see in the commit messages that I uncovered three separate bugs in the existing strings while doing the conversion, which I'll fix separately on the 3.2 branch after I'm sure there are no other bugs there. The code of applications using Xerces-C becomes similarly more readable. This one is a bit big and intimidating, but almost all the changes were made with search and replace using sed or other tools.
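For anyone who hasn't looked at the branch, the flavour of the change is roughly as follows. This is a sketch rather than an excerpt: the ch* constants below are local stand-ins written in the style of XMLUniDefs, and oldStyle, newStyle and sameString are names made up for the example.

```cpp
#include <cassert>
#include <string>

using XMLCh = char16_t;  // assumption: the proposed definition

// Local stand-ins in the style of the XMLUniDefs character constants.
const XMLCh chNull    = 0x00;
const XMLCh chLatin_l = 0x6C;
const XMLCh chLatin_m = 0x6D;
const XMLCh chLatin_n = 0x6E;
const XMLCh chLatin_s = 0x73;
const XMLCh chLatin_x = 0x78;

// Before: the string is spelled out one code unit at a time, which is
// hard to read and easy to get subtly wrong.
const XMLCh oldStyle[] = {
    chLatin_x, chLatin_m, chLatin_l, chLatin_n, chLatin_s, chNull
};

// After: the same data as a C++11 UTF-16 literal.
const XMLCh newStyle[] = u"xmlns";

// Hypothetical helper: compares two null-terminated XMLCh strings.
bool sameString(const XMLCh* a, const XMLCh* b) {
    return std::u16string(a) == std::u16string(b);
}

// Small usage example: both spellings produce identical data.
void usageExample() {
    assert(sameString(oldStyle, newStyle));
    assert(sizeof(oldStyle) == sizeof(newStyle));
}
```

A typo in the array form still compiles and only shows up at runtime, which is the sort of mistake that becomes obvious once the string is written as a literal.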

On the interoperability side, Xerces has never really integrated well with the standard library. Whether you want to catch exceptions or use strings and streams, these all require extra effort. I'd like to refactor some of the classes to make them easier to use from applications built on the standard library. This includes:

* having exceptions derive from std::runtime_error; existing types can remain compatible with wide strings

* support use of streams, including stringstreams, directly with Xerces, e.g. InputSource, perhaps as a set of adaptors

* where the C++ language and standard replace functionality in Xerces, it would be worth considering replacement where there is a benefit; language thread support might come under this category
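To make the first two points a little more concrete, here is a minimal sketch of the shape I have in mind. Everything in it is hypothetical: XMLError is not an existing Xerces-C class, and readAll is just one possible building block for a stream adaptor (the resulting byte buffer could then be handed to a memory-based input source). It does not use any Xerces-C API.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical exception type: catchable as std::exception via
// std::runtime_error, while still carrying the UTF-16 message so that
// existing wide-string accessors could keep working.
class XMLError : public std::runtime_error {
public:
    XMLError(const std::string& narrowMessage, const std::u16string& wideMessage)
        : std::runtime_error(narrowMessage), wide_(wideMessage) {}

    // UTF-16 form of the message, for callers using XMLCh strings.
    const std::u16string& getMessage() const noexcept { return wide_; }

private:
    std::u16string wide_;
};

// Hypothetical adaptor helper: drains any std::istream (file stream,
// stringstream, ...) into a byte buffer suitable for a parser to consume.
std::vector<unsigned char> readAll(std::istream& in) {
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(in),
                                      std::istreambuf_iterator<char>());
}

// Small usage example: generic handlers now see the parser's errors.
bool caughtAsStdException() {
    try {
        throw XMLError("unexpected end of input", u"unexpected end of input");
    } catch (const std::exception& e) {
        return std::string(e.what()) == "unexpected end of input";
    }
    return false;
}
```

Reading the whole stream up front is the simplest possible adaptor; a real one would more likely pull data incrementally as the parser asks for it.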

On the maintainability side, I'd like to reduce the number of configuration options to keep testing and support within reason. Adopting C++11 removes a lot of complexity and configuration variants. In addition:

* We have three message loaders and three sets of translations for en_US, but no other translations. Some or all of them might be worth dropping, given how little they provide in practice. Does anyone actually use the translation functionality or have any message catalogues other than en_US?

* We have several network accessors. But with the modern push for using HTTPS everywhere, should Xerces be providing its own or should we simply require CURL or platform-specific functionality?

* Building with a modern compiler or using a modern IDE flags up tens of thousands of warnings. Some can be fixed by using features like "mutable" which previously were not universally available. It would be worth running the codebase through clang-format and clang-tidy to clean it up stepwise. It could well fix quite a few bugs and help improve performance and correctness.
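As an example of the "mutable" case in the last point, here is a sketch of a lazily cached value in a const accessor. The QName class below is invented for the illustration and is not the Xerces-C class of that name; the pattern is what matters. Without "mutable", code like this typically casts away const on "this", which is one source of warnings.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical class for illustration only.
class QName {
public:
    QName(std::u16string prefix, std::u16string localPart)
        : prefix_(std::move(prefix)), localPart_(std::move(localPart)) {}

    // Logically const: the observable value never changes, so the
    // accessor stays const and the cache members are marked mutable.
    const std::u16string& rawName() const {
        if (!cached_) {
            if (prefix_.empty()) {
                rawName_ = localPart_;
            } else {
                rawName_ = prefix_ + u":" + localPart_;
            }
            cached_ = true;
        }
        return rawName_;
    }

private:
    std::u16string prefix_;
    std::u16string localPart_;
    // Cache state, modifiable from const member functions.
    // Note: this sketch is not thread-safe.
    mutable std::u16string rawName_;
    mutable bool cached_ = false;
};

// Small usage example.
void usageExample() {
    const QName name(u"xs", u"element");
    assert(name.rawName() == u"xs:element");
}
```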

Finally, I should note that while the above might look quite disruptive, I'm not suggesting any sort of API breakage at this point. It may be the case that there are others who would like to propose such changes, or existing issues which can only be resolved with a breaking change.



Kind regards,

Roger
