Dear all,

To follow up on the suggestion Boris made in the discussion on https://issues.apache.org/jira/browse/XERCESC-2204, I would like to outline a proposal for the development and release of version 4.0.0 of Xerces-C.  Because this would be a new major version, it could be co-installed with the 3.x library.

Some of the suggested changes already have issues filed as part of the 3.3.0 release (https://issues.apache.org/jira/projects/XERCESC/versions/12346666).


In using Xerces-C++ for the past 10 years, I have encountered quite a few compatibility, usability and performance problems.  Most of these can be worked around to some degree, but improving the situation would be desirable.  One of the issues I encountered was difficulty in building on modern platforms, Windows in particular, which was the impetus for developing the CMake build now incorporated officially in the Xerces-C 3.2.x releases.


One problem is the use of UTF-16 character strings in our APIs.  They are deeply unfriendly to use with typical C++ code, and they also impose a performance penalty when the input to Xerces API calls needs transcoding.  These are too deeply entrenched to consider removing at this point, but C++11 brings native char16_t and char32_t character types which could alleviate some of the problems.  Depending upon the platform and compiler, Xerces-C currently supports various types as XMLCh, including unsigned short, uint16_t, wchar_t and char16_t.  With C++11, it is possible to use UTF-16 string literals directly with u"string", and use these transparently with the Xerces-C API.  While this is possible with the 3.2.x releases if you build with char16_t support, an application developer can't rely on char16_t support being available in any given Xerces-C build.  By making XMLCh char16_t, we become directly interoperable with the current C++ language types, library features and so on.  It means any application developer can use UTF-16 character literals and string literals freely.
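To illustrate the point about literals, here is a minimal sketch which assumes XMLCh is defined as char16_t.  It does not include any Xerces-C headers; xmlStringLength is a hypothetical stand-in for any API which takes a null-terminated XMLCh string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: XMLCh is char16_t, as proposed for 4.0.0.  In 3.2.x this
// depends on how the library was configured.
using XMLCh = char16_t;

// Hypothetical stand-in for any API taking a null-terminated XMLCh string.
std::size_t xmlStringLength(const XMLCh* s)
{
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != u'\0')
        ++n;
    return n;
}

// A UTF-16 literal is passed directly; no transcoding step is involved.
std::size_t elementNameLength()
{
    return xmlStringLength(u"element");
}

// std::u16string interoperates as well, through c_str().
std::size_t lengthOf(const std::u16string& s)
{
    return xmlStringLength(s.c_str());
}
```

With wchar_t or unsigned short as XMLCh, neither of the last two functions would compile portably, which is the motivation for fixing the type.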

An additional benefit of C++11 is the guaranteed availability of standard sized integer types.

The change in PR#21 (https://github.com/apache/xerces-c/pull/21/files#diff-6fc894653e06e51bde4bbba985b1b340R126) makes the basic primitive types use C++11 integer types and character types.  It remains API compatible with previous Xerces-C releases.
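As a rough sketch of the idea (the mapping below is my own illustration under assumed names, not a copy of the definitions in the PR), the primitive types can be expressed with <cstdint> and checked at compile time:

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>
#include <type_traits>

namespace sketch {

// Assumed mapping, for illustration only; see the PR for the real one.
using XMLByte   = std::uint8_t;
using XMLCh     = char16_t;
using XMLUInt16 = std::uint16_t;
using XMLInt32  = std::int32_t;
using XMLUInt32 = std::uint32_t;
using XMLInt64  = std::int64_t;
using XMLUInt64 = std::uint64_t;

// Width of a type in bits.
template <typename T>
constexpr std::size_t bitsOf()
{
    return sizeof(T) * CHAR_BIT;
}

// The widths no longer need to be probed by the build system.
static_assert(bitsOf<XMLUInt16>() == 16, "XMLUInt16 must be 16 bits");
static_assert(bitsOf<XMLInt32>() == 32, "XMLInt32 must be 32 bits");
static_assert(bitsOf<XMLUInt64>() == 64, "XMLUInt64 must be 64 bits");

// char16_t is a distinct unsigned type with the same size as uint_least16_t.
static_assert(std::is_unsigned<XMLCh>::value, "XMLCh must be unsigned");
static_assert(sizeof(XMLCh) == sizeof(std::uint_least16_t),
              "XMLCh must match uint_least16_t in size");

} // namespace sketch
```

The exact-width types are formally optional in the standard, but they are available on every platform I would expect Xerces-C to target.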

Switching XMLCh to char16_t would enable direct use of Unicode string literals.  I have developed a complete switchover here (https://github.com/rleigh-codelibre/xerces-c/compare/xerces-XERCESC-2208_Use_cstdint...rleigh-codelibre:XERCESC-2206_unicode_literals?expand=1#diff-5feef1625e289192e9a2d9b2c1a1308bR147) but am still proofreading and reviewing the result before I submit it.  This replaces most uses of XMLUniDefs constants in the source tree with C++11 literals.  You can see the result is vastly more readable and maintainable, and you'll also see in the commit messages that I uncovered three separate bugs in the existing strings while doing the conversion, which I'll fix separately on the 3.2 branch once I'm sure there are no other bugs there.  The code of applications using Xerces-C becomes similarly more readable.  This one is a bit big and intimidating, but almost all the changes were made with search and replace using sed or other tools.
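To give a flavour of the difference, here is a small before-and-after sketch.  It is not taken from the branch; the constants are local stand-ins which follow the naming style of XMLUniDefs.hpp, and XMLCh is assumed to be char16_t.

```cpp
#include <string>

namespace sketch {

using XMLCh = char16_t;  // assumption: the proposed 4.0.0 definition

// Local stand-ins in the style of the XMLUniDefs.hpp constants.
const XMLCh chNull    = 0x00;
const XMLCh chLatin_l = 0x6C;
const XMLCh chLatin_m = 0x6D;
const XMLCh chLatin_n = 0x6E;
const XMLCh chLatin_s = 0x73;
const XMLCh chLatin_x = 0x78;

// Before: the string is spelled out one code unit at a time, and a
// wrong or missing constant is easy to overlook in review.
const XMLCh oldStyle[] =
{
    chLatin_x, chLatin_m, chLatin_l, chLatin_n, chLatin_s, chNull
};

// After: the same string as a C++11 UTF-16 literal.
const XMLCh newStyle[] = u"xmlns";

// Compare two null-terminated XMLCh strings.
bool sameString(const XMLCh* a, const XMLCh* b)
{
    return std::u16string(a) == std::u16string(b);
}

} // namespace sketch
```

Both arrays hold identical code units, so the change is purely one of spelling in the source.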


On the interoperability side, Xerces has never really integrated well with the standard library.  Catching exceptions and using strings and streams all require extra effort.  I'd like to refactor some of the classes to ease use with applications using the standard library.  This includes:

* having exceptions derive from std::runtime_error; existing types can remain compatible with wide strings

* support use of streams, including stringstreams, directly with Xerces e.g. InputSource, perhaps as a set of adaptors

* where the C++ language and standard replace functionality in Xerces, it would be worth considering replacement where there is a benefit; language thread support might come under this category


On the maintainability side, I'd like to reduce the number of configuration options to keep testing and support within reason.  Adopting C++11 removes a lot of complexity and configuration variants.  In addition:

* We have three message loaders and three sets of translations for en_US, but no other translations.  Some or all of them might be worth thinking about dropping given the complete lack of utility these provide.  Does anyone actually use the translation functionality or have any message catalogues other than en_US?

* We have several network accessors.  But with the modern push for using HTTPS everywhere, should Xerces be providing its own or should we simply require CURL or platform-specific functionality?

* Building with a modern compiler or using a modern IDE flags up tens of thousands of warnings.  Some can be fixed by using features like "mutable" which previously were not universally available.  It would be worth running the codebase through clang-format and clang-tidy to clean it up stepwise.  It could well fix quite a few bugs and help improve performance and correctness.


Finally, I should note that while the above might look quite disruptive, I'm not suggesting any sort of API breakage at this point.  It may be the case that there are others who would like to propose such changes, or existing issues which can only be resolved with a breaking change.


Kind regards,

Roger
