Dear all,
To follow up on the suggestion Boris made in the discussion on
https://issues.apache.org/jira/browse/XERCESC-2204, I would like to
outline a proposal for the development and release of a version 4.0.0 of
Xerces-C. Being a new major version, this would allow co-installation
with the 3.x library.
Some of the suggested changes already have issues as part of the 3.3.0
release (https://issues.apache.org/jira/projects/XERCESC/versions/12346666).
In using Xerces-C++ for the past 10 years, I have encountered quite a
few compatibility, usability and performance problems. Most of these
can be worked around to some degree, but improving the situation would
be desirable. One of the issues I encountered was difficulty in
building on modern platforms, Windows in particular, which was the
impetus for developing the CMake build now incorporated officially in
the Xerces-C 3.2.x releases.
One problem is the use of UTF-16 character strings in our APIs. They
are deeply unfriendly to use with typical C++ code, and they also impose
a performance penalty when the input to Xerces API calls needs
transcoding. These are too deeply entrenched to consider removing at
this point, but C++11 brings native char16_t and char32_t character
types which could alleviate some of the problems. Depending upon the
platform and compiler, Xerces-C currently supports various types as
XMLCh, including unsigned short, uint16_t, wchar_t and char16_t. With
C++11, it is possible to use UTF-16 string literals directly with
u"string", and use these transparently with the Xerces-C API. While
this is possible with the 3.2.x releases if you build with char16_t
support, as an application developer the availability of char16_t
support in Xerces-C can't be guaranteed. By making XMLCh char16_t, we
become directly interoperable with the current C++ language types,
library features and so on. It means any application developer can use
UTF-16 character literals and string literals freely.
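As a rough sketch of what this enables, assuming XMLCh is an alias for char16_t: the alias and the xmlLength helper below are stand-ins written for this example so that it compiles without the library, not Xerces-C declarations.

```cpp
#include <cstddef>
#include <string>

// Stand-in for the proposed definition; in practice this would come from
// the Xerces-C headers rather than from application code.
using XMLCh = char16_t;

// Hypothetical helper playing the role of any API taking a const XMLCh*.
// It counts UTF-16 code units up to the terminating null.
std::size_t xmlLength(const XMLCh* s)
{
    std::size_t n = 0;
    while (s[n] != u'\0')
        ++n;
    return n;
}

// A UTF-16 string literal is directly usable as a const XMLCh*, with no
// transcoding step and no hand-built character arrays.
const XMLCh* const elementName = u"document";

// std::u16string is std::basic_string<char16_t>, so it interoperates too.
const std::u16string attrName = u"xml:lang";
```

With the 3.2.x releases the same code only compiles when the library happens to have been configured with char16_t as XMLCh, which is the portability problem described above.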
An additional benefit of C++11 is the guaranteed availability of
standard sized integer types.
The change in PR#21
(https://github.com/apache/xerces-c/pull/21/files#diff-6fc894653e06e51bde4bbba985b1b340R126)
makes the basic primitive types use C++11 integer types and character
types. It remains API compatible with previous Xerces-C releases.
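To illustrate the idea (this is not the content of the PR itself), the primitive types could be expressed roughly as follows. The alias names mirror those used by Xerces-C, but the definitions here are assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Illustrative aliases only; the real definitions live in the Xerces-C
// configuration headers and may differ in detail.
using XMLByte   = unsigned char;
using XMLCh     = char16_t;
using XMLInt16  = std::int16_t;
using XMLUInt16 = std::uint16_t;
using XMLInt32  = std::int32_t;
using XMLUInt32 = std::uint32_t;
using XMLInt64  = std::int64_t;
using XMLUInt64 = std::uint64_t;
using XMLSize_t = std::size_t;

// With <cstdint> the widths follow from the type names, so no per-platform
// configure checks are needed. The byte counts below assume 8-bit bytes.
static_assert(sizeof(XMLInt16) == 2 && sizeof(XMLUInt16) == 2, "16-bit types");
static_assert(sizeof(XMLInt32) == 4 && sizeof(XMLUInt32) == 4, "32-bit types");
static_assert(sizeof(XMLInt64) == 8 && sizeof(XMLUInt64) == 8, "64-bit types");

// char16_t is a distinct type with the same size as std::uint_least16_t.
static_assert(std::is_same<XMLCh, char16_t>::value, "XMLCh is char16_t");
static_assert(sizeof(XMLCh) == sizeof(std::uint_least16_t), "XMLCh size");
```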
Switching XMLCh to char16_t would enable direct use of
Unicode string literals. I have developed a complete switchover here
(https://github.com/rleigh-codelibre/xerces-c/compare/xerces-XERCESC-2208_Use_cstdint...rleigh-codelibre:XERCESC-2206_unicode_literals?expand=1#diff-5feef1625e289192e9a2d9b2c1a1308bR147)
but am still proofreading and reviewing the result before I submit
it. This is a complete replacement of most use of XMLUniDefs constants
in the source tree with C++11 literals. You can see the result is vastly
more readable and maintainable, and you'll also see in the commit
messages that I uncovered three separate bugs in the existing strings
while doing the conversion, which I'll fix separately on the 3.2 branch
after I'm sure there are no other bugs there. The code of applications
using Xerces-C becomes similarly more readable. This one is a bit big
and intimidating, but almost all the changes were made with search and
replace with sed or other tools.
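For a flavour of the kind of change involved, here is a small before/after sketch. The constants follow the XMLUniDefs naming convention but are redefined locally so the example stands alone; the string and the sameString helper are made up for illustration and are not taken from the branch.

```cpp
#include <cstddef>

using XMLCh = char16_t;  // assumed, as proposed above

// Local stand-ins for a few XMLUniDefs-style constants.
const XMLCh chNull    = 0x00;
const XMLCh chLatin_l = 0x6C;
const XMLCh chLatin_m = 0x6D;
const XMLCh chLatin_n = 0x6E;
const XMLCh chLatin_s = 0x73;
const XMLCh chLatin_x = 0x78;

// Before: the string is spelled out one code unit at a time, which is
// where typos are easy to make and hard to spot in review.
const XMLCh oldStyle[] = {
    chLatin_x, chLatin_m, chLatin_l, chLatin_n, chLatin_s, chNull
};

// After: the same data written as a C++11 UTF-16 literal.
const XMLCh newStyle[] = u"xmlns";

// Hypothetical helper comparing two null-terminated XMLCh strings.
bool sameString(const XMLCh* a, const XMLCh* b)
{
    std::size_t i = 0;
    while (a[i] != chNull && a[i] == b[i])
        ++i;
    return a[i] == b[i];
}
```

Both arrays hold identical code units, so a conversion of this kind does not change what the library sees at runtime.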
On the interoperability side, Xerces has never really integrated well
with the standard library. Whether you want to catch exceptions or use
strings and streams, these all require extra effort. I'd like to
refactor some of the classes to ease use with applications using the
standard library. This includes:
* having exceptions derive from std::runtime_error; existing types can
remain compatible with wide strings
* supporting the use of streams, including stringstreams, directly with Xerces
e.g. InputSource, perhaps as a set of adaptors
* where the C++ language and standard replace functionality in Xerces,
it would be worth considering replacement where there is a benefit;
language thread support might come under this category
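To make the first point concrete, here is one possible shape for an exception that derives from std::runtime_error while still carrying its UTF-16 message. The XmlError class and the toUtf8 helper are hypothetical names invented for this sketch; nothing here is existing Xerces-C API.

```cpp
#include <cstddef>
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: UTF-16 to UTF-8. Well-formed surrogate pairs are
// combined; an unpaired surrogate is passed through as a three-byte
// sequence, which a real implementation would want to reject or replace.
std::string toUtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical exception type: what() returns a narrow (UTF-8) message for
// standard library consumers, getMessage() keeps the wide string available.
class XmlError : public std::runtime_error
{
public:
    explicit XmlError(std::u16string message)
        : std::runtime_error(toUtf8(message)),
          message_(std::move(message))
    {}

    const std::u16string& getMessage() const noexcept { return message_; }

private:
    std::u16string message_;
};

// Generic code can now catch the error without knowing about Xerces types.
std::string describeFailure()
{
    try {
        throw XmlError(u"unexpected end of input");
    } catch (const std::exception& e) {
        return e.what();
    }
}
```

The point of the design is that a single catch (const std::exception&) at the top of an application would also report parser errors, while code that wants the original UTF-16 message can still catch the more specific type.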
On the maintainability side, I'd like to reduce the number of
configuration options to keep testing and support within reason.
Adopting C++11 removes a lot of complexity and configuration variants.
In addition:
* We have three message loaders and three sets of translations for
en_US, but no other translations. Some or all of them might be worth
thinking about dropping given the complete lack of utility these
provide. Does anyone actually use the translation functionality or have
any message catalogues other than en_US?
* We have several network accessors. But with the modern push for using
HTTPS everywhere, should Xerces be providing its own or should we simply
require CURL or platform-specific functionality?
* Building with a modern compiler or using a modern IDE flags up tens of
thousands of warnings. Some can be fixed by using features like
"mutable" which previously were not universally available. It would be
worth running the codebase through clang-format and clang-tidy to clean
it up stepwise. It could well fix quite a few bugs and help improve
performance and correctness.
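As an example of the "mutable" case, a lazily computed value in a const member function can be declared mutable instead of being written through a cast that discards const. The CachedName class below is invented for illustration and is not taken from the Xerces-C sources; it only shows the general pattern.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical class with a lazily computed, cached hash.
class CachedName
{
public:
    explicit CachedName(std::u16string value) : value_(std::move(value)) {}

    // Logically const: the observable state does not change, only the
    // cache is filled in on first use. Note this is not thread-safe as
    // written; concurrent callers would need synchronisation.
    std::size_t hash() const
    {
        if (!hashValid_) {
            std::size_t h = 0;
            for (char16_t c : value_)
                h = h * 31 + c;
            hash_ = h;          // allowed because hash_ is mutable
            hashValid_ = true;
        }
        return hash_;
    }

private:
    std::u16string value_;
    mutable std::size_t hash_ = 0;
    mutable bool hashValid_ = false;
};
```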
Finally, I should note that while the above might look quite disruptive,
I'm not suggesting any sort of API breakage at this point. It may be
the case that there are others who would like to propose such changes,
or existing issues which can only be resolved with a breaking change.
Kind regards,
Roger