Hi Francesco,
To clarify a few points in your message: Xerces-C++ has always been
effectively UTF-16-only. Previously the 16-bit type used to represent
XMLCh was one of several possible types, including char16_t, wchar_t,
uint16_t, unsigned short or unsigned int or possibly others, depending
upon your platform. The change I proposed for 4.0.0 was to switch
to C++11/14 and use char16_t, i.e. to standardise upon the standard type
for a UTF-16 code unit. It doesn't change any assumption about UTF-16
usage internally: those assumptions were already in place. The purpose
of the change is as Scott stated: it's to improve interoperability and
allow for the use of UTF-16 character and string literals, and to remove
platform-specific variation in favour of a single type that's consistent
across platforms.
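As a rough illustration of the literal and string-type point, here is a
minimal sketch. It does not include any Xerces-C++ header: the XMLCh
alias is declared locally as a stand-in for the library's typedef
(assuming the char16_t configuration), and xmlch_length and make_qname
are hypothetical helpers invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Local stand-in for the library typedef (assumption: XMLCh is char16_t).
using XMLCh = char16_t;

// With XMLCh == char16_t, a u"" literal is already an array of XMLCh,
// so no per-character array initialiser or run-time transcoding is needed.
static const XMLCh kElementName[] = u"chapter";

// std::u16string is std::basic_string<char16_t>, so it stores XMLCh directly.
static_assert(std::is_same<std::u16string::value_type, XMLCh>::value,
              "std::u16string holds XMLCh when XMLCh is char16_t");

// Hypothetical helper: length in UTF-16 code units (not code points).
inline std::size_t xmlch_length(const XMLCh* s) {
    return std::char_traits<XMLCh>::length(s);
}

// Hypothetical helper: build "prefix:local" using only standard string types.
inline std::u16string make_qname(const std::u16string& prefix,
                                 const XMLCh* local) {
    return prefix + u":" + local;
}
```

With a platform-dependent 16-bit type such as unsigned short, neither
the u"" literal nor std::u16string would be usable without a cast or a
copy, which is the variation the single type removes.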
Personally, I would absolutely prefer for Xerces-C++ to use UTF-8 in its
external interfaces and its internal representation. Like most people,
I have input in UTF-8, output in UTF-8, and all of the parameters I want
to pass into Xerces-C++ like element and attribute names, text content
etc. are UTF-8. All of this needs transcoding to and from UTF-16. The
transcoding has accounted for a large fraction (~50%) of CPU utilisation
when I've profiled it in the past, and it makes using Xerces-C++ unnecessarily
painful. But Xerces-C++ is a product of its time. Back then, before
widespread UTF-8 adoption, it was likely seen as forward-looking. ICU
and other libraries of the same era are all using UTF-16, or
Unicode/UCS-2 as it was then.
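To make the transcoding step concrete, the following is a simplified
sketch of what converting between UTF-8 and UTF-16 involves for every
string crossing the API boundary. It is not the Xerces-C++ transcoder:
utf8_to_utf16 and utf16_to_utf8 are hypothetical helpers written for
this example, and the validation is deliberately minimal (overlong
forms and encoded surrogates are not rejected).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    out.reserve(in.size());  // UTF-16 never needs more units than UTF-8 bytes
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {  // outside the BMP: emit a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Hypothetical helper: encode UTF-16 code units as UTF-8 bytes.
inline std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    auto put = [&out](std::uint32_t byte) {
        out.push_back(static_cast<char>(static_cast<unsigned char>(byte)));
    };
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate: needs a low one
            if (i + 1 >= in.size())
                throw std::invalid_argument("unpaired high surrogate");
            const std::uint32_t lo = in[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        if (cp < 0x80) {
            put(cp);
        } else if (cp < 0x800) {
            put(0xC0 | (cp >> 6));
            put(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            put(0xE0 | (cp >> 12));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        } else {
            put(0xF0 | (cp >> 18));
            put(0x80 | ((cp >> 12) & 0x3F));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Every name and text value passed in, and every string read back out,
goes through a loop of this kind plus an allocation, which is where the
per-call cost comes from when the application's data is UTF-8
throughout.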
However, such a change would be massively breaking. I don't have the
time or resources to do the work, and even if I did it would be hard to
justify such a breaking change unless it could be introduced without
breaking compatibility with the UTF-16 interfaces. If such a change
could be made compatibly, I would be in favour of it, but I doubt I
could spare the required time and effort to do it myself. I no longer
get paid to work on Xerces-C++-related projects, and I have a full-time
job to do which is of much greater priority. What time I can contribute
through open source projects that use Xerces-C++ is limited, so I need
to make sure that the work I do is tightly focussed and
realistic in its objectives. The above char16_t work is an example of
such a change. It takes great care to avoid a compatibility break (you
can already opt into it with Xerces-C++ 3.2.x).
If you would like to investigate the changes which would be needed to
change the internal representation from UTF-16 to UTF-8 and/or
supplement the external interfaces with UTF-8 alternatives to the UTF-16
interfaces we have at present, I'm sure we would all be very interested
to hear your proposals. As a long-term project goal, I think it would
be beneficial. For myself, the question isn't whether the change is
desirable, it's whether it's realistic and achievable while not breaking
all the existing projects which have invested time and money into using
Xerces-C++. Xerces-C++ has a long history at this point, and breaking
changes are not something which I think any of us would countenance.
(The current interfaces do expose some internal details; maybe
hiding/changing some of them might be justifiable; but certainly not the
core user-facing APIs.)
Kind regards,
Roger
On 16/07/2020 14:15, Francesco Pretto wrote:
Migrating XMLCh to char16_t basically means setting in stone, forever,
that xerces-c is a UTF-16-only library. That goes in a radically
different direction from the one I was suggesting, so I'm not very
happy to hear about it. I think it can safely be said that this move
actually closes more doors than it opens. While simplifying the code
base and test grid is great, as I'm reading in [1], I would have
chosen to drop support for wchar_t and int16_t but keep XMLCh for
the future possibility of supporting UTF-8 as the internal encoding. Are
you really sure you want to pursue this direction?
[1] https://issues.apache.org/jira/browse/XERCESC-2206
On Thu, 16 Jul 2020 at 14:22, Cantor, Scott <canto...@osu.edu> wrote:
On 7/16/20, 8:07 AM, "Francesco Pretto" <cez...@gmail.com> wrote:
Thank you, and thank you for your frankness! Of the two, UTF-8 for the
internal encoding would probably be more in line with the C++
modernization changes, as you said, but it would probably be a big
change touching the whole code base.
It's massive. What Roger is doing is making XMLCh work as char16_t. That
eliminates a lot of problems with literals and STL integration, but it doesn't
change the fact that virtually every other C or older C++ library still won't
integrate well.
-- Scott
---------------------------------------------------------------------
To unsubscribe, e-mail: c-dev-unsubscr...@xerces.apache.org
For additional commands, e-mail: c-dev-h...@xerces.apache.org