Hi Francesco,

To clarify a few points in your message: Xerces-C++ has always been effectively UTF-16-only.  Previously the 16-bit type used to represent XMLCh was one of several possible types, including char16_t, wchar_t, uint16_t, unsigned short or unsigned int or possibly others, depending upon your platform.  The change I proposed for 4.0.0 was to switch to C++11/14 and use char16_t, i.e. standardising upon the standard type for a UTF-16 code unit.  It doesn't change any assumption about UTF-16 usage internally: those assumptions were already in place.  The purpose of the change is as Scott stated: to improve interoperability, to allow the use of UTF-16 character and string literals, and to remove platform-specific variation in favour of a single type that's consistent across platforms.
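
To illustrate the literal support mentioned above, here is a minimal sketch in standard C++11.  It assumes XMLCh is an alias for char16_t; the alias is declared locally here for illustration, whereas in real code it would come from the Xerces-C++ headers.  The helper codeUnitLength is hypothetical and not part of the Xerces-C++ API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption for this sketch: XMLCh is char16_t.
// In real code the type is provided by the Xerces-C++ headers.
using XMLCh = char16_t;

// char16_t is guaranteed to be at least 16 bits wide on every platform.
static_assert(sizeof(XMLCh) * 8 >= 16, "XMLCh must hold a UTF-16 code unit");

// Hypothetical helper: length of a NUL-terminated XMLCh string,
// counted in UTF-16 code units (not characters).
std::size_t codeUnitLength(const XMLCh* s) {
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}

// With XMLCh == char16_t, u"..." literals can be used directly,
// with no transcoding step and no platform-specific casts.
const XMLCh* const kElementName = u"chapter";

// std::u16string can also hold the same data; a character outside the
// Basic Multilingual Plane occupies two code units (a surrogate pair).
const std::u16string kTitle = u"na\u00EFve \U0001F600";
```

With one of the other historical 16-bit types, the same u"..." literal would not convert implicitly to a pointer of that type, which is the interoperability problem the change is meant to remove.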


Personally, I would absolutely prefer for Xerces-C++ to use UTF-8 in its external interfaces and its internal representation.  Like most people, I have input in UTF-8, output in UTF-8, and all of the parameters I want to pass into Xerces-C++, like element and attribute names, text content etc., are UTF-8.  All of this needs transcoding to and from UTF-16.  When I've profiled it in the past, this transcoding accounted for a very significant fraction (~50%) of CPU utilisation, and it makes using Xerces-C++ unnecessarily painful.  But Xerces-C++ is a product of its time.  Back then, before widespread UTF-8 adoption, UTF-16 was likely seen as forward-looking.  ICU and other libraries of the same era all use UTF-16, or Unicode/UCS-2 as it was then.
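
To make the transcoding cost concrete, here is a rough sketch of the per-byte work needed to turn UTF-8 input into UTF-16 code units.  This is not the Xerces-C++ transcoder; utf8ToUtf16 is a hypothetical stand-alone function written only for illustration.  It checks lead and continuation bytes but, for brevity, does not reject overlong encodings or encoded surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes and re-encode them as UTF-16.
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    out.reserve(in.size());  // never more code units than input bytes
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;  // decoded code point
        std::size_t len = 0;   // length of this UTF-8 sequence in bytes
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size()) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (b & 0x3Fu);
        }
        i += len;

        if (cp > 0x10FFFF) {
            throw std::invalid_argument("code point out of Unicode range");
        }
        if (cp < 0x10000) {
            // Basic Multilingual Plane: a single code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary planes: a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Every element name, attribute name and piece of text content passed in has to go through a loop of this kind, and the reverse conversion is needed on the way out, which is consistent with transcoding showing up prominently in profiles.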


However, such a change would be massively breaking.  I don't have the time or resources to do the work, and even if I did, it would be hard to justify unless it could be introduced without breaking compatibility with the UTF-16 interfaces.  If such a change could be made compatibly, I would be in favour of it, but I doubt I could spare the required time and effort to do it myself.  I no longer get paid to work on Xerces-C++-related projects, and I have a full-time job which is of much greater priority.  The time I can contribute through open source projects which use Xerces-C++ is limited, so I need to make sure that the work I do is tightly focussed and realistic in its objectives.  The above char16_t work is an example of such a change.  It takes great care to avoid a compatibility break (you can already opt into it with Xerces-C++ 3.2.x).


If you would like to investigate the changes which would be needed to change the internal representation from UTF-16 to UTF-8 and/or supplement the external interfaces with UTF-8 alternatives to the UTF-16 interfaces we have at present, I'm sure we would all be very interested to hear your proposals.  As a long-term project goal, I think it would be beneficial.  For myself, the question isn't whether the change is desirable, it's whether it's realistic and achievable while not breaking all the existing projects which have invested time and money into using Xerces-C++.  Xerces-C++ has a long history at this point, and breaking changes are not something which I think any of us would countenance.  (The current interfaces do expose some internal details; maybe hiding/changing some of them might be justifiable; but certainly not the core user-facing APIs.)


Kind regards,

Roger


On 16/07/2020 14:15, Francesco Pretto wrote:
Migrating XMLCh to char16_t basically means setting in stone,
forever, that xerces-c is a UTF-16-only library, so it's going in a
radically different direction from the one I was suggesting, and I'm not
very happy to hear about it. I think it can be safely stated that this move
actually closes more doors than it opens. While simplifying the code
base and test grid is great, as I'm reading in [1], I would have
chosen to drop support for wchar_t and int16_t but keep XMLCh for
the future possibility of supporting UTF-8 as the internal encoding. Are
you really sure you want to pursue this direction?

[1] https://issues.apache.org/jira/browse/XERCESC-2206



On Thu, 16 Jul 2020 at 14:22, Cantor, Scott <canto...@osu.edu> wrote:
On 7/16/20, 8:07 AM, "Francesco Pretto" <cez...@gmail.com> wrote:

    Thank you, and thank you for frankness! Probably of the two the utf-8
    for internal encoding would be more oriented towards c++ modernization
    changes, as you said, but probably a big change touching all the code
    base.
It's massive. What Roger is doing is making XMLCh work as char16_t. That 
eliminates a lot of problems with literals and STL integration, but it doesn’t 
change the fact that virtually every other C or older C++ library still won't 
integrate well.

-- Scott



---------------------------------------------------------------------
To unsubscribe, e-mail: c-dev-unsubscr...@xerces.apache.org
For additional commands, e-mail: c-dev-h...@xerces.apache.org
---------------------------------------------------------------------
