Re: New features for xerces-c 4.0.0?

Roger Leigh Thu, 16 Jul 2020 13:45:25 -0700

Hi Francesco,

To clarify a few points in your message: Xerces-C++ has always been effectively UTF-16-only. Previously the 16-bit type used to represent XMLCh was one of several possible types, including char16_t, wchar_t, uint16_t, unsigned short or unsigned int or possibly others, depending upon your platform. The change I proposed for 4.0.0 was to switch to C++11/14 and use char16_t, i.e. standardising upon the standard type for a UTF-16 code unit. It doesn't change any assumption about UTF-16 usage internally: those assumptions were already in place. The purpose of the change is as Scott stated: it's to improve interoperability and allow for the use of UTF-16 character and string literals, and to remove platform-specific variation in favour of a single type that's consistent across platforms.
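As a rough illustration of what a fixed char16_t type allows, here is a minimal sketch. It does not include any Xerces-C++ headers: the XMLCh alias below is a stand-in for the library's typedef, and kElementName, xmlch_length and to_u16string are hypothetical names invented for this example.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Assumption: stand-in for the library typedef. In a real build the
// definition comes from the Xerces-C++ headers, not from user code.
using XMLCh = char16_t;

static_assert(std::is_same<XMLCh, char16_t>::value,
              "this sketch assumes XMLCh is char16_t");

// A UTF-16 string literal can initialise XMLCh data directly,
// with no runtime transcoding call.
static const XMLCh kElementName[] = u"chapter";

// Length in UTF-16 code units of a null-terminated XMLCh string
// (hypothetical helper).
inline std::size_t xmlch_length(const XMLCh* s) {
    return std::char_traits<char16_t>::length(s);
}

// With XMLCh fixed as char16_t, std::u16string can hold the same
// data without casts (hypothetical helper).
inline std::u16string to_u16string(const XMLCh* s) {
    return std::u16string(s);
}
```

With a platform-dependent XMLCh (wchar_t on one system, unsigned short on another), neither the u"..." literal nor the std::u16string conversion would compile everywhere without casts.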

Personally, I would absolutely prefer for Xerces-C++ to use UTF-8 in its external interfaces and its internal representation. Like most people, I have input in UTF-8, output in UTF-8, and all of the parameters I want to pass into Xerces-C++ like element and attribute names, text content etc. are UTF-8. All of this needs transcoding to and from UTF-16. This accounted for a hugely significant fraction (~50%) of CPU utilisation when I've profiled it in the past, and it makes using Xerces-C++ unnecessarily painful. But Xerces-C++ is a product of its time. Back then, before widespread UTF-8 adoption, it was likely seen as forward-looking. ICU and other libraries of the same era are all using UTF-16, or Unicode/UCS-2 as it was then.
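To make the transcoding step concrete, here is a minimal sketch of the UTF-8 to UTF-16 conversion that every UTF-8 name or text value has to go through before it can be handed to a UTF-16 interface. Xerces-C++ ships its own transcoding facilities; this sketch deliberately does not use them, and utf8_to_utf16 is a hypothetical helper written only to show the per-byte work involved.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of Xerces-C++: decode a UTF-8 byte
// string into UTF-16 code units, throwing on malformed input.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    out.reserve(in.size());
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const std::uint32_t min_cp[] = {0x0, 0x80, 0x800, 0x10000};

    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        i += extra + 1;

        // Reject overlong forms, surrogates and out-of-range values.
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Every string crossing the API boundary pays for a loop like this (plus an allocation) on the way in, and the inverse on the way out, which is consistent with transcoding showing up prominently in a profile.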

However, such a change would be massively breaking. I don't have the time or resources to do the work, and even if I did it would be hard to justify such a breaking change unless it could be introduced without breaking compatibility with the UTF-16 interfaces. If such a change could be made compatibly, I would be in favour of it, but I doubt I could spare the required time and effort to do it myself. I no longer get paid to work on Xerces-C++-related projects, and I have a full-time job to do which is of much greater priority. What time I can contribute as part of Xerces-C++-using open source project work is limited and as such I need to make sure that the work I do is tightly focussed and realistic in its objectives. The above char16_t work is an example of such a change. It takes great care to avoid a compatibility break (you can already opt into it with Xerces-C++ 3.2.x).

If you would like to investigate the changes which would be needed to change the internal representation from UTF-16 to UTF-8 and/or supplement the external interfaces with UTF-8 alternatives to the UTF-16 interfaces we have at present, I'm sure we would all be very interested to hear your proposals. As a long-term project goal, I think it would be beneficial. For myself, the question isn't whether the change is desirable, it's whether it's realistic and achievable while not breaking all the existing projects which have invested time and money into using Xerces-C++. Xerces-C++ has a long history at this point, and breaking changes are not something which I think any of us would countenance. (The current interfaces do expose some internal details; maybe hiding/changing some of them might be justifiable; but certainly not the core user-facing APIs.)



Kind regards,

Roger


On 16/07/2020 14:15, Francesco Pretto wrote:

Migrating XMLCh to char16_t basically means setting in stone,
forever, that xerces-c is a UTF-16-only library, so it's going in a
radically different direction than I was suggesting, and I'm not very
happy to hear about it. I think it can be safely stated that this move
actually closes more doors than it opens. While simplifying the code
base and test grid is great, as I'm reading in [1], I would have
chosen to drop support for wchar_t and int16_t but keep XMLCh for the
future possibility of supporting UTF-8 for internal encoding. Are you
really sure you want to pursue this direction?

[1] https://issues.apache.org/jira/browse/XERCESC-2206



On Thu, 16 Jul 2020 at 14:22, Cantor, Scott <canto...@osu.edu> wrote:

On 7/16/20, 8:07 AM, "Francesco Pretto" <cez...@gmail.com> wrote:

    Thank you, and thank you for your frankness! Of the two, UTF-8 for
    internal encoding would probably be more oriented towards C++
    modernization changes, as you said, but it is probably a big change
    touching the whole code base.

It's massive. What Roger is doing is making XMLCh work as char16_t. That
eliminates a lot of problems with literals and STL integration, but it doesn’t
change the fact that virtually every other C or older C++ library still won't
integrate well.

-- Scott



---------------------------------------------------------------------
To unsubscribe, e-mail: c-dev-unsubscr...@xerces.apache.org
For additional commands, e-mail: c-dev-h...@xerces.apache.org

