[
https://issues.apache.org/jira/browse/LOGCXX-400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775006#comment-16775006
]
Thorsten Schöning commented on LOGCXX-400:
------------------------------------------
The current solution simply fixes the compile time errors using CASTs and [it
has been mentioned on the
list|http://mail-archives.apache.org/mod_mbox/logging-dev/201902.mbox/<CAPw4iV656uUBKZsFBwHr5OqT_QkhCHQNMkuv1aTbumfZWAUYuw%40mail.gmail.com>]
that this might not be the correct fix because of a loss of data and e.g.
"unsigned char" should be used to store UTF-8 instead. So I've created the new
branch "logcxx_400_unsigned_vs_cast" to see what happens and here's my
findings: log4cxx internally heavily relies on using "char" and "char*" instead
of "unsigned char" and "unsigned char*" for not only storing data itself, but
for conversions of in theory different types as well.
While log4cxx defines some abstract "logchar" and "LogString", they use "char"
and "wchar_t" in most cases, being managed by some "std::basic_string".
"std::basic_string<char>" and "std::basic_string<wchar_t>" are simply the basic
types of "std::string" and "std::wstring" and log4cxx assumes that at many
places in it's handling of different character sets. While there's some API
encoding from/decoding into abstract "LogString", at many places this
ultimately results in conversion routines implemented based on "std::string"
and "std::wstring" being used and simply assuming UTF-8 and UCS-4 because of
the special types. In case of an "unsigned char" instead of a "char", the type
is not compatible anymore with "std::string" and one needs to either CAST
things or implement character conversions for both individual char types (using
templates) or convert using some other approach. While I didn't test it, the
same most likely happens when using "unsigned wchar_t" as well.
Additionally there's e.g. the class "ByteBuffer" storing "char*" by default and
which is used to e.g. glue byte- and character-related APIs together, as well
as to interact with e.g. APR-related API. Changing to "unsigned char*" makes
many of those calls incompatible and additional CASTs necessary or in theory
one might even need to consider really converting data itself. But "ByteBuffer"
in many cases is only used as a read-only view to data actually managed
elsewhere, e.g. in some "std::string". If one would not CAST the changes and
really copy data instead, performance would suffer.
I didn't care much for backwards compatibility currently and instead focussed
on changing types to see what happens, but if one cares this most likely
results in additional methods taking "unsigned" version of types where
non-specified are currently used. But are these changes really necessary in the
end?
In my opinion they are not and that's why I will end work on this branch for
now. A "char" is capable of storing 8 bits of information in the contexts we
care about and the compiler error this bug is about is not because of loosing
data on a storage level, but because of possible misinterpretion regarding
signed and unsigned values. The values itself don't exceed the storage quotas
introduced by the type they are ultimately CASTed to, it's only that e.g. 0x80
means different numbers in case of unsigned vs. signed char. But no bit of
information is lost and that's what's important in the end as we only need to
store some combination of bits. "std::string" aka "std::basic_string<char>" is
perfectly fine to store UTF-8 data:
{quote}The char type can be used to store characters from the ASCII character
set or any of the ISO-8859 character sets, and individual bytes of multi-byte
characters such as Shift-JIS or the UTF-8 encoding of the Unicode character
set.{quote}
https://docs.microsoft.com/de-de/cpp/cpp/char-wchar-t-char16-t-char32-t?view=vs-2017
{quote}Unless your application is API-call-centric, e.g. mainly UI application,
the suggestion is to store Unicode strings in std::string and encoded in UTF-8,
performing conversion near API calls.{quote}
https://stackoverflow.com/a/1975414/2055163
{quote}So the answer above - on unix use always UTF8 (std::string, char), on
Windows UTF16 (std::wstring, wchar_t) is true.{quote}
https://stackoverflow.com/a/8513744/2055163
{quote}std::string utf8 = boost::locale::conv::utf_to_utf<char>(&point, &point
+ 1);{quote}
https://stackoverflow.com/a/11603682/2055163
{quote}std::basic_string doesn't deal with Unicode encodings. They certainly
can store UTF-encoded strings. But they can only think of them as sequences of
char, char16_t, or char32_t;[...]{quote}
https://stackoverflow.com/a/6797458/2055163
{quote}char = ANSI/MBCS or UTF-8{quote}
https://stackoverflow.com/a/48817457/2055163
While there are proposals to add some native UTF-8 type to C++ which is most
likely comparable to "unsigned char" the reasons for that is not that "char"
can't hold the data, but only that it might be signed on some and unsigned on
other platforms and one might need to convert in case of numerical calculations.
{quote}Whether char is a signed or unsigned type is implementation defined and
implementations that use an 8-bit signed char are at a disadvantage with
respect to working with UTF-8 encoded text due to the necessity of having to
rely on conversions to unsigned types in order to correctly process leading and
continuation code units of multi-byte encoded code points.{quote}
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0482r0.html
Those mentioned CASTs and numerical calculations of individual bytes can be
found in e.g. "Transcoder" of log4cxx already. And those might get easier in
future, but that doesn't mean that using "char" itself is broken regarding
UTF-8. I don't see a reason currently why defining the types a little bit
formal more correct in some platforms should be worth it dealing with all those
CASTs and/or data conversions not necessary before. Keep in mind that on some
platforms "char" is already "unsigned char" internally and such.
> C++11 does not allow char literals with highest bit set unless cast
> -------------------------------------------------------------------
>
> Key: LOGCXX-400
> URL: https://issues.apache.org/jira/browse/LOGCXX-400
> Project: Log4cxx
> Issue Type: Bug
> Affects Versions: 0.10.0
> Environment: XCode 4.5, clang set to C++11
> Reporter: Andrew Lazarus
> Assignee: Thorsten Schöning
> Priority: Major
> Fix For: 0.11.0
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> C++11 has tighter rules for widening of literals. In particular, 0xFF (for
> example) is an unsigned short unless cast otherwise.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)