[ 
https://issues.apache.org/jira/browse/LOGCXX-400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775006#comment-16775006
 ] 

Thorsten Schöning commented on LOGCXX-400:
------------------------------------------

The current solution simply fixes the compile time errors using CASTs and [it 
has been mentioned on the 
list|http://mail-archives.apache.org/mod_mbox/logging-dev/201902.mbox/<CAPw4iV656uUBKZsFBwHr5OqT_QkhCHQNMkuv1aTbumfZWAUYuw%40mail.gmail.com>]
 that this might not be the correct fix because of a loss of data and e.g. 
"unsigned char" should be used to store UTF-8 instead. So I've created the new 
branch "logcxx_400_unsigned_vs_cast" to see what happens and here's my 
findings: log4cxx internally heavily relies on using "char" and "char*" instead 
of "unsigned char" and "unsigned char*" for not only storing data itself, but 
for conversions of in theory different types as well.

While log4cxx defines some abstract "logchar" and "LogString", they use "char" 
and "wchar_t" in most cases, being managed by some "std::basic_string". 
"std::basic_string<char>" and "std::basic_string<wchar_t>" are simply the basic 
types of "std::string" and "std::wstring" and log4cxx assumes that at many 
places in it's handling of different character sets. While there's some API 
encoding from/decoding into abstract "LogString", at many places this 
ultimately results in conversion routines implemented based on "std::string" 
and "std::wstring" being used and simply assuming UTF-8 and UCS-4 because of 
the special types. In case of an "unsigned char" instead of a "char", the type 
is not compatible anymore with "std::string" and one needs to either CAST 
things or implement character conversions for both individual char types (using 
templates) or convert using some other approach. While I didn't test it, the 
same most likely happens when using "unsigned wchar_t" as well.

Additionally there's e.g. the class "ByteBuffer" storing "char*" by default and 
which is used to e.g. glue byte- and character-related APIs together, as well 
as to interact with e.g. APR-related API. Changing to "unsigned char*" makes 
many of those calls incompatible and additional CASTs necessary or in theory 
one might even need to consider really converting data itself. But "ByteBuffer" 
in many cases is only used as a read-only view to data actually managed 
elsewhere, e.g. in some "std::string". If one would not CAST the changes and 
really copy data instead, performance would suffer.

I didn't care much for backwards compatibility currently and instead focussed 
on changing types to see what happens, but if one cares this most likely 
results in additional methods taking "unsigned" version of types where 
non-specified are currently used. But are these changes really necessary in the 
end?

In my opinion they are not and that's why I will end work on this branch for 
now. A "char" is capable of storing 8 bits of information in the contexts we 
care about and the compiler error this bug is about is not because of loosing 
data on a storage level, but because of possible misinterpretion regarding 
signed and unsigned values. The values itself don't exceed the storage quotas 
introduced by the type they are ultimately CASTed to, it's only that e.g. 0x80 
means different numbers in case of unsigned vs. signed char. But no bit of 
information is lost and that's what's important in the end as we only need to 
store some combination of bits. "std::string" aka "std::basic_string<char>" is 
perfectly fine to store UTF-8 data:

{quote}The char type can be used to store characters from the ASCII character 
set or any of the ISO-8859 character sets, and individual bytes of multi-byte 
characters such as Shift-JIS or the UTF-8 encoding of the Unicode character 
set.{quote}

https://docs.microsoft.com/de-de/cpp/cpp/char-wchar-t-char16-t-char32-t?view=vs-2017

{quote}Unless your application is API-call-centric, e.g. mainly UI application, 
the suggestion is to store Unicode strings in std::string and encoded in UTF-8, 
performing conversion near API calls.{quote}

https://stackoverflow.com/a/1975414/2055163

{quote}So the answer above - on unix use always UTF8 (std::string, char), on 
Windows UTF16 (std::wstring, wchar_t) is true.{quote}

https://stackoverflow.com/a/8513744/2055163

{quote}std::string utf8 = boost::locale::conv::utf_to_utf<char>(&point, &point 
+ 1);{quote}

https://stackoverflow.com/a/11603682/2055163

{quote}std::basic_string doesn't deal with Unicode encodings. They certainly 
can store UTF-encoded strings. But they can only think of them as sequences of 
char, char16_t, or char32_t;[...]{quote}

https://stackoverflow.com/a/6797458/2055163

{quote}char = ANSI/MBCS or UTF-8{quote}

https://stackoverflow.com/a/48817457/2055163

While there are proposals to add some native UTF-8 type to C++ which is most 
likely comparable to "unsigned char" the reasons for that is not that "char" 
can't hold the data, but only that it might be signed on some and unsigned on 
other platforms and one might need to convert in case of numerical calculations.

{quote}Whether char is a signed or unsigned type is implementation defined and 
implementations that use an 8-bit signed char are at a disadvantage with 
respect to working with UTF-8 encoded text due to the necessity of having to 
rely on conversions to unsigned types in order to correctly process leading and 
continuation code units of multi-byte encoded code points.{quote}

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0482r0.html

Those mentioned CASTs and numerical calculations of individual bytes can be 
found in e.g. "Transcoder" of log4cxx already. And those might get easier in 
future, but that doesn't mean that using "char" itself is broken regarding 
UTF-8. I don't see a reason currently why defining the types a little bit 
formal more correct in some platforms should be worth it dealing with all those 
CASTs and/or data conversions not necessary before. Keep in mind that on some 
platforms "char" is already "unsigned char" internally and such.

> C++11 does not allow char literals with highest bit set unless cast
> -------------------------------------------------------------------
>
>                 Key: LOGCXX-400
>                 URL: https://issues.apache.org/jira/browse/LOGCXX-400
>             Project: Log4cxx
>          Issue Type: Bug
>    Affects Versions: 0.10.0
>         Environment: XCode 4.5, clang set to C++11
>            Reporter: Andrew Lazarus
>            Assignee: Thorsten Schöning
>            Priority: Major
>             Fix For: 0.11.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> C++11 has tighter rules for widening of literals. In particular, 0xFF (for 
> example) is an unsigned short unless cast otherwise.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to