[ https://issues.apache.org/jira/browse/AVRO-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Keh-Li Sheng updated AVRO-1190:
-------------------------------
Description:
The parser in JsonIO.cc does not decode multibyte Unicode characters into any
valid character encoding for a std::string in C++. The following snippet from
JsonParser::tryString() has several flaws:
1. sv is a std::string used as a vector, where each unit is a single char
2. a single Unicode hex quad encoded in JSON represents a 16-bit value, which
does not fit in a char
3. a Unicode hex quad can be a "high surrogate", meaning that it must be
combined with the following quad (the low surrogate) to derive the full
Unicode code point
4. \U is not a valid Unicode escape in JSON; only \u is (see
http://www.ietf.org/rfc/rfc4627.txt)
{code:title=JsonIO.cc}
case 'u':
case 'U':
    {
        unsigned int n = 0;
        char e[4];
        in_.readBytes(reinterpret_cast<uint8_t*>(e), 4);
        for (int i = 0; i < 4; i++) {
            n *= 16;
            char c = e[i];
            if (isdigit(c)) {
                n += c - '0';
            } else if (c >= 'a' && c <= 'f') {
                n += c - 'a' + 10;
            } else if (c >= 'A' && c <= 'F') {
                n += c - 'A' + 10;
            } else {
                throw unexpected(c);
            }
        }
        sv.push_back(n);
    }
{code}
This loop decodes the quad into a temporary unsigned int and then simply
pushes that value (which may be up to 16 bits wide) onto the std::string, where
it is narrowed to a single char. In effect, the JSON parser does not properly
decode any escaped Unicode character that needs more than one byte.
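The narrowing described above can be reproduced in isolation. The following is
only a minimal sketch: pushQuads is a hypothetical helper, not a function in
JsonIO.cc, and the explicit cast stands in for the implicit conversion that
sv.push_back(n) performs in the snippet.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the end of the loop in the snippet: each
// already-decoded hex quad is pushed onto a std::string as one char.
std::string pushQuads(const unsigned int* quads, std::size_t count) {
    std::string sv;
    for (std::size_t i = 0; i < count; ++i) {
        // Same narrowing as sv.push_back(n): on typical platforms only the
        // low-order 8 bits of the quad survive.
        sv.push_back(static_cast<char>(quads[i]));
    }
    return sv;
}
```

Called with the quads D83C, DF83, D83D and DC7B, this should return four bytes
(3C 83 3D 7B on typical platforms), one per quad, with the high-order byte of
each quad lost.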
For example, this JSON string:
{noformat}
"Dress up if you dare! Free cover all night! \uD83C\uDF83\uD83D\uDC7B"
{noformat}
results in the following decoded byte sequence for the last four escapes:
{noformat}
3C 83 3D 7B 00
{noformat}
where you can see that it simply drops the high-order byte of each quad. In
this particular example, \uD83C is a high surrogate, which requires some
additional handling. I am not sure what encoding users of the C++ library
expect, but given that we are working with JSON and that Avro C++ uses char
instead of wchar_t, I would assume users expect a UTF-8 encoded string.
However, I could be wrong, since it seems that the *encoder* likes to output
UTF-16BE. There are many examples of decoders that handle this string
properly. I found this one helpful while implementing a fix:
http://rishida.net/tools/conversion/
For basics on UTF-8, see http://www.utf-8.com/
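As an illustration of the additional handling mentioned above, here is a
minimal sketch of combining a surrogate pair and emitting UTF-8. It assumes
UTF-8 is the desired output encoding, which the description itself leaves
open. The helper names (isHighSurrogate, isLowSurrogate, combineSurrogates,
appendUtf8) are hypothetical and do not come from JsonIO.cc, and this is not
necessarily the fix that was applied.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helpers for illustration only; not part of Avro.
inline bool isHighSurrogate(uint32_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(uint32_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a UTF-16 surrogate pair (two \uXXXX quads) into one code point.
uint32_t combineSurrogates(uint32_t high, uint32_t low) {
    if (!isHighSurrogate(high) || !isLowSurrogate(low)) {
        throw std::invalid_argument("not a valid surrogate pair");
    }
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

// Append the UTF-8 encoding of code point cp to out (1 to 4 bytes).
void appendUtf8(std::string& out, uint32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        if (isHighSurrogate(cp) || isLowSurrogate(cp)) {
            throw std::invalid_argument("unpaired surrogate");
        }
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        throw std::out_of_range("code point above U+10FFFF");
    }
}
```

With these helpers, the pair \uD83C\uDF83 from the example combines to U+1F383
and encodes as the four bytes F0 9F 8E 83, and \uD83D\uDC7B combines to
U+1F47B and encodes as F0 9F 91 BB.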
was:
The parser in JsonIO.cc does not handle decoding a multibyte unicode character
into any kind of valid character encoding for a std::string in c++. The
following snippet from JsonParser::tryString() has several flaws:
1. sv is a std::string used as a vector, where each unit is a char
2. a single unicode hex quad encoded in JSON can represent a 16-bit value
3. a unicode hex quad can represent a "high surrogate" character meaning that
it must be combined with the following quad to derive the full unicode code
point
4. \U is not a valid unicode escape for JSON (see
http://www.ietf.org/rfc/rfc4627.txt)
{code:title=JsonIO.cc}
case 'u':
case 'U':
    {
        unsigned int n = 0;
        char e[4];
        in_.readBytes(reinterpret_cast<uint8_t*>(e), 4);
        for (int i = 0; i < 4; i++) {
            n *= 16;
            char c = e[i];
            if (isdigit(c)) {
                n += c - '0';
            } else if (c >= 'a' && c <= 'f') {
                n += c - 'a' + 10;
            } else if (c >= 'A' && c <= 'F') {
                n += c - 'A' + 10;
            } else {
                throw unexpected(c);
            }
        }
        sv.push_back(n);
    }
{code}
This code loop creates a temporary int then decodes the quad into it and then
simply pushes the int (which may be a 16-bit value) onto the std::string. This
essentially means that the JSON parser does not decode any unicode characters.
For example, this JSON string:
{noformat}
"Dress up if you dare! Free cover all night! \uD83C\uDF83\uD83D\uDC7B"
{noformat}
results in a decoded byte sequence for the last 4 characters:
{noformat}
3C 83 3D 7B 00
{noformat}
where you can see that it simply drops the high order bytes. In this particular
example, \uD83C is a high-surrogate character which requires some additional
handling. I am not sure what users of the c++ library expect the encoding to
be, but given that we are working with json, I would assume users would expect
a UTF-8 encoded string. There are many examples of decoders that handle this
string properly - I found this one helpful while implementing a fix:
http://rishida.net/tools/conversion/
For basics on UTF-8 http://www.utf-8.com/
> C++ json parser fails to decode multibyte unicode code points
> -------------------------------------------------------------
>
> Key: AVRO-1190
> URL: https://issues.apache.org/jira/browse/AVRO-1190
> Project: Avro
> Issue Type: Bug
> Components: c++
> Affects Versions: 1.7.0
> Reporter: Keh-Li Sheng