Re: [PR] fix: handle UTF-8 bytes correctly in JSON parser control character check [tvm-ffi]

via GitHub Wed, 11 Feb 2026 17:17:36 -0800


gemini-code-assist[bot] commented on code in PR #442:
URL: https://github.com/apache/tvm-ffi/pull/442#discussion_r2796310044



##########
src/ffi/extra/json_parser.cc:
##########
@@ -341,7 +341,7 @@ class JSONParserContext {
     // copy over the prefix that was already parsed
     std::string out_str(start_pos + 1, cur_ - start_pos - 1);
     while (cur_ != end_) {
-      if (*cur_ < ' ') {
+      if (static_cast<unsigned char>(*cur_) < ' ') {

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   For consistency with other parts of this file (e.g., `Peek()` and 
`Parse4Hex`), consider using `*reinterpret_cast<const uint8_t*>(cur_)` to treat 
the byte as an unsigned value. This would align with the approach used 
elsewhere in the parser for handling character bytes.
   
   ```c
         if (*reinterpret_cast<const uint8_t*>(cur_) < ' ') {
   ```



##########
src/ffi/extra/json_parser.cc:
##########
@@ -131,7 +131,7 @@ class JSONParserContext {
         ++cur_;
         return true;
       }
-      if (*cur_ < ' ' || *cur_ == '\\') {
+      if (static_cast<unsigned char>(*cur_) < ' ' || *cur_ == '\\') {

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   For consistency with other parts of this file (e.g., `Peek()` and 
`Parse4Hex`), consider using `*reinterpret_cast<const uint8_t*>(cur_)` to treat 
the byte as an unsigned value. While `static_cast<unsigned char>(*cur_)` is 
correct, using a consistent approach for byte interpretation can improve code 
clarity and maintainability.
   
   ```c
         if (*reinterpret_cast<const uint8_t*>(cur_) < ' ' || *cur_ == '\') {
   ```



##########
tests/cpp/extra/test_json_parser.cc:
##########
@@ -290,6 +290,12 @@ TEST(JSONParser, UnicodeEdgeCases) {
             u8"\U0001F600\U0001F601");
 }
 
+TEST(JSONParser, RawUTF8Bytes) {
+  // Regression test: raw UTF-8 bytes (>= 0x80) must not be rejected as 
control characters.
+  // This failed when the parser used signed char comparison: *cur_ < ' '
+  EXPECT_EQ(json::Parse("\"\xE4\xB8\xAD\xE6\x96\x87\"").cast<String>(), 
"\xE4\xB8\xAD\xE6\x96\x87");

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   This regression test is great for verifying the fix. To make it more robust, 
consider adding a few more assertions to cover a wider range of scenarios. 
Using `u8` string literals can also improve readability while testing the same 
underlying byte sequences.
   
   ```suggestion
     EXPECT_EQ(json::Parse(u8"\"中文\"").cast<String>(), u8"中文");
     // Test with a mix of ASCII and non-ASCII characters.
     EXPECT_EQ(json::Parse(u8"\"a中b\"").cast<String>(), u8"a中b");
     // Test with a two-byte UTF-8 character (cent sign).
     EXPECT_EQ(json::Parse(u8"\"¢\"").cast<String>(), u8"¢");
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix: handle UTF-8 bytes correctly in JSON parser control character check [tvm-ffi]

Reply via email to