edponce commented on a change in pull request #11233:
URL: https://github.com/apache/arrow/pull/11233#discussion_r718817723



##########
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##########
@@ -724,7 +724,7 @@ TYPED_TEST(TestStringKernels, IsUpperAscii) {
                    "[false, null, false, true, true, false, false]");
 }
 
-TYPED_TEST(TestStringKernels, MatchSubstring) {
+TYPED_TEST(TestBinaryKernels, MatchSubstring) {

Review comment:
       Ok, so I noticed that there are not binary (invalid UTF8) tests for 
string functions. As a starting point, I chose `binary_length` as an "easy" 
function to test,
   ```c++
   // These cases pass successfully
   this->CheckUnary("binary_length", "[\"\x84\xf0\x2f\", 
\"\xff\x9b\xc3\xbb\"]", this->offset_type(), "[3, 4]");
   ```
   but the JSON parser errors out in many cases (e.g., `\x84\x00`, `\x84\x01`) 
with one of the following messages:
   ```c++
   Invalid: Expected string or null, got JSON type 6
   Invalid: JSON parse error at offset 3: Invalid escape character in string.
   Invalid: JSON parse error at offset 3: Missing a closing quotation mark in 
string.
   ...
   ```
   I suspect that the successful cases are valid UTF-8 (need to check), but the 
failed tests are definitely not valid UTF-8.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to