Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20796#discussion_r175589638
--- Diff:
common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
---
@@ -791,4 +795,21 @@ public void trimRightWithTrimString() {
assertEquals(fromString("头"),
fromString("头a???/").trimRight(fromString("�/*&^%a")));
assertEquals(fromString("头"), fromString("头æ°bæ°æ°[").trimRight(fromString(" []æ°b")));
}
+
+ @Test
+ public void skipWrongFirstByte() {
+ int[] wrongFirstBytes = {
--- End diff --
The bytes are not filtered by UTF8String methods. For instance, in the case
of the CSV datasource, invalid bytes are just passed through to the final
result. See https://issues.apache.org/jira/browse/SPARK-23649
I have created a separate ticket to fix that issue:
https://issues.apache.org/jira/browse/SPARK-23741 .
I am not sure the output of invalid UTF-8 characters should be addressed by
this PR (this PR only fixes crashes on malformed input), because a change
there could impact users and other Spark components. It needs to be
discussed and tested carefully.
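For context, a minimal sketch of the kind of first-byte classification the test above exercises (this is an illustration following the UTF-8 spec, not Spark's actual `UTF8String.numBytesForFirstByte` implementation; the class name `Utf8FirstByte` is made up here):

```java
// Sketch: classify a byte as a UTF-8 lead byte and return the expected
// sequence length, or -1 if the byte can never start a well-formed
// UTF-8 sequence. Not Spark's implementation, just the spec rules.
public class Utf8FirstByte {
    static int sequenceLength(int b) {
        b &= 0xFF;                            // treat as unsigned byte
        if (b <= 0x7F) return 1;              // ASCII
        if (b >= 0xC2 && b <= 0xDF) return 2; // 2-byte lead
        if (b >= 0xE0 && b <= 0xEF) return 3; // 3-byte lead
        if (b >= 0xF0 && b <= 0xF4) return 4; // 4-byte lead
        // 0x80-0xBF are continuation bytes; 0xC0, 0xC1, 0xF5-0xFF are
        // illegal everywhere in well-formed UTF-8.
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength(0x41)); // 'A' -> 1
        System.out.println(sequenceLength(0xE5)); // lead of e.g. 头 -> 3
        System.out.println(sequenceLength(0x80)); // continuation -> -1
        System.out.println(sequenceLength(0xFF)); // illegal -> -1
    }
}
```

A method like this is what lets string traversal skip (or count as single garbage bytes) positions where `sequenceLength` returns -1 instead of crashing.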
---