Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20796#discussion_r175589638 --- Diff: common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java --- @@ -791,4 +795,21 @@ public void trimRightWithTrimString() { assertEquals(fromString("头"), fromString("头a???/").trimRight(fromString("æ°?/*&^%a"))); assertEquals(fromString("头"), fromString("头æ°bæ°æ° [").trimRight(fromString(" []æ°b"))); } + + @Test + public void skipWrongFirstByte() { + int[] wrongFirstBytes = { --- End diff -- The bytes are not filtered by UTF8String methods. For instance, in the case of csv datasource the invalid bytes are just passed to the final result. See https://issues.apache.org/jira/browse/SPARK-23649 I have created a separate ticket to fix the issue: https://issues.apache.org/jira/browse/SPARK-23741 . I am not sure that the issue of output of wrong UTF-8 chars should be addressed by this PR (this pr just fixes crashes on wrong input) because it could impact on users and other Spark components. Need to discuss it and test it carefully.
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org