Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20796#discussion_r175589638
  
    --- Diff: 
common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java 
---
    @@ -791,4 +795,21 @@ public void trimRightWithTrimString() {
         assertEquals(fromString("头"), 
fromString("头a???/").trimRight(fromString("数?/*&^%a")));
         assertEquals(fromString("头"), fromString("头数b数数 
[").trimRight(fromString(" []数b")));
       }
    +
    +  @Test
    +  public void skipWrongFirstByte() {
    +    int[] wrongFirstBytes = {
    --- End diff ---
    
    The bytes are not filtered by UTF8String methods. For instance, in the case of the CSV datasource, the invalid bytes are just passed through to the final result. See https://issues.apache.org/jira/browse/SPARK-23649
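    To illustrate the general behavior (a self-contained JDK sketch, not Spark code — the class name and byte values here are mine, chosen for the example): a byte like 0xFF can never start a valid UTF-8 sequence, yet lenient decoding silently replaces it with U+FFFD instead of rejecting the input, and UTF8String goes further by keeping the raw bytes as-is.
    
    ```java
    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;
    
    public class InvalidUtf8Demo {
        public static void main(String[] args) throws Exception {
            // 0xFF is never a valid first byte of a UTF-8 sequence.
            byte[] invalid = {(byte) 'a', (byte) 0xFF, (byte) 'b'};
    
            // Lenient decoding (what new String(bytes, UTF_8) does):
            // the malformed byte is silently replaced with U+FFFD.
            String lenient = new String(invalid, StandardCharsets.UTF_8);
            System.out.println(lenient.indexOf('\uFFFD'));  // prints 1
    
            // Strict decoding: a malformed first byte is reported as an error.
            try {
                StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(invalid));
                System.out.println("decoded");
            } catch (CharacterCodingException e) {
                System.out.println("malformed");  // this branch is taken
            }
        }
    }
    ```
    
    Spark's UTF8String takes neither route: it wraps the raw bytes without validation, which is why invalid input reaches the final result unchanged.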
    
    I have created a separate ticket to fix the issue: https://issues.apache.org/jira/browse/SPARK-23741.
    
    I am not sure that the issue of outputting invalid UTF-8 characters should be addressed by this PR (this PR just fixes crashes on invalid input), because it could impact users and other Spark components. We need to discuss and test it carefully.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
