Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20796#discussion_r175589638
--- Diff:
common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
---
@@ -791,4 +795,21 @@ public void trimRightWithTrimString() {
assertEquals(fromString("头"),
fromString("头a???/").trimRight(fromString("�/*&^%a")));
assertEquals(fromString("头"), fromString("头æ°bæ°æ°[").trimRight(fromString(" []æ°b")));
}
+
+ @Test
+ public void skipWrongFirstByte() {
+ int[] wrongFirstBytes = {
--- End diff --
The bytes are not filtered by UTF8String methods. For instance, in the case
of the CSV datasource, invalid bytes are just passed through to the final
result. See https://issues.apache.org/jira/browse/SPARK-23649
I have created a separate ticket to fix that issue:
https://issues.apache.org/jira/browse/SPARK-23741 .
I am not sure the output of invalid UTF-8 characters should be addressed by
this PR (this PR only fixes crashes on malformed input), because a change
there could impact users and other Spark components. It needs to be
discussed and tested carefully.
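For context, a minimal sketch of the kind of first-byte classification the test above exercises (this is an illustration following the UTF-8 spec, not Spark's actual `UTF8String.numBytesForFirstByte` implementation; the class name `Utf8FirstByte` is made up here):

```java
// Sketch: classify a byte as a UTF-8 lead byte and return the expected
// sequence length, or -1 if the byte can never start a well-formed
// UTF-8 sequence. Not Spark's implementation, just the spec rules.
public class Utf8FirstByte {
    static int sequenceLength(int b) {
        b &= 0xFF;                            // treat as unsigned byte
        if (b <= 0x7F) return 1;              // ASCII
        if (b >= 0xC2 && b <= 0xDF) return 2; // 2-byte lead
        if (b >= 0xE0 && b <= 0xEF) return 3; // 3-byte lead
        if (b >= 0xF0 && b <= 0xF4) return 4; // 4-byte lead
        // 0x80-0xBF are continuation bytes; 0xC0, 0xC1, 0xF5-0xFF are
        // illegal everywhere in well-formed UTF-8.
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength(0x41)); // 'A' -> 1
        System.out.println(sequenceLength(0xE5)); // lead of e.g. 头 -> 3
        System.out.println(sequenceLength(0x80)); // continuation -> -1
        System.out.println(sequenceLength(0xFF)); // illegal -> -1
    }
}
```

A method like this is what lets string traversal skip (or count as single garbage bytes) positions where `sequenceLength` returns -1 instead of crashing.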
---