GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/20796

    [SPARK-23649][SQL] Prevent crashes on schema inferring of CSV containing 
wrong UTF-8 chars

    ## What changes were proposed in this pull request?
    
    The mapping of a UTF-8 char's first byte to the char's size does not cover
    the whole range 0-255. It is defined only for 0-253:
    
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L60-L65
    
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L190
    
    If the first byte of a char is 253-255, IndexOutOfBoundsException is
    thrown. Besides that, the values for 244-252 are not correct according to
    the recent Unicode standard for UTF-8:
http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf
    
    As a consequence of the exception above, the length of an input string in
    UTF-8 encoding cannot be calculated if the string contains chars whose
    first byte is 253 or greater. On the user's side this shows up, for
    example, as a crash during schema inference of a CSV file which contains
    such chars. The file can still be read if the schema is specified
    explicitly or if the mode is set to multiline.
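
    To illustrate the failure mode described above, here is a minimal sketch.
    It assumes a hypothetical partial table (the class name, constants, and
    table bounds are illustrative and are not copied from UTF8String.java):
    when the lookup has no upper-bound check, a single out-of-range first byte
    makes the whole length calculation throw.

```java
// Hypothetical sketch of a size lookup that covers only part of the byte range.
final class PartialTableSketch {
    // Assumed for illustration: the table holds sizes for first bytes 0xC0..0xF7 only.
    private static final int FIRST_COVERED = 0xC0;
    private static final int[] SIZES = new int[0xF8 - 0xC0];

    static {
        for (int i = 0; i < SIZES.length; i++) {
            int b = FIRST_COVERED + i;
            SIZES[i] = b < 0xE0 ? 2 : (b < 0xF0 ? 3 : 4);
        }
    }

    static int numBytesForFirstByte(byte b) {
        // Java bytes are signed, so mask with 0xFF to get the value 0..255.
        int offset = (b & 0xFF) - FIRST_COVERED;
        // No upper-bound check: a byte past the end of the table throws
        // ArrayIndexOutOfBoundsException.
        return offset >= 0 ? SIZES[offset] : 1;
    }

    static int numChars(byte[] bytes) {
        int len = 0;
        // Jump from first byte to first byte, counting one char per jump.
        for (int i = 0; i < bytes.length; i += numBytesForFirstByte(bytes[i])) {
            len++;
        }
        return len;
    }
}
```

    With this sketch, counting the chars of a plain ASCII array succeeds, while
    any array containing the byte 0xFF fails with an exception instead of
    returning a length.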
    
    The proposed changes build a correct mapping from the first byte of a UTF-8
    char to its size (it now covers all cases) and skip disallowed chars (each
    one is counted as one octet).
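
    The idea can be sketched as follows. This is not the code of the patch: the
    class and method names are hypothetical, and the sizes follow the
    well-formed first-byte ranges of the Unicode Standard (0x00-0x7F,
    0xC2-0xDF, 0xE0-0xEF, 0xF0-0xF4). Every other byte value is assumed to be
    skipped as a single octet.

```java
// Hypothetical sketch: a first-byte-to-size table that covers all 256 byte values.
final class Utf8FirstByteSketch {
    // One entry per possible first-byte value, so no lookup can go out of bounds.
    private static final byte[] SIZE_BY_FIRST_BYTE = new byte[256];

    static {
        for (int b = 0; b < 256; b++) {
            if (b <= 0x7F) {
                SIZE_BY_FIRST_BYTE[b] = 1;   // ASCII
            } else if (b >= 0xC2 && b <= 0xDF) {
                SIZE_BY_FIRST_BYTE[b] = 2;   // two-byte sequence
            } else if (b >= 0xE0 && b <= 0xEF) {
                SIZE_BY_FIRST_BYTE[b] = 3;   // three-byte sequence
            } else if (b >= 0xF0 && b <= 0xF4) {
                SIZE_BY_FIRST_BYTE[b] = 4;   // four-byte sequence
            } else {
                // Continuation bytes and bytes disallowed in UTF-8:
                // skip them as one octet (assumption of this sketch).
                SIZE_BY_FIRST_BYTE[b] = 1;
            }
        }
    }

    static int numBytesForFirstByte(byte b) {
        // Mask with 0xFF because Java bytes are signed.
        return SIZE_BY_FIRST_BYTE[b & 0xFF];
    }

    static int numChars(byte[] bytes) {
        int len = 0;
        for (int i = 0; i < bytes.length; i += numBytesForFirstByte(bytes[i])) {
            len++;
        }
        return len;
    }
}
```

    In this sketch a byte such as 0xFF contributes exactly one char to the
    length, so the calculation always terminates with a result.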
    
    ## How was this patch tested?
    
    Added a test and a file containing 0xFF, a char which is disallowed in UTF-8.
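
    For readers who want to reproduce such an input, here is a small sketch
    that builds CSV content with an embedded 0xFF byte and confirms that a
    strict UTF-8 decoder rejects it. The class name, the column name, and the
    row content are hypothetical and are not taken from the test file added in
    this pull request.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical fixture builder; not the test resource from the patch.
final class BadUtf8FixtureSketch {
    // A header line plus one row whose value contains the byte 0xFF.
    static byte[] csvBytes() {
        return new byte[] {'v', 'a', 'l', 'u', 'e', '\n', 'a', (byte) 0xFF, 'b', '\n'};
    }

    // A new CharsetDecoder reports malformed input by default, so decode()
    // throws CharacterCodingException on bytes that are not well-formed UTF-8.
    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

    The bytes can be written to disk with java.nio.file.Files.write and then
    used as the input of a CSV read.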

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 skip-wrong-utf8-chars

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20796.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20796
    
----
commit 6d6b3ca8eedc7bf6381cb8d746fe8a3ee0a281b2
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-27T17:41:01Z

    Test: skip bytes disallowed in UTF-8

commit 0f474a033f152340519b10676026b80b7593c2f5
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-27T17:43:53Z

    Making correct map of first byte to char size in UTF-8 and skip bytes 
disallowed in UTF-8

commit 2ee661618a40544a8cbf3ac0794a1a6b86b6b62d
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-28T13:26:21Z

    Check inferred schema and returned bad string

commit d6c5f02ea1a08513a54ea9f3b30986dd92188b3e
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-28T14:04:22Z

    The test csv was simplified

----

