Maxim Gekk created SPARK-23649:
----------------------------------

             Summary: CSV schema inferring fails on some UTF-8 chars
                 Key: SPARK-23649
                 URL: https://issues.apache.org/jira/browse/SPARK-23649
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Maxim Gekk


Schema inference for CSV files fails if the file contains a character whose first byte is 
*0xFF*. 
{code:java}
spark.read.option("header", "true").csv("utf8xFF.csv")
{code}
{code:java}
java.lang.ArrayIndexOutOfBoundsException: 63
  at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
  at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
{code}
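The index 63 in the exception matches the arithmetic (0xFF & 0xFF) - 192 = 63. The sketch below illustrates how such a failure can arise. It assumes that the length lookup is a 62-entry table covering lead bytes 0xC0 through 0xFD and that the index is computed as the unsigned byte value minus 192. The class name, table, and method names are hypothetical and are not copied from UTF8String.java. The guarded variant shows one possible bounds check, not necessarily the fix adopted for this issue.

```java
public class LeadByteLookup {
    // Assumed table: expected sequence length for lead bytes 0xC0..0xFD (62 entries).
    static final int[] LENGTHS = new int[62];

    static {
        for (int i = 0; i < LENGTHS.length; i++) {
            if (i < 32) {
                LENGTHS[i] = 2;   // 0xC0..0xDF
            } else if (i < 48) {
                LENGTHS[i] = 3;   // 0xE0..0xEF
            } else if (i < 56) {
                LENGTHS[i] = 4;   // 0xF0..0xF7
            } else if (i < 60) {
                LENGTHS[i] = 5;   // 0xF8..0xFB
            } else {
                LENGTHS[i] = 6;   // 0xFC..0xFD
            }
        }
    }

    // Index into the table for a given first byte (negative for bytes below 0xC0).
    static int offsetFor(byte b) {
        return (b & 0xFF) - 192;
    }

    // Unguarded lookup: throws ArrayIndexOutOfBoundsException for 0xFE and 0xFF.
    static int unguarded(byte b) {
        int offset = offsetFor(b);
        return offset >= 0 ? LENGTHS[offset] : 1;
    }

    // Guarded lookup (an assumption, for illustration): bytes outside the table count as 1 byte.
    static int guarded(byte b) {
        int offset = offsetFor(b);
        return (offset >= 0 && offset < LENGTHS.length) ? LENGTHS[offset] : 1;
    }

    public static void main(String[] args) {
        System.out.println("offset for 0xFF: " + offsetFor((byte) 0xFF));
        System.out.println("guarded length for 0xFF: " + guarded((byte) 0xFF));
        try {
            unguarded((byte) 0xFF);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unguarded lookup failed: " + e.getMessage());
        }
    }
}
```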
Here is the content of the file:
{code:java}
hexdump -C ~/tmp/utf8xFF.csv
00000000  63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
00000010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
00000020  2c 34 35 36 0d                                    |,456.|
00000025
{code}
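To reproduce the input without the original file, the 37 bytes from the hexdump can be rebuilt in memory and checked with a strict UTF-8 decoder. This is a minimal sketch using only the JDK. The class and method names are hypothetical, and the strict check is a diagnostic aid, not something Spark itself is claimed to do.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Rebuilds the bytes shown in the hexdump: ASCII text with a single 0xFF after "ABGUN".
    static byte[] sampleBytes() {
        byte[] head = "channel,code\r\nUnited,123\r\nABGUN".getBytes(StandardCharsets.US_ASCII);
        byte[] tail = ",456\r".getBytes(StandardCharsets.US_ASCII);
        byte[] all = new byte[head.length + 1 + tail.length];
        System.arraycopy(head, 0, all, 0, head.length);
        all[head.length] = (byte) 0xFF;
        System.arraycopy(tail, 0, all, head.length + 1, tail.length);
        return all;
    }

    // Returns false if the data contains any malformed UTF-8 sequence.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] data = sampleBytes();
        System.out.println("length: " + data.length);
        System.out.println("valid UTF-8: " + isValidUtf8(data));
    }
}
```

Writing `sampleBytes()` to a file, for example with `java.nio.file.Files.write`, should give an input equivalent to the one in the hexdump.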
Schema inference does not fail in multiline mode:
{code}
spark.read.option("header", "true").option("multiline", "true").csv("utf8xFF.csv")
{code}
{code:java}
+-------+-----+
|channel|code
+-------+-----+
| United| 123
| ABGUN�| 456
+-------+-----+
{code}
Spark can also read the CSV file if the schema is specified explicitly:
{code}
import org.apache.spark.sql.types._
val schema = new StructType().add("channel", StringType).add("code", StringType)
spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
{code}
{code:java}
+-------+----+
|channel|code|
+-------+----+
| United| 123|
| ABGUN�| 456|
+-------+----+
{code}
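The trailing replacement character in "ABGUN�" is consistent with lenient UTF-8 decoding, where a malformed byte is replaced by U+FFFD instead of raising an error. The sketch below shows that behaviour with the JDK's `String` constructor. It is an assumption that the displayed value is produced this way, and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class LenientDecode {
    // Decodes bytes as UTF-8; malformed input is replaced with U+FFFD rather than rejected.
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] field = {'A', 'B', 'G', 'U', 'N', (byte) 0xFF};
        String text = decode(field);
        System.out.println(text.length());
        System.out.println(text.charAt(5) == '\uFFFD');
    }
}
```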


