[ https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
HanCheol Cho updated SPARK-20336:
---------------------------------
    Description:
I used the spark.read.csv() method with the wholeFile=True option to load data that has multi-line records.
However, non-ASCII characters are not properly loaded.
The following is sample test data:
{code:none}
col1,col2,col3
1,a,text
2,b,テキスト
3,c,텍스트
4,d,"text
テキスト
텍스트"
5,e,last
{code}
When the file is loaded without the wholeFile=True option, non-ASCII characters are shown correctly, although multi-line records are parsed incorrectly, as follows:
{code:none}
testdf_default = spark.read.csv("test.encoding.csv", header=True)
testdf_default.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   a|text|
|   2|   b|テキスト|
|   3|   c| 텍스트|
|   4|   d|text|
|テキスト|null|null|
| 텍스트"|null|null|
|   5|   e|last|
+----+----+----+
{code}
When the wholeFile=True option is used, non-ASCII characters are broken, as follows:
{code:none}
testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, wholeFile=True)
testdf_wholefile.show()
+----+----+--------------------+
|col1|col2|                col3|
+----+----+--------------------+
|   1|   a|                text|
|   2|   b|        ������������|
|   3|   c|           ���������|
|   4|   d|text
������������...|
|   5|   e|                last|
+----+----+--------------------+
{code}
The result is the same even if I use the encoding="UTF-8" option with wholeFile=True.

  was:
I used the spark.read.csv() method with the wholeFile=True option to load data that has multi-line records.
However, non-ASCII characters are not properly loaded.
The following is sample test data:
{code:none}
col1,col2,col3
1,a,text
2,b,テキスト
3,c,텍스트
4,d,"text
テキスト
텍스트
5,e,last
{code}
When the file is loaded without the wholeFile=True option, non-ASCII characters are shown correctly, although multi-line records are parsed incorrectly, as follows:
{code:none}
testdf_default = spark.read.csv("test.encoding.csv", header=True)
testdf_default.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   a|text|
|   2|   b|テキスト|
|   3|   c| 텍스트|
|   4|   d|text|
|テキスト|null|null|
| 텍스트|null|null|
|   5|   e|last|
+----+----+----+
{code}
When the wholeFile=True option is used, non-ASCII characters are broken, as follows:
{code:none}
testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, wholeFile=True)
testdf_wholefile.show()
+----+----+--------------------+
|col1|col2|                col3|
+----+----+--------------------+
|   1|   a|                text|
|   2|   b|        ������������|
|   3|   c|           ���������|
|   4|   d|text
������������...|
+----+----+--------------------+
{code}
The result is the same even if I use the encoding="UTF-8" option with wholeFile=True.

> spark.read.csv() with wholeFile=True option fails to read non ASCII unicode
> characters
> --------------------------------------------------------------------------------------
>
> Key: SPARK-20336
> URL: https://issues.apache.org/jira/browse/SPARK-20336
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Environment: Spark 2.2.0 (master branch is downloaded from GitHub)
> PySpark
> Reporter: HanCheol Cho
>
> I used the spark.read.csv() method with the wholeFile=True option to load data
> that has multi-line records.
> However, non-ASCII characters are not properly loaded.
> The following is sample test data:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
> When the file is loaded without the wholeFile=True option, non-ASCII characters
> are shown correctly, although multi-line records are parsed incorrectly, as
> follows:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b|テキスト|
> |   3|   c| 텍스트|
> |   4|   d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> |   5|   e|last|
> +----+----+----+
> {code}
> When the wholeFile=True option is used, non-ASCII characters are broken, as
> follows:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True,
> wholeFile=True)
> testdf_wholefile.show()
> +----+----+--------------------+
> |col1|col2|                col3|
> +----+----+--------------------+
> |   1|   a|                text|
> |   2|   b|        ������������|
> |   3|   c|           ���������|
> |   4|   d|text
> ������������...|
> |   5|   e|                last|
> +----+----+--------------------+
> {code}
> The result is the same even if I use the encoding="UTF-8" option with
> wholeFile=True.
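
For anyone trying to reproduce the report, the following is a minimal sketch (not part of the original issue) that writes the sample data explicitly as UTF-8 and checks, with Python's standard csv module, that the file really contains five records with one multi-line field. The helper names write_sample, parse_sample and utf8_byte_len are hypothetical and chosen only for this illustration. The sketch also prints nothing about Spark itself; it only shows that the UTF-8 byte lengths of the two non-ASCII strings are 12 and 9, which equals the number of replacement characters in the wholeFile=True output above. That may suggest the bytes are being decoded one at a time instead of as UTF-8 sequences, but this is an inference and is not confirmed anywhere in the report.

```python
import csv
import io

# Sample data from the report: the record with col1 == 4 spans three lines.
SAMPLE = (
    "col1,col2,col3\n"
    "1,a,text\n"
    "2,b,テキスト\n"
    "3,c,텍스트\n"
    '4,d,"text\nテキスト\n텍스트"\n'
    "5,e,last\n"
)


def write_sample(path="test.encoding.csv"):
    """Write the sample as UTF-8 so the file's encoding is unambiguous."""
    # newline="" prevents Python from translating the embedded "\n" characters.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(SAMPLE)


def parse_sample(text=SAMPLE):
    """Parse the sample with the stdlib csv reader (header row included)."""
    return list(csv.reader(io.StringIO(text, newline="")))


def utf8_byte_len(s):
    """Return the number of bytes s occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))
```

Calling write_sample() creates test.encoding.csv in the current directory, which can then be passed to spark.read.csv("test.encoding.csv", header=True, wholeFile=True) as in the report. parse_sample() returns six rows (the header plus five records), with the third field of record 4 holding the full three-line string.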