Himanshu Arora created SPARK-38801:
--------------------------------------

             Summary: ISO-8859-1 encoding doesn't work for text format
                 Key: SPARK-38801
                 URL: https://issues.apache.org/jira/browse/SPARK-38801
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.2.1
         Environment: I tested this issue on Databricks runtime 10.3 (spark 
3.2.1, scala 2.12)
            Reporter: Himanshu Arora


When reading text files that are not UTF-8 encoded, Spark does not handle 
foreign-language characters correctly (for example, French characters like è 
and é): they are all replaced by �. In my case the text files were 
ISO-8859-1 encoded.

After digging into the docs, it appears that Spark still uses Hadoop's 
LineRecordReader class for the text format, which only supports UTF-8. Here is 
the source code of that class: 
[LineRecordReader.java|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L154]
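The root cause can be illustrated outside Spark: decoding ISO-8859-1 bytes as if they were UTF-8 (which is effectively what a UTF-8-only reader does) turns each accented character into the replacement character �. A minimal sketch in plain Python:

{code:python}
# "données" encoded in ISO-8859-1: é becomes the single byte 0xE9.
raw = "données".encode("iso-8859-1")

# Decoding those bytes as UTF-8 (the only charset the text reader
# supports) yields U+FFFD for the invalid byte sequence.
garbled = raw.decode("utf-8", errors="replace")
print(garbled)  # donn�es

# Decoding with the correct charset recovers the original word.
print(raw.decode("iso-8859-1"))  # données
{code}

This reproduces exactly the corruption shown in the screenshots below.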

 

You can see this issue in the screenshot below:

!image-2022-04-06-09-30-21-751.png!

As you can see, the French word *données* is read as {*}donn�es{*}. The word 
*Clôturé* is read as {*}Cl�tur�{*}.

 

I also read the same text file in CSV format while providing the correct 
charset option, and in that case it works fine, as you can see in the 
screenshot below:

!image-2022-04-06-09-31-45-062.png!

 

So this issue is specific to the text format, which is why I am reporting it. 
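Until the text format accepts an encoding option the way the CSV reader does (e.g. {{spark.read.option("encoding", "ISO-8859-1").csv(path)}}), one interim workaround is to read the raw bytes and decode them with the correct charset yourself, for example via an RDD of binary files. A hedged sketch of that pattern in plain Python (the function name here is illustrative, not a Spark API):

{code:python}
import os
import tempfile
from pathlib import Path

def read_text_with_charset(path, charset="iso-8859-1"):
    """Read a text file in the given charset and return its lines."""
    return Path(path).read_text(encoding=charset).splitlines()

# Demo: write an ISO-8859-1 file, then read it back with the right charset.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as f:
    f.write("Clôturé\ndonnées\n".encode("iso-8859-1"))
    tmp = f.name

print(read_text_with_charset(tmp))  # ['Clôturé', 'données']
os.unlink(tmp)
{code}

In Spark the equivalent would be to load the bytes unparsed and apply the decode step per file or per line before any string processing.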



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
