[jira] [Commented] (SPARK-13108) Encoding not working with non-ascii compatible encodings (UTF-16/32 etc.)

Hyukjin Kwon (JIRA) Mon, 01 Feb 2016 20:43:16 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-13108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127656#comment-15127656
 ]


Hyukjin Kwon commented on SPARK-13108:
--------------------------------------

Sure. Supporting these encodings would require rewriting Hadoop's 
LineRecordReader, LineReader and TextInputFormat. I don't think this is much 
work (I could do it within a day, though given my schedule it would more likely 
be within this week), but would it be worth implementing those classes for only 
a few encodings?

For my part, I am not too sure, so I will follow your decision.

> Encoding not working with non-ascii compatible encodings (UTF-16/32 etc.)
> -------------------------------------------------------------------------
>
>                 Key: SPARK-13108
>                 URL: https://issues.apache.org/jira/browse/SPARK-13108
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Minor
>
> This library uses Hadoop's 
> [{{TextInputFormat}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java],
>  which uses 
> [{{LineRecordReader}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java].
> According to 
> [MAPREDUCE-232|https://issues.apache.org/jira/browse/MAPREDUCE-232], it looks like 
> [{{TextInputFormat}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java]
>  does not guarantee support for all encodings; officially it supports only UTF-8 (as 
> commented in 
> [{{LineRecordReader#L147}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L147]).
> According to 
> [MAPREDUCE-232#comment-13183601|https://issues.apache.org/jira/browse/MAPREDUCE-232?focusedCommentId=13183601&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13183601],
>  it still appears to work with most encodings, but not with UTF-16/32.
> In more detail, I tested this on Mac OS. I converted `cars_iso-8859-1.csv` into 
> `cars_utf-16.csv` as below:
> {code}
> iconv -f iso-8859-1 -t utf-16 < cars_iso-8859-1.csv > cars_utf-16.csv
> {code}
> and ran the code below:
> {code}
> val cars = "cars_utf-16.csv"
> sqlContext.read
>   .format("csv")
>   .option("charset", "utf-16")
>   .option("delimiter", "þ")
>   .load(cars)
>   .show()
> {code}
> This produces the wrong results below:
> {code}
> +----+-----+-----+--------------------+------+
> |year| make|model|             comment|blank�|
> +----+-----+-----+--------------------+------+
> |2012|Tesla|    S|          No comment|     �|
> |   �| null| null|                null|  null|
> |1997| Ford| E350|Go get one now th...|     �|
> |2015|Chevy|Volt�|                null|  null|
> |   �| null| null|                null|  null|
> +----+-----+-----+--------------------+------+
> {code}
> Instead of the correct results below:
> {code}
> +----+-----+-----+--------------------+-----+
> |year| make|model|             comment|blank|
> +----+-----+-----+--------------------+-----+
> |2012|Tesla|    S|          No comment|     |
> |1997| Ford| E350|Go get one now th...|     |
> |2015|Chevy| Volt|                null| null|
> +----+-----+-----+--------------------+-----+
> {code}
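
A likely explanation for the stray characters in the wrong output is that a byte-oriented line reader splits records on the single byte 0x0A, whereas UTF-16 encodes a newline as two bytes. The sketch below illustrates that effect in isolation. It is not Hadoop's code: the object and method names are hypothetical, and it assumes UTF-16BE input without a byte order mark.

```scala
import java.nio.charset.{Charset, StandardCharsets}
import scala.collection.mutable.ArrayBuffer

// Hypothetical demo object; none of these names come from Hadoop or Spark.
object Utf16LineSplitDemo {
  // Assumption: big-endian UTF-16 without a BOM.
  val charset: Charset = StandardCharsets.UTF_16BE

  // Split raw bytes on the single byte 0x0A, as a reader that assumes an
  // ASCII-compatible encoding would do.
  def splitOnNewlineByte(bytes: Array[Byte]): Seq[Array[Byte]] = {
    val pieces = ArrayBuffer.empty[Array[Byte]]
    var start = 0
    for (i <- bytes.indices if bytes(i) == 0x0A.toByte) {
      pieces += bytes.slice(start, i)
      start = i + 1
    }
    pieces += bytes.slice(start, bytes.length)
    pieces.toSeq
  }

  // Byte-level split first, then decode each piece separately. The 0x00 half
  // of the two-byte newline is left attached to the preceding piece.
  def byteSplitLines(text: String): Seq[String] =
    splitOnNewlineByte(text.getBytes(charset)).map(new String(_, charset))

  // Decode the whole buffer first, then split on the newline character.
  def decodeThenSplit(text: String): Seq[String] =
    new String(text.getBytes(charset), charset).split("\n").toSeq

  def main(args: Array[String]): Unit = {
    val sample = "year\n2012"
    println(byteSplitLines(sample))  // first line carries a stray trailing character
    println(decodeThenSplit(sample)) // lines come back intact
  }
}
```

Decoding before splitting avoids the problem in this small example, but a real input format also has to handle file splits that begin in the middle of a multi-byte character, which is presumably why the reader classes themselves would need to change.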



