[
https://issues.apache.org/jira/browse/SPARK-13108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-13108:
---------------------------------
Description:
This library uses Hadoop's
[{{TextInputFormat}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java],
which uses
[{{LineRecordReader}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java].
According to
[MAPREDUCE-232|https://issues.apache.org/jira/browse/MAPREDUCE-232], it looks like
[{{TextInputFormat}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java]
officially guarantees only UTF-8, not all encodings (as commented
in
[{{LineRecordReader#L147}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L147]).
According to
[MAPREDUCE-232#comment-13183601|https://issues.apache.org/jira/browse/MAPREDUCE-232?focusedCommentId=13183601&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13183601],
most other encodings still seem to work, but UTF-16/32 do not.
In more detail,
I tested this on Mac OS. I converted `cars_iso-8859-1.csv` into
`cars_utf-16.csv` as below:
{code}
iconv -f iso-8859-1 -t utf-16 < cars_iso-8859-1.csv > cars_utf-16.csv
{code}
and ran the code below:
{code}
val cars = "cars_utf-16.csv"
sqlContext.read
.format("csv")
.option("charset", "utf-16")
.option("delimiter", "þ")
.load(cars)
.show()
{code}
This produces the wrong results below:
{code}
+----+-----+-----+--------------------+------+
|year| make|model|             comment|blank�|
+----+-----+-----+--------------------+------+
|2012|Tesla|    S|          No comment|     �|
|   �| null| null|                null|  null|
|1997| Ford| E350|Go get one now th...|     �|
|2015|Chevy|Volt�|                null|  null|
|   �| null| null|                null|  null|
+----+-----+-----+--------------------+------+
{code}
Instead of the correct results below:
{code}
+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+
{code}
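The stray characters are consistent with the input being split on the single byte 0x0A before it is decoded. In UTF-16 a newline takes two bytes (00 0A in big-endian), so a byte-level split leaves half of a code unit behind. Below is a minimal sketch of that effect using only the JDK. It is not the actual {{LineRecordReader}} code, and {{Utf16SplitSketch}} and {{splitOnNewlineByte}} are hypothetical names used only for illustration.

```scala
import java.nio.charset.StandardCharsets
import scala.collection.mutable.ArrayBuffer

object Utf16SplitSketch {
  // Split raw bytes on the single byte 0x0A, roughly what a byte-oriented
  // line reader does, dropping the delimiter byte itself.
  // (Hypothetical helper, not Hadoop's implementation.)
  def splitOnNewlineByte(bytes: Array[Byte]): Seq[Array[Byte]] = {
    val chunks = ArrayBuffer.empty[Array[Byte]]
    var start = 0
    for (i <- bytes.indices if bytes(i) == 0x0A.toByte) {
      chunks += bytes.slice(start, i)
      start = i + 1
    }
    chunks += bytes.slice(start, bytes.length)
    chunks.toSeq
  }

  def main(args: Array[String]): Unit = {
    val utf16 = "year\n2012".getBytes(StandardCharsets.UTF_16BE)

    // Split on bytes first, then decode each chunk: the first chunk keeps
    // the leading 0x00 of the newline, so it has an odd number of bytes
    // and cannot decode cleanly.
    val splitThenDecode =
      splitOnNewlineByte(utf16).map(new String(_, StandardCharsets.UTF_16BE))

    // Decode the whole buffer first, then split on the newline character.
    val decodeThenSplit =
      new String(utf16, StandardCharsets.UTF_16BE).split("\n").toSeq

    println(splitThenDecode.map(_.length)) // lengths differ from the line below
    println(decodeThenSplit.map(_.length))
  }
}
```

In this sketch, decoding the whole byte stream with the declared charset before splitting into lines avoids the problem, though that gives up the ability to split the file at arbitrary byte offsets.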
was:
This library uses Hadoop's
[{{TextInputFormat}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java],
which uses
[{{LineRecordReader}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java].
According to
[MAPREDUCE-232|https://issues.apache.org/jira/browse/MAPREDUCE-232], it looks like
[{{TextInputFormat}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java]
officially guarantees only UTF-8, not all encodings (as commented
in
[{{LineRecordReader#L147}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L147]).
According to
[MAPREDUCE-232#comment-13183601|https://issues.apache.org/jira/browse/MAPREDUCE-232?focusedCommentId=13183601&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13183601],
most other encodings still seem to work, but UTF-16/32 do not.
In more detail,
I tested this on Mac OS. I converted `cars_iso-8859-1.csv` into
`cars_utf-16.csv` as below:
{code}
iconv -f iso-8859-1 -t utf-16 < cars_iso-8859-1.csv > cars_utf-16.csv
{code}
and ran the code below:
{code}
val cars = "src/test/resources/cars_utf-16.csv"
sqlContext.csvFile(cars, parserLib = parserLib, charset = "utf-16", delimiter = 'þ').show()
{code}
This produces the wrong results below:
{code}
+----+-----+-----+--------------------+------+
|year| make|model|             comment|blank�|
+----+-----+-----+--------------------+------+
|2012|Tesla|    S|          No comment|     �|
|   �| null| null|                null|  null|
|1997| Ford| E350|Go get one now th...|     �|
|2015|Chevy|Volt�|                null|  null|
|   �| null| null|                null|  null|
+----+-----+-----+--------------------+------+
{code}
Instead of the correct results below:
{code}
+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+
{code}
> Encoding not working with non-ascii compatible encodings (UTF-16/32 etc.)
> -------------------------------------------------------------------------
>
> Key: SPARK-13108
> URL: https://issues.apache.org/jira/browse/SPARK-13108
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Hyukjin Kwon
> Priority: Minor