[ https://issues.apache.org/jira/browse/SPARK-13108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127656#comment-15127656 ]
Hyukjin Kwon commented on SPARK-13108:
--------------------------------------

Sure. This would require rewriting Hadoop's LineRecordReader, LineReader and TextInputFormat. I don't think it is a large amount of work (I could probably do it within a day, but given my schedule, more likely within this week). But would it be worth reimplementing those classes just to support a few encodings? I am not sure myself, so I will follow your decision.

> Encoding not working with non-ascii compatible encodings (UTF-16/32 etc.)
> -------------------------------------------------------------------------
>
> Key: SPARK-13108
> URL: https://issues.apache.org/jira/browse/SPARK-13108
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Hyukjin Kwon
> Priority: Minor
>
> This library uses Hadoop's
> [{{TextInputFormat}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java],
> which uses
> [{{LineRecordReader}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java].
> According to
> [MAPREDUCE-232|https://issues.apache.org/jira/browse/MAPREDUCE-232], it looks like
> [{{TextInputFormat}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java]
> does not guarantee support for all encodings; officially it supports only UTF-8 (as
> commented in
> [{{LineRecordReader#L147}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L147]).
> According to
> [MAPREDUCE-232#comment-13183601|https://issues.apache.org/jira/browse/MAPREDUCE-232?focusedCommentId=13183601&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13183601],
> it still seems to work with most encodings, but not with UTF-16/32.
> In more detail,
> I tested this on Mac OS. I converted `cars_iso-8859-1.csv` into
> `cars_utf-16.csv` as below:
> {code}
> iconv -f iso-8859-1 -t utf-16 < cars_iso-8859-1.csv > cars_utf-16.csv
> {code}
> and ran the code below:
> {code}
> val cars = "cars_utf-16.csv"
> sqlContext.read
>   .format("csv")
>   .option("charset", "utf-16")
>   .option("delimiter", "þ")
>   .load(cars)
>   .show()
> {code}
> This produces the wrong results below:
> {code}
> +----+-----+-----+--------------------+------+
> |year| make|model|             comment|blank�|
> +----+-----+-----+--------------------+------+
> |2012|Tesla|    S|          No comment|     �|
> |   �| null| null|                null|  null|
> |1997| Ford| E350|Go get one now th...|     �|
> |2015|Chevy|Volt�|                null|  null|
> |   �| null| null|                null|  null|
> +----+-----+-----+--------------------+------+
> {code}
> Instead of the correct results below:
> {code}
> +----+-----+-----+--------------------+-----+
> |year| make|model|             comment|blank|
> +----+-----+-----+--------------------+-----+
> |2012|Tesla|    S|          No comment|     |
> |1997| Ford| E350|Go get one now th...|     |
> |2015|Chevy| Volt|                null| null|
> +----+-----+-----+--------------------+-----+
> {code}
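
For illustration, here is a minimal sketch of why splitting on a single newline byte is likely what goes wrong with UTF-16 input. It is not Hadoop's actual code: `Utf16LineSplitSketch` and `splitOnNewlineByte` are hypothetical names, the sample text is made up, and it assumes big-endian UTF-16 without a BOM (the byte order produced by `iconv` in the report above is not stated).

```scala
import java.nio.charset.StandardCharsets.UTF_16BE

object Utf16LineSplitSketch {
  // Hypothetical stand-in for a byte-oriented line reader:
  // cut the input at every 0x0A byte, ignoring the character encoding.
  def splitOnNewlineByte(bytes: Array[Byte]): List[Array[Byte]] = {
    val idx = bytes.indexOf(0x0A.toByte)
    if (idx < 0) List(bytes)
    else bytes.take(idx) :: splitOnNewlineByte(bytes.drop(idx + 1))
  }

  // Made-up sample data; every character takes two bytes in UTF-16BE,
  // so "\n" is encoded as the byte pair 0x00 0x0A.
  val text = "year,make\n2012,Tesla\n"
  val encoded: Array[Byte] = text.getBytes(UTF_16BE)

  // Byte-level split: the 0x00 half of each newline stays attached to the
  // preceding chunk, so the chunks have an odd number of bytes and can no
  // longer be decoded cleanly as UTF-16.
  val byteChunks: List[Array[Byte]] = splitOnNewlineByte(encoded)

  // Character-level split: decode the whole input first, then split.
  val charLines: List[String] = new String(encoded, UTF_16BE).split("\n").toList

  def main(args: Array[String]): Unit = {
    println(byteChunks.map(_.length)) // byte lengths of the chunks
    println(charLines)                // lines recovered after decoding first
  }
}
```

The leftover half of each newline is consistent with the stray `�` cells and extra rows in the output above, although that link is an inference from this sketch rather than something verified against Hadoop's LineReader.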