[ https://issues.apache.org/jira/browse/SPARK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469880#comment-16469880 ]
Apache Spark commented on SPARK-1849:
-------------------------------------

User 'cqzlxl' has created a pull request for this issue:
https://github.com/apache/spark/pull/21287

> sc.textFile does not support non UTF-8 encodings
> ------------------------------------------------
>
>                 Key: SPARK-1849
>                 URL: https://issues.apache.org/jira/browse/SPARK-1849
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Harry Brundage
>            Priority: Major
>         Attachments: encoding_test
>
>
> I'm trying to process a file that isn't valid UTF-8 inside Hadoop using
> Spark via {{sc.textFile()}}. Is this possible, and if not, is this a bug
> that we should fix? It looks like {{HadoopRDD}} uses
> {{org.apache.hadoop.io.Text.toString}} on all the data it ever reads, which
> I believe replaces invalid UTF-8 byte sequences with the UTF-8 replacement
> character, \uFFFD. Some example code mimicking what {{sc.textFile}} does
> underneath:
> {code}
> scala> sc.textFile(path).collect()(0)
> res8: String = ?pple
>
> scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.toString).collect()(0).getBytes()
> res9: Array[Byte] = Array(-17, -65, -67, 112, 112, 108, 101)
>
> scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.getBytes).collect()(0)
> res10: Array[Byte] = Array(-60, 112, 112, 108, 101)
> {code}
> In the example above, the first two snippets show the string and byte
> representations of the same line of text. The string shows a question mark
> where the replacement character landed, and the bytes confirm that
> {{Text.toString}} swapped it in: -17, -65, -67 is the UTF-8 encoding of
> \uFFFD. The third snippet shows what happens if you call {{getBytes}} on
> the {{Text}} object that comes back from Hadoop land: we get the real bytes
> from the file.
> Now, I think this is a bug, though you may disagree. The text inside my
> file is perfectly valid ISO-8859-1 encoded bytes, which I would like to be
> able to rescue and re-encode into UTF-8, because I want my application to
> be smart like that. I think Spark should give me the raw, broken string so
> I can re-encode it, but I can't get at the original bytes in order to guess
> at the source encoding, as they have already been replaced. I'm dealing
> with data from some CDN access logs which are, to put it nicely, diversely
> encoded, but I think this is a use case Spark should fully support. So my
> suggested fix, on which I'd like some guidance, is to change {{textFile}}
> to hand back the raw, possibly broken strings by not forcing {{Text}}'s
> UTF-8 decoding.
> Further compounding this issue is that my application is actually in
> PySpark, but we can talk about how the bytes fly through to Scala land
> after this, if we agree that this is an issue at all.
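For anyone who needs to read such files today, here is a minimal workaround sketch; it is not part of this issue or the linked PR. It uses the same {{sc.hadoopFile}} call shown in the description and decodes the raw bytes with the encoding the file is assumed to be in (ISO-8859-1 here, matching the reporter's case; {{path}} stands for the input file). One gotcha worth noting: {{Text.getBytes}} returns the reused backing array, which may be longer than the current record, so only the first {{getLength}} bytes should be decoded.

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import java.nio.charset.StandardCharsets

// Bypass Text.toString (which substitutes \uFFFD for invalid UTF-8) by
// taking the raw bytes and decoding them with the real source encoding.
val lines = sc
  .hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .map { case (_, text) =>
    // Text reuses its backing array between records; only the first
    // getLength bytes belong to this line, so copy that slice first.
    val bytes = java.util.Arrays.copyOfRange(text.getBytes, 0, text.getLength)
    // ISO-8859-1 is an assumption; substitute whatever the data really is.
    new String(bytes, StandardCharsets.ISO_8859_1)
  }

// With the example line from the description, lines.first() should now
// come back as "Äpple" (-60 is 0xC4, 'Ä' in ISO-8859-1) rather than
// "\uFFFDpple".
{code}

This only helps when the whole file is in one known encoding; it does not address the reporter's harder case of guessing the encoding per record, which still requires access to the raw bytes.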