[ https://issues.apache.org/jira/browse/SPARK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harry Brundage updated SPARK-1849:
----------------------------------

    Description: 
I'm trying to process a file stored in Hadoop that isn't valid UTF-8 data, using Spark 
via {{sc.textFile()}}. Is this possible, and if not, is this a bug that we should fix? 
It looks like {{HadoopRDD}} calls {{org.apache.hadoop.io.Text.toString}} on all the 
data it ever reads, which I believe replaces invalid UTF-8 byte sequences with the 
UTF-8 replacement character, \uFFFD. Here's some example code mimicking what 
{{sc.textFile}} does underneath:

{code}
scala> import org.apache.hadoop.io.{LongWritable, Text}
scala> import org.apache.hadoop.mapred.TextInputFormat

scala> sc.textFile(path).collect()(0)
res8: String = ?pple

scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], 
classOf[Text]).map(pair => pair._2.toString).collect()(0).getBytes()
res9: Array[Byte] = Array(-17, -65, -67, 112, 112, 108, 101)

scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], 
classOf[Text]).map(pair => pair._2.getBytes).collect()(0)
res10: Array[Byte] = Array(-60, 112, 112, 108, 101)
{code}

In the above example, the first two snippets show the string representation and the 
byte representation of the example line of text. The string shows a question mark 
where the replacement character was printed, and the bytes reveal that the 
replacement character (the three-byte UTF-8 sequence {{EF BF BD}}, i.e. -17, -65, -67 
as signed bytes) has been swapped in by {{Text.toString}}. The third snippet shows 
what happens if you call {{getBytes}} on the {{Text}} object that comes back from 
Hadoop land: we get the real bytes in the file out.
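
To make the irreversibility concrete, here's a minimal sketch using only the plain JVM 
charset APIs (no Hadoop involved), reusing the raw bytes from the third snippet. It 
illustrates the effect; it isn't necessarily the exact code path {{Text.toString}} 
takes internally:

{code}
import java.nio.charset.StandardCharsets

// Raw line bytes from the third snippet: 0xC4 is 'Ä' in ISO-8859-1 but is not valid UTF-8.
val raw = Array[Byte](-60, 112, 112, 108, 101)

// Decoding as UTF-8 (roughly what Text.toString does) substitutes U+FFFD, and
// re-encoding that String produces the three-byte sequence EF BF BD (-17, -65, -67).
val lossy = new String(raw, StandardCharsets.UTF_8)         // "\uFFFDpple"
val lossyBytes = lossy.getBytes(StandardCharsets.UTF_8)     // Array(-17, -65, -67, 112, 112, 108, 101)

// Once that substitution happens, the original 0xC4 byte is gone for good. Decoding
// the raw bytes with the charset they were actually written in recovers the text.
val rescued = new String(raw, StandardCharsets.ISO_8859_1)  // "Äpple"
{code}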

Now, I think this is a bug, though you may disagree. The text inside my file is 
perfectly valid ISO-8859-1 encoded bytes, which I would like to be able to rescue and 
re-encode into UTF-8, because I want my application to be smart like that. I think 
Spark should give me the raw, broken data so I can re-encode it myself, but right now 
I can't get at the original bytes in order to guess what the source encoding might 
be, because they have already been replaced. I'm dealing with data from some CDN 
access logs which are, to put it nicely, diversely encoded, but I think this is a use 
case Spark should fully support. So my suggested fix, on which I'd like some 
guidance, is to change {{textFile}} to spit out the broken strings by not going 
through {{Text}}'s lossy UTF-8 decoding.
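
For completeness, here's roughly the workaround available today, and the kind of 
byte-level access I'd like {{textFile}} (or a sibling API) to make easy. It's only a 
sketch: it assumes the whole file really is ISO-8859-1, and it copies out of {{Text}} 
by hand because {{Text.getBytes}} returns the reused backing buffer, of which only 
the first {{getLength}} bytes are valid:

{code}
import java.util.Arrays
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Pull the raw line bytes out ourselves instead of relying on Text.toString.
val rawLines = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .map { case (_, line) => Arrays.copyOfRange(line.getBytes, 0, line.getLength) }

// Re-decode with whatever charset we guess the source to be (ISO-8859-1 here), which
// yields proper Strings that can then be written back out as UTF-8.
val repaired = rawLines.map(bytes => new String(bytes, "ISO-8859-1"))
{code}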

Further compounding this issue is that my application is actually in PySpark, 
but we can talk about how bytes fly through to Scala land after this if we 
agree that this is an issue at all. 

> Broken UTF-8 encoded data gets character replacements and thus can't be 
> "fixed"
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-1849
>                 URL: https://issues.apache.org/jira/browse/SPARK-1849
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Harry Brundage
>             Fix For: 1.0.0, 0.9.1
>
>         Attachments: encoding_test
>



