[
https://issues.apache.org/jira/browse/SPARK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000460#comment-14000460
]
Harry Brundage edited comment on SPARK-1849 at 5/16/14 11:02 PM:
-----------------------------------------------------------------
I disagree - the data isn't badly encoded, just not UTF-8 encoded, which,
when we're talking about data from the internet, really isn't all that
uncommon. You could extend my specific problem of some lines in the source
file being in a different encoding to a file entirely encoded in iso-8859-1,
which is something Spark should probably deal with, considering all the
effort put into supporting Windows. I don't think asking users to drop down
to writing a custom {{InputFormat}} to deal with the realities of large data
is a good move if Spark wants to become the fast and general data processing
engine for large-scale data.
I could certainly use {{sc.hadoopFile}} to load in my data and work with the
{{org.apache.hadoop.io.Text}} objects myself, but A) why force everyone dealing
with this issue to go through the pain of figuring that out, and B) I'm in
PySpark where I can't actually do that without fancy Py4J trickery. I think
encoding issues should be in your face.
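For reference, that workaround looks roughly like this in Scala (just a
sketch, assuming the file really is iso-8859-1; {{path}} is the same as in
my example below):
{code}
import java.nio.charset.StandardCharsets
import java.util.Arrays
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val lines = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  // copy the bytes out first: Hadoop reuses the same Text instance for every record
  .map { case (_, text) => Arrays.copyOf(text.getBytes, text.getLength) }
  // decode with the charset we actually believe the data is in, not Text's forced UTF-8
  .map(bytes => new String(bytes, StandardCharsets.ISO_8859_1))
{code}
Decoding then becomes an explicit choice instead of something baked into
{{textFile}}, but it shouldn't be something every user has to rediscover.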
> Broken UTF-8 encoded data gets character replacements and thus can't be "fixed"
> --------------------------------------------------------------------------------
>
> Key: SPARK-1849
> URL: https://issues.apache.org/jira/browse/SPARK-1849
> Project: Spark
> Issue Type: Bug
> Reporter: Harry Brundage
> Fix For: 1.0.0, 0.9.1
>
> Attachments: encoding_test
>
>
> I'm trying to process a file which isn't valid UTF-8 data inside hadoop using
> Spark via {{sc.textFile()}}. Is this possible, and if not, is this a bug that
> we should fix? It looks like {{HadoopRDD}} uses
> {{org.apache.hadoop.io.Text.toString}} on all the data it ever reads, which I
> believe replaces invalid UTF-8 byte sequences with the UTF-8 replacement
> character, \uFFFD. Some example code mimicking what {{sc.textFile}} does
> underneath:
> {code}
> scala> sc.textFile(path).collect()(0)
> res8: String = ?pple
> scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable],
>          classOf[Text]).map(pair => pair._2.toString).collect()(0).getBytes()
> res9: Array[Byte] = Array(-17, -65, -67, 112, 112, 108, 101)
> scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable],
>          classOf[Text]).map(pair => pair._2.getBytes).collect()(0)
> res10: Array[Byte] = Array(-60, 112, 112, 108, 101)
> {code}
> In the above example, the first two snippets show the string representation
> and byte representation of the example line of text. The third snippet shows
> what happens if you call {{getBytes}} on the {{Text}} object which comes back
> from hadoop land: we get the real bytes in the file out.
> Now, I think this is a bug, though you may disagree. The text inside my file
> is perfectly valid iso-8859-1 encoded bytes, which I would like to be able to
> rescue and re-encode into UTF-8, because I want my application to be smart
> like that. I think Spark should give me the raw broken string so I can
> re-encode, but I can't get at the original bytes in order to guess at what
> the source encoding might be, as they have already been replaced. I'm dealing
> with data from some CDN access logs which are, to put it nicely, diversely
> encoded, but which I think is a use case Spark should fully support. So my
> suggested fix, on which I'd like some guidance, is to change {{textFile}} to
> spit out the broken strings by not forcing {{Text}}'s UTF-8 decoding.
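> As a concrete illustration of the rescue I have in mind (my own sketch here,
> assuming the line really is iso-8859-1): decoding the raw bytes from the
> third snippet with that charset recovers the original text, which can then
> be re-encoded as valid UTF-8.
> {code}
> // the raw bytes from the third snippet, decoded as iso-8859-1 (yields "Äpple")
> new String(Array(-60, 112, 112, 108, 101).map(_.toByte), "ISO-8859-1")
> {code}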
> Further compounding this issue is that my application is actually in PySpark,
> but we can talk about how bytes fly through to Scala land after this if we
> agree that this is an issue at all.
--
This message was sent by Atlassian JIRA
(v6.2#6252)