[ 
http://issues.apache.org/jira/browse/HADOOP-550?page=comments#action_12436412 ] 
            
Hairong Kuang commented on HADOOP-550:
--------------------------------------

In Java, supplementary characters, i.e., code points greater than 
U+FFFF, are represented by a pair of char values called surrogates. If a Text 
object is constructed from a String containing an unpaired surrogate, 
a CharacterCodingException is thrown.
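
To illustrate the failure mode, here is a minimal sketch that uses only java.nio, not Hadoop itself. It assumes Text encodes the String with a strict UTF-8 CharsetEncoder; the class name StrictUtf8Check and the method isEncodable are hypothetical names chosen for this example.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Check {
    // Returns true if s encodes to UTF-8 with no malformed input,
    // false if the strict encoder reports a CharacterCodingException.
    public static boolean isEncodable(String s) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+1F600 as a proper high/low surrogate pair.
        String paired = "a\uD83D\uDE00b";
        // The same high surrogate with no low surrogate following it.
        String unpaired = "a\uD83D" + "b";
        System.out.println("paired encodable:   " + isEncodable(paired));
        System.out.println("unpaired encodable: " + isEncodable(unpaired));
    }
}
```

A String like the unpaired one can easily come from user-supplied data that was truncated or concatenated in the middle of a surrogate pair.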

I agree with Sameer that a Text object should contain valid UTF-8. So by 
default, we should probably replace malformed input with "\uFFFD", as Java does, 
instead of throwing an exception. Then everything should work.
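
A rough sketch of that default, again with java.nio only and not the actual Text code: the encoder is told to replace malformed input with the UTF-8 encoding of U+FFFD instead of reporting it. The class name LenientUtf8 and its encode method are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class LenientUtf8 {
    // UTF-8 encoding of U+FFFD (REPLACEMENT CHARACTER).
    private static final byte[] REPLACEMENT =
            {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD};

    // Encodes s as UTF-8, substituting U+FFFD for unpaired surrogates.
    public static byte[] encode(String s) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(REPLACEMENT);
        try {
            ByteBuffer buf = encoder.encode(CharBuffer.wrap(s));
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE actions; rethrow unchecked just in case.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = encode("a\uD83D" + "b");
        String roundTrip = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals("a\uFFFDb"));
    }
}
```

The trade-off is that the round trip is lossy: the String you get back differs from the one you put in wherever a replacement happened, so callers who need to detect bad input would still want a strict mode.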

> Text constructor can throw exception
> ------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Bryan Pendleton
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. 
> This time, it's better defined - constructing a Text from a string extracted 
> from Real World data makes the Text object constructor throw a 
> CharacterCodingException. This may be legit - I don't actually understand UTF 
> well enough to understand what's wrong with the supplied string. I'm 
> assembling a series of strings, some of which are user-supplied, and 
> something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual 
> data someplace - I need the container to *do* it. If user-supplied inputs 
> can't be stored as a "UTF" aware text value, then another container needs to 
> be brought into existence. Sure, I can use a BytesWritable, but, as its name 
> implies - Text should handle "text". If Text is supposed to == 
> "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks behind the bleeding edge at this point, so 
> maybe my particular Text bug has been fixed, though the only fixes to Text I 
> see are adopting it into more of the internals of Hadoop. This argument goes 
> double in that case - if we're using Text objects internally, it should 
> really be a totally solid object - construct one from a String, get one back, 
> but _never_ throw a content-related Exception. Or, if Text is not the right 
> object because it's data-sensitive, then I argue we shouldn't use it in any 
> case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
