[ http://issues.apache.org/jira/browse/HADOOP-550?page=comments#action_12436728 ]
Addison Phillips commented on HADOOP-550:
-----------------------------------------

I had a hand in advising about this code (and wrote some of it). I agree with 
Doug Cutting that the current implementation is rather too strict. The 
CodingErrorAction of REPLACE is probably preferable as a default action.
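A minimal sketch of what a REPLACE-based default might look like, using NIO's CharsetDecoder (a hypothetical helper for illustration, not Text's actual API):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class LenientDecode {
    // Hypothetical helper: decode bytes as UTF-8, substituting U+FFFD
    // for malformed input (CodingErrorAction.REPLACE) instead of throwing.
    static String decodeLenient(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws Exception {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bad = { 'a', (byte) 0xFF, 'b' };
        System.out.println(decodeLenient(bad));  // "a\uFFFDb"
    }
}
```

With REPLACE configured, the decoder never raises CharacterCodingException for malformed input; each bad sequence simply becomes U+FFFD.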

One important difference between Text's implementation of UTF-8 and Java's 
String class: Text validates against non-shortest-form UTF-8, while String does 
not. Non-shortest-form UTF-8 is sometimes exploited as a security flaw (and is 
NOT valid UTF-8, by rule), and the validate() method inside Text prevents 
non-shortest sequences from entering Text objects.
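For illustration, here is a rough sketch of the kind of non-shortest-form check being described (this is not Text's actual validate() code). An overlong sequence encodes a code point in more bytes than the minimum required, which the Unicode rules forbid:

```java
public class ShortestForm {
    // Sketch: detect the overlong (non-shortest-form) UTF-8 lead-byte
    // patterns. Assumes the input is otherwise structurally valid UTF-8.
    static boolean hasOverlongSequence(byte[] b) {
        for (int i = 0; i < b.length; i++) {
            int c = b[i] & 0xFF;
            // Lead bytes 0xC0/0xC1 always start overlong 2-byte forms
            // of U+0000..U+007F.
            if (c == 0xC0 || c == 0xC1) return true;
            // 3-byte form starting 0xE0 needs second byte >= 0xA0.
            if (c == 0xE0 && i + 1 < b.length
                    && (b[i + 1] & 0xFF) < 0xA0) return true;
            // 4-byte form starting 0xF0 needs second byte >= 0x90.
            if (c == 0xF0 && i + 1 < b.length
                    && (b[i + 1] & 0xFF) < 0x90) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // 0xC0 0xAF is the classic overlong encoding of '/' used in
        // directory-traversal exploits.
        byte[] overlongSlash = { (byte) 0xC0, (byte) 0xAF };
        byte[] validSlash = { '/' };
        System.out.println(hasOverlongSequence(overlongSlash)); // true
        System.out.println(hasOverlongSequence(validSlash));    // false
    }
}
```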

I'd also note that non-UTF-8 data is pretty common, not usually because of 
unpaired surrogates, but rather because of bad encoding identification or 
because of mixed UTF-8 and binary data. Blowing chunks on that data is not a 
good choice as the default.

Providing for validation and tailorable error reporting, a la NIO, would 
actually be the best course of action. If the buffer isn't really UTF-8 and 
turns out to be an entirely different encoding (probably the most common 
encoding problem), sometimes you might want to catch it as an exception, but 
most often you'll be fine plunging ahead with U+FFFD (one per bad byte). For 
the reason cited above, I'd use the stricter Unicode rules for 
non-shortest-form UTF-8, but I certainly think that just throwing an exception 
is too strict.



> Text constructor can throw exception
> ------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Bryan Pendleton
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. 
> This time, it's better defined - constructing a Text from a string extracted 
> from Real World data makes the Text object constructor throw a 
> CharacterCodingException. This may be legit - I don't actually understand UTF 
> well enough to understand what's wrong with the supplied string. I'm 
> assembling a series of strings, some of which are user-supplied, and 
> something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual 
> data someplace - I need the container to *do* it. If user-supplied inputs 
> can't be stored as a "UTF" aware text value, then another container needs to 
> be brought into existence. Sure, I can use a BytesWritable, but, as its name 
> implies - Text should handle "text". If Text is supposed to == 
> "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks behind the bleeding edge at this point, so 
> maybe my particular Text bug has been fixed, though the only fixes to Text I 
> see are adopting it into more of the internals of Hadoop. This argument goes 
> double in that case - if we're using Text objects internally, it should 
> really be a totally solid object - construct one from a String, get one back, 
> but _never_  throw a content-related Exception. Or, if Text is not the right 
> object because it's data-sensitive, then I argue we shouldn't use it in any 
> case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
