[ 
http://issues.apache.org/jira/browse/HADOOP-550?page=comments#action_12437958 ] 
            
Addison Phillips commented on HADOOP-550:
-----------------------------------------

If you want to have *text*, then you need to know the encoding and have some 
assurance that it is correct. A text buffer that contains random binary data 
isn't very useful: you can't do any useful *text* processing on it. The String 
class's behavior was modified after 1.4 so that, instead of silently emitting a 
null string (caused by the buried CharacterCodingException), it replaces bad 
sequences with U+FFFD characters. The String class is a bit 
lenient about this: it allows non-shortest form UTF-8 (that is, 0xC0 0x80 == 
U+0000 aka 'NULL'), while Text's validation routine does not permit this (it's 
a security flaw to process non-shortest form UTF-8). But String doesn't return 
the original bytes if the input buffer was bad. 
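
To illustrate the replacement behavior described above, here is a minimal 
sketch. The class and method names are hypothetical, and it assumes a JDK 
recent enough to provide java.nio.charset.StandardCharsets (Java 7 or later). 
It only demonstrates the U+FFFD substitution and the fact that the original 
bytes cannot be recovered from the resulting String.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Hadoop.
class LenientStringDecode {
    // Decodes bytes as UTF-8 the way the String constructor does:
    // malformed sequences are replaced rather than reported.
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] raw = { 'a', (byte) 0xFF, 'b' };
        String s = decode(raw);

        // The bad byte became U+FFFD at index 1.
        System.out.println(s.indexOf('\uFFFD'));            // prints 1

        // Re-encoding yields the bytes of U+FFFD, not the original 0xFF.
        byte[] roundTrip = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(raw, roundTrip));  // prints false
    }
}
```

The round trip is lossy: once the replacement happens, the caller can no 
longer tell what the damaged byte was.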

Either way, I think that Text should emulate this behavior and do replacements, 
although I note that Text objects constructed with buffers that use an encoding 
other than UTF-8 will just silently do unexpected or bad things (it doesn't 
matter if you use the new Text class or the old Utf8 class, it happens either 
way).

Using the ByteBuffer version of the validation method will help implement this.
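
As a rough sketch of the two policies being discussed (report the error or 
replace the bad sequence), the standard java.nio.charset.CharsetDecoder can 
be run over a ByteBuffer with either CodingErrorAction. This is not Text's 
actual validation routine; the class and method names below are hypothetical 
and only show the general shape.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not Hadoop's implementation.
class Utf8Check {
    // Strict policy: returns false if the buffer is not well-formed UTF-8.
    static boolean isValidUtf8(ByteBuffer buf) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(buf.duplicate()); // duplicate() leaves the caller's position alone
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Lenient policy: bad sequences become U+FFFD instead of raising an error.
    static String decodeWithReplacement(ByteBuffer buf) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .decode(buf.duplicate())
                .toString();
        } catch (CharacterCodingException e) {
            // Not expected when both actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ByteBuffer bad = ByteBuffer.wrap(new byte[] { 'a', (byte) 0xFF, 'b' });
        System.out.println(isValidUtf8(bad));           // prints false
        System.out.println(decodeWithReplacement(bad)); // 'a', U+FFFD, 'b'
    }
}
```

Which of the two a text container should use is exactly the design question 
raised here: the strict form surfaces the bug, the lenient form keeps the job 
running at the cost of altering the data.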

Users may not be happy to have their binary data buffers being "modified" by 
the Text class. But I'd maintain that their original records are *not* text 
records if they contain damaged data. A lot of "mostly-ASCII" buffers are 
really in Latin-1, but work okay as UTF-8 until you encounter a non-ASCII 
character. The Text class, as a wrapper around a Unicode text buffer, can 
identify these cases (where the user has misidentified the encoding). This is 
usually a bug somewhere else (your data was written using a default 
OutputStreamWriter rather than one with UTF-8, for example). Something is 
wrong: the class should not perform questionable operations on the data. It 
could warn the programmer (Exception) or do something to prevent relatively 
worse results (replace silently).
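
The "mostly-ASCII" failure mode can be reproduced with a small sketch. It 
uses ISO-8859-1 explicitly to stand in for a platform default encoding (an 
assumption, since the real default varies by system), and the class and 
method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Hadoop.
class MisidentifiedEncoding {
    // Writes text through an OutputStreamWriter using the given charset.
    static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Strict check: true only if the bytes are well-formed UTF-8.
    static boolean decodesAsUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        // Pure ASCII is byte-identical in Latin-1 and UTF-8, so it passes.
        System.out.println(decodesAsUtf8(write("cafe", latin1)));       // prints true
        // One non-ASCII character (e-acute, byte 0xE9) exposes the mismatch.
        System.out.println(decodesAsUtf8(write("caf\u00e9", latin1)));  // prints false
        // The same text written with an explicit UTF-8 writer is fine.
        System.out.println(decodesAsUtf8(
            write("caf\u00e9", StandardCharsets.UTF_8)));               // prints true
    }
}
```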

If what you really want is not a "text buffer" but just a byte[] or bit-bucket, 
don't use a Text object for it. That isn't what it is for. If you have a buffer 
that produces errors, you probably need to provide an encoding to convert the 
buffer or debug why the buffer contains non-UTF-8 in the first place.
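
When the real encoding of the buffer is known, converting it before handing 
it to a UTF-8 text container is a short step. A minimal sketch, with a 
hypothetical helper name and ISO-8859-1 assumed as the source encoding in 
the example:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Hadoop.
class Transcode {
    // Decodes the bytes with their actual encoding, then re-encodes as UTF-8.
    static byte[] toUtf8(byte[] raw, Charset actualEncoding) {
        return new String(raw, actualEncoding).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "caf" followed by e-acute in ISO-8859-1 (single byte 0xE9).
        byte[] latin1 = { 0x63, 0x61, 0x66, (byte) 0xE9 };
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length); // prints 5: e-acute becomes 0xC3 0xA9
    }
}
```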

> Text constructure can throw exception
> -------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Bryan Pendleton
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. 
> This time, it's better defined - constructing a Text from a string extracted 
> from Real World data makes the Text object constructor throw a 
> CharacterCodingException. This may be legit - I don't actually understand UTF 
> well enough to understand what's wrong with the supplied string. I'm 
> assembling a series of strings, some of which are user-supplied, and 
> something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual 
> data someplace - I need the container to *do* it. If user-supplied inputs 
> can't be stored as a "UTF" aware text value, then another container needs to 
> be brought into existence. Sure, I can use a BytesWritable, but, as its name 
> implies - Text should handle "text". If Text is supposed to == 
> "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks' back in the bleeding edge at this point, so 
> maybe my particular Text bug has been fixed, though the only fixes to Text I 
> see are adopting it into more of the internals of Hadoop. This argument goes 
> double in that case - if we're using Text objects internally, it should 
> really be a totally solid object - construct one from a String, get one back, 
> but _never_ throw a content-related Exception. Or, if Text is not the right 
> object because it's data-sensitive, then I argue we shouldn't use it in any 
> case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
