[ http://issues.apache.org/jira/browse/HADOOP-550?page=comments#action_12437958 ]

Addison Phillips commented on HADOOP-550:
-----------------------------------------
If you want to have *text*, then you need to know the encoding and have some assurance that it is correct. A text buffer that contains random binary data isn't very useful: you can't do any useful *text* processing on it.

The String class's behavior was modified post 1.4 so that instead of silently emitting a null string (caused by the buried CharacterCodingException), it replaces bad sequences with U+FFFD characters. The String class is a bit lenient about this: it allows non-shortest form UTF-8 (that is, 0xC0 0x80 == U+0000 aka 'NULL'), while Text's validation routine does not permit this (it's a security flaw to process non-shortest form UTF-8). But String doesn't return the original bytes if the input buffer was bad.

Either way, I think that Text should emulate this behavior and do replacements, although I note that Text objects constructed with buffers that use an encoding other than UTF-8 will just silently do unexpected or bad things (it doesn't matter if you use the new Text class or the old UTF8 class, it happens either way). Using the ByteBuffer version of the validation method will help implement this.

Users may not be happy to have their binary data buffers being "modified" by the Text class. But I'd maintain that their original records are *not* text records if they contain damaged data. A lot of "mostly-ASCII" buffers are really in Latin-1, but work okay as UTF-8 until you encounter a non-ASCII character. The Text class, as a wrapper around a Unicode text buffer, can identify these cases (where the user has misidentified the encoding). This is usually a bug somewhere else (your data was written using a default OutputStreamWriter rather than one with UTF-8, for example). Something is wrong: the class should not perform questionable operations on the data. It could warn the programmer (Exception) or do something to prevent relatively worse results (replace silently).
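To make the two options concrete (report with an exception, or replace silently), here is a minimal sketch using the JDK's own java.nio.charset decoder. It is not Hadoop's Text code; the class name Utf8DecodeSketch and its method names are hypothetical, made up for illustration. The sample input is a "mostly-ASCII" Latin-1 buffer of the kind described above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Hadoop.
final class Utf8DecodeSketch {

    // Strict decoding: malformed input raises CharacterCodingException.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Lenient decoding: each malformed sequence becomes U+FFFD.
    static String decodeLenient(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected when both actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    // Convenience check built on the strict decoder.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "caf" followed by 0xE9, which is e-acute in Latin-1 but an
        // incomplete multi-byte lead byte when read as UTF-8.
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9};

        System.out.println("valid UTF-8? " + isValidUtf8(latin1));
        System.out.println("replacement char at end? "
                + decodeLenient(latin1).equals("caf\uFFFD"));
    }
}
```

Either policy makes the misidentified encoding visible. The strict path stops the job, while the lenient path keeps it running but gives up the original bytes.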
If what you really want is not a "text buffer" but just a byte[] or bit-bucket, don't use a Text object for it. That isn't what it is for. If you have a buffer that produces errors, you probably need to provide an encoding to convert the buffer or debug why the buffer contains non-UTF-8 in the first place.

> Text constructure can throw exception
> -------------------------------------
>
> Key: HADOOP-550
> URL: http://issues.apache.org/jira/browse/HADOOP-550
> Project: Hadoop
> Issue Type: Bug
> Reporter: Bryan Pendleton
>
> I finally got back around to moving my working code to using Text objects. And, once again, switching to Text (from UTF8) means my jobs are failing. This time, it's better defined - constructing a Text from a string extracted from Real World data makes the Text object constructor throw a CharacterCodingException. This may be legit - I don't actually understand UTF well enough to understand what's wrong with the supplied string. I'm assembling a series of strings, some of which are user-supplied, and something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual data someplace - I need the container to *do* it. If user-supplied inputs can't be stored as a "UTF" aware text value, then another container needs to be brought into existence. Sure, I can use a BytesWritable, but, as its name implies - Text should handle "text". If Text is supposed to == "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks back from the bleeding edge at this point, so maybe my particular Text bug has been fixed, though the only fixes to Text I see are adopting it into more of the internals of Hadoop. This argument goes double in that case - if we're using Text objects internally, it should really be a totally solid object - construct one from a String, get one back, but _never_ throw a content-related Exception. Or, if Text is not the right object because it's data-sensitive, then I argue we shouldn't use it in any case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.