[ 
http://issues.apache.org/jira/browse/HADOOP-550?page=comments#action_12437954 ] 
            
Hairong Kuang commented on HADOOP-550:
--------------------------------------

Currently, class Text is the default class for text inputs. Because Text allows 
only valid UTF8 bytes, if user inputs contain non-UTF8 bytes and the Text 
records are written back to output files, a side effect is that the non-UTF8 
bytes are replaced or dropped in the output files. Users may not be happy about 
this.
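
To make the side effect concrete, here is a minimal sketch using only the JDK 
(it does not use Hadoop's Text class, and the class name LossyRoundTrip is made 
up for illustration). It assumes the decoding step substitutes a replacement 
character for malformed bytes, which is what java.lang.String does; the exact 
behaviour inside Text may differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Hadoop.
public class LossyRoundTrip {
    // Decode the bytes as UTF-8 (malformed input becomes U+FFFD),
    // then encode the resulting string back to UTF-8 bytes.
    public static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] original = {'a', (byte) 0xFF, 'b'};
        byte[] written = roundTrip(original);
        // The bytes written back are no longer the bytes that were read.
        System.out.println("unchanged: " + Arrays.equals(original, written));
        System.out.println("lengths: " + original.length + " -> " + written.length);
    }
}
```

The single invalid byte comes back as the three-byte encoding of U+FFFD, so 
the output file no longer matches the input.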

An alternative is to allow non-UTF8 bytes in Text by removing the 
validation in the Text constructors. This gives users greater freedom to handle 
any invalid UTF8 bytes in their own code and allows them to write the 
original text records back to output files. 
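
As a rough sketch of this alternative (RawTextSketch is a hypothetical class, 
not the actual Text implementation or a proposed patch), a container could 
store the bytes without validating them and leave strict decoding as an 
explicit step that the caller chooses to perform:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical container: accepts any bytes, validates only on request.
public class RawTextSketch {
    private final byte[] bytes;

    // No validation here, so construction never fails on content.
    public RawTextSketch(byte[] bytes) {
        this.bytes = bytes.clone();
    }

    // The original bytes, suitable for writing back to an output file.
    public byte[] getBytes() {
        return bytes.clone();
    }

    // Strict decode: throws if the bytes are not well-formed UTF-8,
    // letting user code decide how to handle the bad record.
    public String decodeStrict() throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }
}
```

With this shape, a record containing invalid bytes can still be read and 
written back unchanged, and only code that calls decodeStrict() has to deal 
with a CharacterCodingException.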

> Text constructor can throw exception
> ------------------------------------
>
>                 Key: HADOOP-550
>                 URL: http://issues.apache.org/jira/browse/HADOOP-550
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Bryan Pendleton
>
> I finally got back around to moving my working code to using Text objects.
> And, once again, switching to Text (from UTF8) means my jobs are failing. 
> This time, it's better defined - constructing a Text from a string extracted 
> from Real World data makes the Text object constructor throw a 
> CharacterCodingException. This may be legit - I don't actually understand UTF 
> well enough to understand what's wrong with the supplied string. I'm 
> assembling a series of strings, some of which are user-supplied, and 
> something causes the Text constructor to barf.
> However, this is still completely unacceptable. If I need to stuff textual 
> data someplace - I need the container to *do* it. If user-supplied inputs 
> can't be stored as a "UTF" aware text value, then another container needs to 
> be brought into existence. Sure, I can use a BytesWritable, but, as its name 
> implies - Text should handle "text". If Text is supposed to == 
> "StringWritable", then, well, it doesn't, yet.
> I admit to being a few weeks behind the bleeding edge at this point, so 
> maybe my particular Text bug has been fixed, though the only fixes to Text I 
> see are adopting it into more of the internals of Hadoop. This argument goes 
> double in that case - if we're using Text objects internally, it should 
> really be a totally solid object - construct one from a String, get one back, 
> but _never_ throw a content-related Exception. Or, if Text is not the right 
> object because it's data-sensitive, then I argue we shouldn't use it in any 
> case where data might kill it - internal, or anywhere else (by default).
> Please, don't remove UTF8, for now.
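
For reference, one way a Java String can fail strict UTF-8 encoding is an 
unpaired surrogate. The sketch below uses only the JDK; StrictEncode is a 
made-up name, and whether this is what happened with the reporter's data is 
an assumption, since the issue does not say what the offending string 
contained.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop.
public class StrictEncode {
    // Returns true if the string can be encoded as UTF-8 without
    // replacing or dropping anything, false otherwise.
    public static boolean encodes(String s) {
        try {
            StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A high surrogate with no following low surrogate.
        String broken = "a" + (char) 0xD800 + "b";
        System.out.println(encodes("hello"));
        System.out.println(encodes(broken));
    }
}
```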


        
