Overlong UTF8's not handled well
--------------------------------

         Key: HADOOP-136
         URL: http://issues.apache.org/jira/browse/HADOOP-136
     Project: Hadoop
        Type: Bug

  Components: io  
    Reporter: Dick King
    Priority: Minor


When we feed an overlong string to the UTF8 constructor, three suboptimal 
things happen.

First, we truncate to 0xffff/3 characters on the assumption that every 
character takes three bytes in UTF8.  This can truncate strings that don't need 
it (an all-ASCII string takes only one byte per character), and it can be 
overoptimistic, since some characters encode as four bytes in UTF8.
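
To make the numbers concrete, here is a minimal sketch that measures encoded 
sizes with the JDK's standard UTF-8 encoder.  It does not call Hadoop's UTF8 
class; the class name Utf8Sizes and its helpers are made up for illustration.  
Note that in Java a four-byte character occupies two chars (a surrogate pair).

```java
import java.nio.charset.StandardCharsets;

final class Utf8Sizes {
    /** Bytes needed to encode s with the JDK's standard UTF-8 encoder. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Builds a string of n copies of c. */
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The character limit described above.
        System.out.println(0xffff / 3);               // prints 21845

        // 30000 ASCII chars exceed that limit but need only 30000 bytes,
        // comfortably below 0xffff (65535).
        System.out.println(utf8Bytes(repeat('a', 30000)));

        // U+1F600 lies outside the Basic Multilingual Plane: it is two Java
        // chars (a surrogate pair) and four bytes in standard UTF-8.
        String supp = new String(Character.toChars(0x1F600));
        System.out.println(supp.length() + " chars, " + utf8Bytes(supp) + " bytes");
    }
}
```
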

Second, the code doesn't actually handle four-byte characters.

Third, there's a behavioral discontinuity.  If the string is found to be 
overlong by the arbitrary limit described above, we truncate it and log a 
message; if it is instead found to be overlong by its encoded length, we throw 
a RuntimeException.  Both kinds of overflow should be treated alike.  However, 
this issue is currently masked by the second one: the exception will never be 
thrown, because UTF8.utf8Length can't return more than three times the length 
of its input.

I would recommend changing UTF8.utf8Length to let its caller know how many 
characters of the input string will actually fit if there's an overflow 
[perhaps by returning the negative of that number] and doing the truncation 
accurately as needed.
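
One possible shape for that recommendation, as a rough sketch only: the names 
Utf8LengthSketch, boundedUtf8Length and truncateToFit are hypothetical and are 
not the existing Hadoop API.  The sketch assumes standard UTF-8 sizes (four 
bytes for a surrogate pair), counts a lone surrogate as three bytes, and 
requires maxBytes >= 4 so that a negative result is never ambiguous with zero.

```java
final class Utf8LengthSketch {
    /**
     * Returns the UTF-8 byte length of s if it fits in maxBytes; otherwise
     * returns the negative of the number of leading chars that do fit.
     */
    static int boundedUtf8Length(String s, int maxBytes) {
        if (maxBytes < 4) {
            // With at least 4 bytes the first code point always fits, so an
            // overflow result is always strictly negative.
            throw new IllegalArgumentException("maxBytes must be >= 4");
        }
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int need;
            if (cp < 0x80) {
                need = 1;
            } else if (cp < 0x800) {
                need = 2;
            } else if (cp < 0x10000) {
                need = 3;   // also covers unpaired surrogates (assumption)
            } else {
                need = 4;   // supplementary code point, two Java chars
            }
            if (bytes + need > maxBytes) {
                return -i;  // i chars fit; never splits a surrogate pair
            }
            bytes += need;
            i += Character.charCount(cp);
        }
        return bytes;
    }

    /** Truncates s to the longest prefix that encodes within maxBytes. */
    static String truncateToFit(String s, int maxBytes) {
        int n = boundedUtf8Length(s, maxBytes);
        return n >= 0 ? s : s.substring(0, -n);
    }
}
```

With this shape the caller truncates in exactly one place, so the log-versus- 
exception split described above would no longer arise.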

-dk



