[ http://issues.apache.org/jira/browse/HADOOP-136?page=all ]
Michel Tourn updated HADOOP-136:
--------------------------------
Attachment: largeutf8.patch
Here is the patch. It adds two new files:
org.apache.hadoop.io.TestLargeUTF8
org.apache.hadoop.io.LargeUTF8
The only difference from the UTF8 string format is that
the length is stored in 4 bytes rather than 2 bytes.
TestLargeUTF8 tests serialization of larger strings, up to 1MB.
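
To illustrate the format change described above, here is a minimal sketch of a string serialized with a 4-byte length prefix. This is not the code from largeutf8.patch: the class name LengthPrefixedStringSketch and its methods are hypothetical, and it assumes the standard Java UTF-8 encoder rather than whatever encoding loop the UTF8 class uses internally.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not the LargeUTF8 class from the patch.
final class LengthPrefixedStringSketch {

    // Writes a 4-byte big-endian byte count, then the encoded bytes.
    // A 2-byte prefix would cap the payload at 0xffff bytes; an int
    // prefix raises the cap to Integer.MAX_VALUE bytes.
    static void write(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Reads the 4-byte count, then exactly that many bytes.
    static String read(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("negative length: " + length);
        }
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Serializes to an in-memory buffer and reads the value back.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            write(new DataOutputStream(buffer), s);
            return read(new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling LengthPrefixedStringSketch.roundTrip on a 1MB string should return an equal string, which is roughly the kind of check a test of larger strings would make.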
> Overlong UTF8's not handled well
> --------------------------------
>
> Key: HADOOP-136
> URL: http://issues.apache.org/jira/browse/HADOOP-136
> Project: Hadoop
> Type: Bug
> Components: io
> Reporter: Dick King
> Priority: Minor
> Attachments: largeutf8.patch
>
> When we feed an overlong string to the UTF8 constructor, three suboptimal
> things happen.
> First, we truncate to 0xffff/3 characters on the assumption that every
> character takes three bytes in UTF8. This can truncate strings that don't
> need it, and it can be overoptimistic since there are characters that render
> as four bytes in UTF8.
> Second, the code doesn't actually handle four-byte characters.
> Third, there's a behavioral discontinuity. If the string is "discovered" to
> be overlong by the arbitrary limit described above, we truncate with a log
> message, otherwise we signal a RuntimeException. One feels that both forms
> of truncation should be treated alike. However, this issue is concealed by
> the second issue; the exception will never be thrown because UTF8.utf8Length
> can't return more than three times the length of its input.
> I would recommend changing UTF8.utf8Length to let its caller know how many
> characters of the input string will actually fit if there's an overflow
> [perhaps by returning the negative of that number] and doing the truncation
> accurately as needed.
> -dk
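
The recommendation at the end of the quoted report could look roughly like the sketch below. It is only an illustration under stated assumptions: Utf8LengthSketch and utf8LengthOrFit are hypothetical names, not the existing UTF8.utf8Length, and the byte sizes follow standard UTF-8 (so a surrogate pair counts as one four-byte character, and a lone surrogate is counted as three bytes).

```java
// Hypothetical illustration; not the existing UTF8.utf8Length.
final class Utf8LengthSketch {

    // Returns the UTF-8 byte length of s when it fits in maxBytes.
    // On overflow, returns the negative of the number of leading Java
    // chars that do fit, so the caller can truncate exactly there.
    // Caveat of this convention: if no chars fit, the result is 0,
    // which is indistinguishable from an empty string.
    static int utf8LengthOrFit(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // Standard UTF-8 sizes: 1, 2, 3 or 4 bytes per code point.
            int size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if ((long) bytes + size > maxBytes) {
                return -i; // i always sits on a code point boundary
            }
            bytes += size;
            i += Character.charCount(cp); // 2 for a surrogate pair
        }
        return bytes;
    }
}
```

With a 2-byte length field the caller would pass 0xffff as maxBytes and, on a negative result r, keep only s.substring(0, -r). Because the count advances by whole code points, the cut never splits a surrogate pair.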