[ http://issues.apache.org/jira/browse/HADOOP-136?page=comments#action_12419627 ]
Michel Tourn commented on HADOOP-136:
-------------------------------------

FYI: some info on Java-modified UTF-8 (this was previously posted).

See "Modified UTF-8" in:
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

As far as I understand, the bottom line is that supplementary characters:
 o would be encoded as 4 bytes of UTF-8 in non-Java programs
 o but are already represented as two Java chars (a surrogate pair, i.e. two
   16-bit code units) when our converter code sees them
 o and so the conversion to UTF-8 proceeds on these two chars independently,
   producing 3 bytes for each.
 o So all the existing Java UTF-8 code that only handles 1..3-byte-wide chars
   is already compliant with Java-modified UTF-8.

What do the java-i18n experts think?

---
Earlier comment:

Concerning 4-byte-long UTF-8 characters: it seems that this situation does not
occur in Java-modified UTF-8. The 4-byte chars are represented as 3+3 bytes.
See "Modified UTF-8" in:
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

> Overlong UTF8's not handled well
> --------------------------------
>
>          Key: HADOOP-136
>          URL: http://issues.apache.org/jira/browse/HADOOP-136
>      Project: Hadoop
>         Type: Bug
>   Components: io
>     Versions: 0.2.0
>     Reporter: Dick King
>     Assignee: Michel Tourn
>     Priority: Minor
>      Fix For: 0.5.0
>  Attachments: largeutf8.patch
>
> When we feed an overlong string to the UTF8 constructor, three suboptimal
> things happen.
> First, we truncate to 0xffff/3 characters on the assumption that every
> character takes three bytes in UTF8. This can truncate strings that don't
> need it, and it can be overoptimistic since there are characters that render
> as four bytes in UTF8.
> Second, the code doesn't actually handle four-byte characters.
> Third, there's a behavioral discontinuity. If the string is "discovered" to
> be overlong by the arbitrary limit described above, we truncate with a log
> message; otherwise we signal a RuntimeException. One feels that both forms
> of truncation should be treated alike. However, this issue is concealed by
> the second issue; the exception will never be thrown because UTF8.utf8Length
> can't return more than three times the length of its input.
> I would recommend changing UTF8.utf8Length to let its caller know how many
> characters of the input string will actually fit if there's an overflow
> [perhaps by returning the negative of that number] and doing the truncation
> accurately as needed.
> -dk
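
To make the 3+3 point concrete, here is a minimal sketch. It is not the Hadoop UTF8 class and not the attached patch; the class and method names (ModifiedUtf8Sketch, encodedLength, charsThatFit, writeUtfPayloadLength) are made up for illustration. It only assumes the standard JDK behaviour of DataOutputStream.writeUTF, which writes Java-modified UTF-8 after a 2-byte length prefix. The charsThatFit helper is one possible reading of the "tell the caller how many characters fit" suggestion in the issue, under the assumption that each Java char is encoded independently.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only; not part of Hadoop.
class ModifiedUtf8Sketch {

    /** Bytes needed for one Java char (UTF-16 code unit) in Java-modified UTF-8. */
    static int encodedLength(char c) {
        if (c >= 0x0001 && c <= 0x007F) return 1;
        if (c <= 0x07FF) return 2;   // also covers U+0000, which is written as 2 bytes
        return 3;                    // includes each half of a surrogate pair
    }

    /** Number of leading chars of s whose encoding fits within maxBytes. */
    static int charsThatFit(String s, int maxBytes) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            int n = encodedLength(s.charAt(i));
            if (bytes + n > maxBytes) return i;  // char i would overflow the budget
            bytes += n;
        }
        return s.length();
    }

    /** Size of the modified UTF-8 payload that DataOutputStream.writeUTF produces. */
    static int writeUtfPayloadLength(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            return buf.size() - 2;   // writeUTF prefixes a 2-byte length
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+10400 is a supplementary character: one code point, two Java chars.
        String supplementary = new String(Character.toChars(0x10400));
        System.out.println("Java chars:         " + supplementary.length());
        System.out.println("standard UTF-8:     "
                + supplementary.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("modified UTF-8:     " + writeUtfPayloadLength(supplementary));
        System.out.println("chars fitting in 5: " + charsThatFit(supplementary, 5));
    }
}
```

With a byte budget of 0xffff, charsThatFit(s, 0xffff) would give the exact truncation point instead of the blanket 0xffff/3 estimate. Note that this sketch can cut between the two halves of a surrogate pair; whether that is acceptable is a separate decision.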