[jira] Commented: (HADOOP-136) Overlong UTF8's not handled well

Michel Tourn (JIRA) Thu, 06 Jul 2006 17:14:20 -0700

    [ http://issues.apache.org/jira/browse/HADOOP-136?page=comments#action_12419627 ]


Michel Tourn commented on HADOOP-136:
-------------------------------------

FYI:
some info on Java-modified UTF-8 
(this was previously posted)
See Modified UTF-8 in: 
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/ 

As far as I understand:
The bottom line is that supplementary Unicode characters (those above U+FFFF):
o would be encoded as 4 bytes in standard UTF-8 by non-Java programs
o but they are already represented as two Java chars (a surrogate pair, i.e. 
two 16-bit code units) when our converter code sees them.
o and so the conversion to UTF-8 proceeds on these two chars independently, 
producing 3 bytes for each.
o So all the existing Java UTF-8 code that only handles 1..3-byte-wide chars is 
already compliant with Java-modified UTF-8.
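
To make the per-char argument concrete, here is a minimal sketch of a length 
computation that only knows about 1..3-byte chars. The class and method names 
are hypothetical and this is not the actual Hadoop UTF8 code; it assumes the 
Java-modified UTF-8 rules, where U+0000 takes two bytes and each surrogate 
char takes three.

```java
// Hypothetical helper for illustration only; not Hadoop's UTF8 class.
class ModifiedUtf8Length {

    /** Bytes needed for one Java char (one UTF-16 code unit) in modified UTF-8. */
    static int bytesFor(char c) {
        if (c >= 0x0001 && c <= 0x007F) {
            return 1;               // plain ASCII except NUL
        }
        if (c <= 0x07FF) {
            return 2;               // includes U+0000, written as two bytes
        }
        return 3;                   // everything else, including surrogate chars
    }

    /** Total encoded length, treating every char independently. */
    static int encodedLength(String s) {
        int total = 0;
        for (int i = 0; i < s.length(); i++) {
            total += bytesFor(s.charAt(i));
        }
        return total;
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef) is a supplementary character,
        // held in a Java String as a surrogate pair of two chars.
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());          // 2 chars
        System.out.println(encodedLength(clef));    // 6 bytes (3 + 3)
    }
}
```

No 4-byte branch is needed because the loop never sees a code point above 
U+FFFF, only the two surrogate chars that stand for it.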

What do the java-i18n experts think?

---
Earlier comment:

Concerning 4-byte-long UTF-8 characters: 
it seems that this situation does not occur in "Java-modified-UTF8" 

The 4-byte chars are represented as 3+3, i.e. two 3-byte sequences, one per 
surrogate char (6 bytes in total). 
See Modified UTF-8 in: 
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/ 
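
The 3+3 representation can be observed with the JDK itself. The sketch below 
compares DataOutputStream.writeUTF (which writes modified UTF-8 after a 2-byte 
length prefix) with String.getBytes using standard UTF-8. The class and helper 
names are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustration only: compares the two encodings of one supplementary character.
class SupplementaryEncodingDemo {

    /** Modified UTF-8 bytes as produced by writeUTF, without the length prefix. */
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length); // drop the 2-byte length
    }

    /** Standard UTF-8 bytes, as a non-Java program would typically produce. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String clef = "\uD834\uDD1E"; // U+1D11E as a surrogate pair
        System.out.println(hex(modifiedUtf8(clef))); // ED A0 B4 ED B4 9E
        System.out.println(hex(standardUtf8(clef))); // F0 9D 84 9E
    }
}
```

The two byte sequences are not interchangeable: a strict standard UTF-8 
decoder is expected to reject the 6-byte form, so data written this way is 
only safe to read back with a decoder that understands modified UTF-8.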


> Overlong UTF8's not handled well
> --------------------------------
>
>          Key: HADOOP-136
>          URL: http://issues.apache.org/jira/browse/HADOOP-136
>      Project: Hadoop
>         Type: Bug

>   Components: io
>     Versions: 0.2.0
>     Reporter: Dick King
>     Assignee: Michel Tourn
>     Priority: Minor
>      Fix For: 0.5.0
>  Attachments: largeutf8.patch
>
> When we feed an overlong string to the UTF8 constructor, three suboptimal 
> things happen.
> First, we truncate to 0xffff/3 characters on the assumption that every 
> character takes three bytes in UTF8.  This can truncate strings that don't 
> need it, and it can be overoptimistic since there are characters that render 
> as four bytes in UTF8.
> Second, the code doesn't actually handle four-byte characters.
> Third, there's a behavioral discontinuity.  If the string is "discovered" to 
> be overlong by the arbitrary limit described above, we truncate with a log 
> message, otherwise we signal a RuntimeException.  One feels that both forms 
> of truncation should be treated alike.  However, this issue is concealed by 
> the second issue; the exception will never be thrown because UTF8.utf8Length 
> can't return more than three times the length of its input.
> I would recommend changing UTF8.utf8Length to let its caller know how many 
> characters of the input string will actually fit if there's an overflow 
> [perhaps by returning the negative of that number] and doing the truncation 
> accurately as needed.
> -dk
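
One possible reading of the recommendation above, as a rough sketch only: the 
class name, the maxBytes parameter and the surrogate-pair handling are 
assumptions for illustration and are not taken from the attached patch or from 
the existing UTF8 class.

```java
// Hypothetical sketch of a utf8Length that reports how many chars fit.
class Utf8LengthSketch {

    /** Largest length that fits in the 2-byte length prefix. */
    static final int MAX_BYTES = 0xFFFF;

    /**
     * Returns the encoded length in bytes if the whole string fits in maxBytes.
     * Otherwise returns the negative of the number of leading chars that fit.
     * Assumes maxBytes >= 3, so the overflow result is never an ambiguous 0.
     */
    static int utf8Length(String s, int maxBytes) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int n = (c >= 0x0001 && c <= 0x007F) ? 1 : (c <= 0x07FF) ? 2 : 3;
            if (bytes + n > maxBytes) {
                // Avoid cutting a surrogate pair in half: if the char that no
                // longer fits is a low surrogate, drop its high surrogate too.
                if (i > 0 && Character.isLowSurrogate(c)
                        && Character.isHighSurrogate(s.charAt(i - 1))) {
                    return -(i - 1);
                }
                return -i;
            }
            bytes += n;
        }
        return bytes;
    }

    public static void main(String[] args) {
        int result = utf8Length("hello", MAX_BYTES);
        if (result >= 0) {
            System.out.println("fits, " + result + " bytes");
        } else {
            System.out.println("truncate to " + (-result) + " chars");
        }
    }
}
```

With a helper of this shape the caller can truncate to exactly the number of 
chars that fit, instead of the fixed 0xffff/3 estimate, and treat every 
overflow the same way.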

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
