[ http://issues.apache.org/jira/browse/HADOOP-302?page=comments#action_12420227 ]

Doug Cutting commented on HADOOP-302:
-------------------------------------

Re String comparison: The bug here is with Java.  Since we wish to keep our 
persistent data structures language-independent, we should order by UTF-8, not 
UTF-16.

The javadoc is confusing:

http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#compareTo(java.lang.String)

It says it compares Unicode characters, when in fact it compares UTF-16 code units.

So any code that orders by Java String and expects things to align with the 
Hadoop Text class will be buggy when processing text with surrogate pairs.  We 
should make this clear in the javadoc.

Does this sound reasonable?
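
To make the mismatch concrete, here is a minimal sketch, assuming a hand-written 
unsigned byte comparison as a stand-in for ordering by UTF-8. The class and 
method names (Utf8OrderDemo, compareUtf8) are made up for illustration and are 
not part of Hadoop. It compares a BMP character (U+FF5E) against a 
supplementary character (U+10000), which Java stores as a surrogate pair.

```java
import java.nio.charset.StandardCharsets;

class Utf8OrderDemo {
    // Hypothetical helper: order two strings by their UTF-8 encodings,
    // comparing bytes as unsigned values.
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xff) - (y[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";                 // U+FF5E, one UTF-16 code unit
        String supplementary = "\uD800\uDC00"; // U+10000, a surrogate pair

        // String.compareTo looks at UTF-16 code units: 0xFF5E vs 0xD800.
        System.out.println("compareTo sign:   "
                + Integer.signum(bmp.compareTo(supplementary)));

        // UTF-8 bytes: EF BD 9E vs F0 90 80 80, so the sign flips.
        System.out.println("compareUtf8 sign: "
                + Integer.signum(compareUtf8(bmp, supplementary)));
    }
}
```

The two comparisons disagree for this pair: the UTF-16 comparison places the 
surrogate pair first, while the UTF-8 byte comparison places it last, which 
matches code point order.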

> class Text (replacement for class UTF8) was: HADOOP-136
> -------------------------------------------------------
>
>          Key: HADOOP-302
>          URL: http://issues.apache.org/jira/browse/HADOOP-302
>      Project: Hadoop
>         Type: Improvement

>   Components: io
>     Reporter: Michel Tourn
>     Assignee: Hairong Kuang

>
> Just to verify, which length-encoding scheme are we using for class Text (aka 
> LargeUTF8)? 
> a) The "UTF-8/Lucene" scheme (the highest bit of each byte is an extension bit, 
> which I think is what Doug is describing in his last comment), or 
> b) the record-IO scheme in o.a.h.record.Utils.java:readInt? 
> Either way, note that: 
> 1. UTF8.java and its successor Text.java need to read the length in two ways: 
>   1a. consume 1+ bytes from a DataInput and 
>   1b. parse the length within a byte array at a given offset 
> (1b is used for the "WritableComparator optimized for UTF8 keys"). 
> o.a.h.record.Utils only supports the DataInput mode. 
> It is not clear to me what the best way is to extend this Utils code when you 
> need to support both reading modes. 
> 2. Methods like UTF8's WritableComparator are meant to be low-overhead; in 
> particular, there should be no Object allocation. 
> For the byte array case, the varlen-reader utility needs to be extended to 
> return both 
>  the decoded length and the length of the encoded length 
>  (so that the caller can do offset += encodedlength). 
>     
> 3. A String length does not need (small) negative integers. 
> 4. One advantage of a) is that it is standard (or at least well-known and 
> natural) and there are no magic constants (like -120, -121, -124). 
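
A rough sketch of how points 1b and 2 might look under scheme a), assuming the 
usual extension-bit layout (7 data bits per byte, low-order group first, high 
bit set when another byte follows). The names VIntBytes, readVInt, vIntSize and 
writeVInt are hypothetical and are not taken from Hadoop or Lucene. Using two 
static methods that return primitives is one way to hand back both the decoded 
length and the encoded size without allocating an Object.

```java
class VIntBytes {
    // Decode the length stored at buf[offset]. Each byte contributes its low
    // 7 bits; a set high bit means another byte follows.
    static int readVInt(byte[] buf, int offset) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = buf[offset++];
            value |= (b & 0x7f) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    // Number of bytes occupied by the encoded length at buf[offset].
    static int vIntSize(byte[] buf, int offset) {
        int n = 1;
        while ((buf[offset++] & 0x80) != 0) {
            n++;
        }
        return n;
    }

    // Encode a non-negative value at buf[offset]; returns the bytes written.
    static int writeVInt(byte[] buf, int offset, int value) {
        int start = offset;
        while ((value & ~0x7f) != 0) {
            buf[offset++] = (byte) ((value & 0x7f) | 0x80);
            value >>>= 7;
        }
        buf[offset++] = (byte) value;
        return offset - start;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[8];
        writeVInt(buf, 0, 300);

        int offset = 0;
        int length = readVInt(buf, offset); // decoded string length
        offset += vIntSize(buf, offset);    // skip past the encoded length
        System.out.println("length=" + length + " offset=" + offset);
    }
}
```

The cost of this choice is that the prefix is scanned twice. A comparator that 
cares about that could fold both results into a single pass, at the price of a 
less obvious signature.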
