Github user javadba commented on the pull request:

    https://github.com/apache/spark/pull/1586#issuecomment-50923836
  
    I am delayed in providing the next implementation because I am still investigating the issues below.
    
    1) The default encoding seems to have changed from JDK 6 (ISO-8859-1) to JDK 7 (UTF-8?).
    
    The no-argument getBytes method encodes with the platform default charset, so it returns substantially different results depending on which JDK is used:
    
    JDK 6:  "\u2345".getBytes.length   A: 3
    JDK 7:  "\u2345".getBytes.length   A: 1
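
To take the default charset out of the picture, the same string can be encoded with an explicit charset. The following is a minimal sketch, not code from the pull request: the class name EncodedLengthDemo and the helper encodedLength are made up for illustration, and StandardCharsets requires Java 7 or later (on JDK 6, Charset.forName("UTF-8") would be the equivalent).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Spark or Hive.
class EncodedLengthDemo {
    // Number of bytes produced when s is encoded with the given charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = "\u2345";
        // Depends on the JVM's default charset (file.encoding), so it can vary.
        System.out.println("default (" + Charset.defaultCharset() + "): "
                + s.getBytes().length);
        // U+2345 needs three bytes in UTF-8.
        System.out.println("UTF-8: " + encodedLength(s, StandardCharsets.UTF_8));
        // U+2345 is not representable in ISO-8859-1; it is replaced by one '?' byte.
        System.out.println("ISO-8859-1: "
                + encodedLength(s, StandardCharsets.ISO_8859_1));
    }
}
```

With an explicit charset the byte count no longer depends on the JVM: UTF-8 yields 3 bytes and ISO-8859-1 yields 1, so tests that pin the charset should behave the same on any JDK.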
    
    
    2) I have been absorbing the Hive binary/string implementations. The logic is short and easy to follow, but I am still working out how to test this properly. Takuya's (correct) comment points out that Hive's length supports both binary and string, so let us take a look at that implementation first:
    
    From org.apache.hadoop.hive.ql.udf.UDFLength:
    
        @Description(name = "length",
            value = "_FUNC_(str | binary) - Returns the length of str or number of bytes in binary data",
            extended = "Example:\n"
            + "  > SELECT _FUNC_('Facebook') FROM src LIMIT 1;\n" + "  8")
        @VectorizedExpressions({StringLength.class})
        public class UDFLength extends UDF {
          private final IntWritable result = new IntWritable();

          // String input: count UTF-8 start bytes, i.e. characters rather than bytes.
          public IntWritable evaluate(Text s) {
            if (s == null) {
              return null;
            }

            byte[] data = s.getBytes();
            int len = 0;
            for (int i = 0; i < s.getLength(); i++) {
              if (GenericUDFUtils.isUtfStartByte(data[i])) {
                len++;
              }
            }

            result.set(len);
            return result;
          }

          // Binary input: return the raw byte count.
          public IntWritable evaluate(BytesWritable bw) {
            if (bw == null) {
              return null;
            }
            result.set(bw.getLength());
            return result;
          }
        }
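
The string branch above relies on counting UTF-8 start bytes. Below is a standalone sketch of that idea without the Hadoop/Hive types. It assumes isUtfStartByte treats every byte that is not a continuation byte (bit pattern 10xxxxxx) as a start byte; the class and method names here are hypothetical, not Hive's.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch class, not part of Hive.
class Utf8LengthSketch {
    // True unless b is a UTF-8 continuation byte (bit pattern 10xxxxxx).
    static boolean isUtf8StartByte(byte b) {
        return (b & 0xC0) != 0x80;
    }

    // String-style length: one per start byte among the first `length` bytes.
    static int stringLength(byte[] utf8, int length) {
        int len = 0;
        for (int i = 0; i < length; i++) {
            if (isUtf8StartByte(utf8[i])) {
                len++;
            }
        }
        return len;
    }

    public static void main(String[] args) {
        // Encode explicitly as UTF-8, independent of the JVM default charset.
        byte[] data = "Facebook\u2345".getBytes(StandardCharsets.UTF_8);
        System.out.println(stringLength(data, data.length)); // string length: 9
        System.out.println(data.length);                     // binary length: 11
    }
}
```

Under this assumption the string length counts code points, while the binary length is simply the number of encoded bytes, which is why the two overloads can disagree for non-ASCII input.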
    
    So in Hive the invocations of length would be:
    
    String: select length(my_string) from some_table;
    binary: select length(cast(my_string as binary)) from some_table;
    
    As noted above, the result will likely not be consistent across all Hive installations: in particular, Hive on JDK 6 should give a different answer (I only have JDK 7 in my testing environment). I am still pondering how to handle these differences.
    


