[GitHub] spark issue #19180: [SPARK-21967][CORE] org.apache.spark.unsafe.types.UTF8St...

original-brownbear Thu, 14 Sep 2017 05:34:46 -0700

Github user original-brownbear commented on the issue:

    https://github.com/apache/spark/pull/19180
  
    @srowen played a little with 
https://github.com/original-brownbear/spark/blob/string-compareto-benchmark-both/core/src/test/scala/org/apache/spark/benchmarks/UTF8StringBenchmark.scala
 
    
    and got:
    
    ###master
    
    ```sh
    String compareTo:                        Best/Avg Time(ms)    Rate(M/s)   
Per Row(ns)   Relative
    
------------------------------------------------------------------------------------------------
    2-7 byte                                       798 /  817         82.2      
    12.2       1.0X
    8-16 byte                                      827 /  851         79.3      
    12.6       1.0X
    16-32 byte                                     891 /  909         73.6      
    13.6       0.9X
    512-1024 byte                                 1311 / 1331         50.0      
    20.0       0.6X
    ```
    
    this branch
    
    ```sh
    String compareTo:                        Best/Avg Time(ms)    Rate(M/s)   
Per Row(ns)   Relative
    
------------------------------------------------------------------------------------------------
    2-7 byte                                       942 /  972         69.6      
    14.4       1.0X
    8-16 byte                                      460 /  475        142.3      
     7.0       2.0X
    16-32 byte                                     488 /  530        134.4      
     7.4       1.9X
    512-1024 byte                                  788 /  809         83.2      
    12.0       1.2X
    ```
    
    so as expected for very short strings we're taking a hit small hit, but are 
much faster for longer strings.
    
    We could do something like this to not take the hit for short strings (this 
is the fastest I could find, admittedly not as readable as before but this is 
the version with the minimum number of instructions as far as I can tell):
    
    ```java
      @Override
      public int compareTo(@Nonnull final UTF8String other) {
        int len = Math.min(numBytes, other.numBytes);
        if (len > 7) {
          return compareLarge(
              this.base, this.offset, numBytes, other.base, other.offset, 
other.numBytes, len);
        } else {
          return compareSmall(
              this.base, this.offset, numBytes, other.base, other.offset, 
other.numBytes, len
          );
        }
      }
    
      private static int compareLarge(Object base1, long off1, int len1, Object 
base2, long off2,
          int len2, int len) {
        int wordMax = len & ~7;
        for (int i = 0; i < wordMax; i += 8) {
          long left = getLong(base1, off1 + i);
          long right = getLong(base2, off2 + i);
          if (left != right) {
            if (IS_LITTLE_ENDIAN) {
              return UnsignedLongs.compare(Long.reverseBytes(left), 
Long.reverseBytes(right));
            } else {
              return UnsignedLongs.compare(left, right);
            }
          }
        }
        for (int i = wordMax; i < len; i++) {
          // In UTF-8, the byte should be unsigned, so we should compare them 
as unsigned int.
          int res = (Platform.getByte(base1, off1 + i) & 0xFF) -
              (Platform.getByte(base2, off2 + i) & 0xFF);
          if (res != 0) {
            return res;
          }
        }
        return compareSmall(base1, off1 + wordMax, len1, base2, off2 + wordMax, 
len2, len - wordMax);
      }
    
      private static int compareSmall(Object base1, long off1, int len1, Object 
base2, long off2,
          int len2, int len) {
        for (int i = 0; i < len; i++) {
          // In UTF-8, the byte should be unsigned, so we should compare them 
as unsigned int.
          int res = (Platform.getByte(base1, off1 + i) & 0xFF) -
              (Platform.getByte(base2, off2 + i) & 0xFF);
          if (res != 0) {
            return res;
          }
        }
        return len1 - len2;
      }
    ```
    
    this version comes out neutral for small strings too:
    
    ```sh
    String compareTo:                        Best/Avg Time(ms)    Rate(M/s)   
Per Row(ns)   Relative
    
------------------------------------------------------------------------------------------------
    2-7 byte                                       811 /  829         80.8      
    12.4       1.0X
    8-16 byte                                      453 /  466        144.7      
     6.9       1.8X
    16-32 byte                                     477 /  487        137.5      
     7.3       1.7X
    512-1024 byte                                  769 /  784         85.2      
    11.7       1.1X
    2-7 byte                                       811 /  830         80.8      
    12.4       1.0X
    ```
    
    added the run of the small comparison at the end again, to make sure we're 
not just cheating the JIT by keeping `compareTo` artificially small so long as 
we don't see any larger strings.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark issue #19180: [SPARK-21967][CORE] org.apache.spark.unsafe.types.UTF8St...

Reply via email to