GitHub user omalley commented on the issue:

    https://github.com/apache/orc/pull/169
  
    Ok, starting with the representation. I'd suggest it look like:
    
        message StringStatistics {
          optional string minimum = 1;
          optional string maximum = 2;
          // sum will store the total length of all strings in a stripe
          optional sint64 sum = 3;
          // If the minimum or maximum value was longer than 1024 bytes, store
          // a lower or upper bound instead of the minimum or maximum values above.
          optional string lowerBound = 4;
          optional string upperBound = 5;
        }
    
    Now obviously the lowerBound can just be the string truncated (at a UTF-8
    character boundary!) to at most 1024 bytes. The upperBound is the same
    truncated string with its last code point increased by one.
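
    A minimal sketch of that truncation, to make the boundary handling
    concrete. The class and method names are hypothetical, not existing ORC
    code. It assumes well-formed strings that are compared in code point
    order (equivalently, UTF-8 byte order), and it additionally skips the
    surrogate range and carries over when the last code point is already the
    largest one, which goes beyond what is described above:

```java
// Hypothetical helper, not part of ORC.
final class StringBounds {
  /** Number of bytes the code point occupies in UTF-8. */
  static int utf8Length(int codePoint) {
    if (codePoint < 0x80) return 1;
    if (codePoint < 0x800) return 2;
    if (codePoint < 0x10000) return 3;
    return 4;
  }

  /** Longest prefix of s that fits in maxBytes of UTF-8, cut between code points. */
  static String lowerBound(String s, int maxBytes) {
    int bytes = 0;
    int i = 0;
    while (i < s.length()) {
      int cp = s.codePointAt(i);
      int len = utf8Length(cp);
      if (bytes + len > maxBytes) break;
      bytes += len;
      i += Character.charCount(cp);
    }
    return s.substring(0, i);
  }

  /** Truncated prefix with its last code point increased by one; null if none exists. */
  static String upperBound(String s, int maxBytes) {
    String prefix = lowerBound(s, maxBytes);
    if (prefix.length() == s.length()) {
      return s; // nothing was cut, so the value is its own bound
    }
    int end = prefix.length();
    while (end > 0) {
      int last = prefix.codePointBefore(end);
      int start = end - Character.charCount(last);
      if (last < Character.MAX_CODE_POINT) {
        int next = last + 1;
        if (next >= Character.MIN_SURROGATE && next <= Character.MAX_SURROGATE) {
          next = Character.MAX_SURROGATE + 1; // surrogates are not valid in UTF-8
        }
        return new StringBuilder(prefix.substring(0, start))
            .appendCodePoint(next).toString();
      }
      end = start; // last code point cannot grow; drop it and bump the previous one
    }
    return null; // no finite upper bound of this form
  }
}
```

    Note that increasing the last code point can make it one byte longer in
    UTF-8 (for example 0x7F to 0x80), so the upper bound may slightly exceed
    the limit unless that case is handled separately.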
    
    In the StringStatisticsImpl, I'd keep two boolean flags recording whether
    the stored minimum and maximum are real values or approximations. The
    value comparison is the same, since unless the current value is less than
    the lower bound, it won't change the lower bound, and the same is true for
    the upper bound. If the new minimum/maximum is not truncated, the
    corresponding wasTruncated flag should be cleared. When merging, the flag
    follows the value. In the corner case of two identical values where one
    was truncated, the non-truncated one is the result.
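
    Roughly, the flag handling could look like the sketch below. The class,
    field, and method names are made up for illustration and this is not the
    actual StringStatisticsImpl. Callers are assumed to pass in values that
    are already truncated where needed, and String.compareTo stands in for
    whatever comparison the real implementation uses:

```java
// Hypothetical sketch, not the real StringStatisticsImpl.
final class BoundedStringStats {
  String lower;            // minimum, or only a lower bound if lowerTruncated is set
  boolean lowerTruncated;
  String upper;            // maximum, or only an upper bound if upperTruncated is set
  boolean upperTruncated;

  void updateLower(String candidate, boolean truncated) {
    int cmp = (lower == null) ? -1 : candidate.compareTo(lower);
    if (cmp < 0) {
      lower = candidate;
      lowerTruncated = truncated;   // the flag follows the value
    } else if (cmp == 0 && !truncated) {
      lowerTruncated = false;       // identical values: the non-truncated one wins
    }
  }

  void updateUpper(String candidate, boolean truncated) {
    int cmp = (upper == null) ? 1 : candidate.compareTo(upper);
    if (cmp > 0) {
      upper = candidate;
      upperTruncated = truncated;   // the flag follows the value
    } else if (cmp == 0 && !truncated) {
      upperTruncated = false;       // identical values: the non-truncated one wins
    }
  }

  /** Merging reuses the same rules, carrying the other side's flags along. */
  void merge(BoundedStringStats other) {
    if (other.lower != null) updateLower(other.lower, other.lowerTruncated);
    if (other.upper != null) updateUpper(other.upper, other.upperTruncated);
  }
}
```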
    
    We should end up with four methods for each:
    * String getMinimum();
    * String getLowerBound();
    * String getMaximum();
    * String getUpperBound();
    
    If we only have a lower bound, getMinimum should return null, and likewise
    getMaximum should return null if we only have an upper bound.
    getLowerBound and getUpperBound should match getMinimum and getMaximum if
    no truncation was done.

