Github user omalley commented on the issue:
https://github.com/apache/orc/pull/169
Ok, starting with the representation. I'd suggest it look like:
message StringStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  // sum will store the total length of all strings in a stripe
  optional sint64 sum = 3;
  // If the minimum or maximum value was longer than 1024 bytes, store
  // a lower or upper bound instead of the minimum or maximum values above.
  optional string lowerBound = 4;
  optional string upperBound = 5;
}
Now obviously the lowerBound can just be the string truncated (at a UTF-8
character boundary!) to at most 1024 bytes. The upperBound is the same truncated
string with its last code point increased by one.
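A rough sketch of that truncation, for illustration only and not code from the pull request. The class and method names (`StringBounds`, `truncateUtf8`, `upperBound`) are hypothetical, and so is the fallback of dropping a code point that cannot be increased without growing the encoding. The bound is meant under code point (UTF-8 byte) ordering.

```java
final class StringBounds {
    /** Number of bytes a code point takes in UTF-8. */
    static int utf8Length(int cp) {
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }

    /** Longest prefix of s that fits in maxBytes of UTF-8, cut on a code point boundary. */
    static String truncateUtf8(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int len = utf8Length(cp);
            if (bytes + len > maxBytes) break;   // the next code point would not fit
            bytes += len;
            i += Character.charCount(cp);
        }
        return s.substring(0, i);
    }

    /**
     * Upper bound for s: s itself when it fits, otherwise the truncated prefix
     * with its last code point increased by one. Assumption: if the increased
     * code point would be invalid or need more bytes, that code point is dropped
     * and the one before it is increased instead. Returns null if no bound exists.
     */
    static String upperBound(String s, int maxBytes) {
        String prefix = truncateUtf8(s, maxBytes);
        if (prefix.length() == s.length()) return s;   // nothing was cut
        int end = prefix.length();
        while (end > 0) {
            int cp = prefix.codePointBefore(end);
            int start = end - Character.charCount(cp);
            int next = cp + 1;
            if (next == 0xD800) next = 0xE000;         // skip the surrogate range
            if (next <= Character.MAX_CODE_POINT && utf8Length(next) == utf8Length(cp)) {
                return prefix.substring(0, start) + new String(Character.toChars(next));
            }
            end = start;                               // try the previous code point
        }
        return null;
    }
}
```

With a limit of 1024, a string of 2000 'a' characters gives a lower bound of 1024 'a's and an upper bound of 1023 'a's followed by 'b'.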
In the StringStatisticsImpl, I'd keep two boolean flags recording whether the
minimum and maximum are real values or approximations. The value comparison
is the same, since unless the current value is less than the lower bound, it
won't change the lower bound and the same is true for the upper bound. If the
new minimum/maximum is not truncated, the corresponding wasTruncated flag
should be cleared. When merging, the flag follows the value. In the corner case
of two identical values where one was truncated, the non-truncated one is the
result.
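The flag handling for the minimum side could look roughly like the sketch below. The class `MinTracker` and its method names are hypothetical and not taken from StringStatisticsImpl. It assumes each value has already been cut to the size limit before it is offered, and it uses `String.compareTo` as a stand-in for whatever ordering the writer actually uses. The maximum side would mirror it with the comparison reversed.

```java
final class MinTracker {
    private String lower;          // current minimum, or its truncated lower bound
    private boolean wasTruncated;  // true when `lower` is only a bound

    /** Offer a value; `truncated` says whether it had to be cut to the size limit. */
    void update(String candidate, boolean truncated) {
        if (lower == null) {
            lower = candidate;
            wasTruncated = truncated;
            return;
        }
        int cmp = candidate.compareTo(lower);
        if (cmp < 0) {
            // A smaller value replaces the bound, and the flag follows the value.
            lower = candidate;
            wasTruncated = truncated;
        } else if (cmp == 0 && !truncated) {
            // Identical values where one was truncated: the exact one wins.
            wasTruncated = false;
        }
    }

    /** Merging is the same rule applied to the other side's value and flag. */
    void merge(MinTracker other) {
        if (other.lower != null) {
            update(other.lower, other.wasTruncated);
        }
    }

    String value() { return lower; }
    boolean isTruncated() { return wasTruncated; }
}
```

Routing merge through the same comparison keeps the tie-breaking rule in one place.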
We should end up with four methods for each:
* String getMinimum();
* String getLowerBound();
* String getMaximum();
* String getUpperBound();
If we only have a lower bound, getMinimum should return null, and likewise
getMaximum if we only have an upper bound. getLowerBound and getUpperBound should
match getMinimum and getMaximum if no truncation was done.
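One way to read that accessor contract, as a minimal sketch. The class `StringStatsView` and its constructor are hypothetical; only the four getter names come from the list above.

```java
final class StringStatsView {
    private final String lower;
    private final boolean lowerTruncated;
    private final String upper;
    private final boolean upperTruncated;

    StringStatsView(String lower, boolean lowerTruncated,
                    String upper, boolean upperTruncated) {
        this.lower = lower;
        this.lowerTruncated = lowerTruncated;
        this.upper = upper;
        this.upperTruncated = upperTruncated;
    }

    /** Exact minimum, or null when only a lower bound is known. */
    String getMinimum() { return lowerTruncated ? null : lower; }

    /** Always available; equals the minimum when nothing was truncated. */
    String getLowerBound() { return lower; }

    /** Exact maximum, or null when only an upper bound is known. */
    String getMaximum() { return upperTruncated ? null : upper; }

    /** Always available; equals the maximum when nothing was truncated. */
    String getUpperBound() { return upper; }
}
```

A reader that only needs to prune by range can call the bound getters unconditionally, while a caller that needs the real value has to handle null.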
---