Yes, HIVE-7144 went in before HIVE-8732 so any file with WriterVersion >= 1 should be UTF-8 in the statistics.
.. Owen

On Tue, Jun 6, 2017 at 4:05 PM, Dain Sundstrom <[email protected]> wrote:

> Ah I see. I can’t believe I missed this fix :)
>
> Our reader was originally written in the 0.13 days, and used Strings
> for stats. This is the commit that changed everything to text and I
> believe it went out with Hive 0.14:
>
> https://github.com/apache/hive/commit/6072e3aed88d9246e1130abadf3c15a88e975b4e#diff-340d190f994d92658b24aae1edf610b3
>
> Is writer version "1 = HIVE-8732 fixed" after 0.14? If so I can update my
> reader to detect this.
>
> -dain
>
> > On Jun 6, 2017, at 3:36 PM, Owen O'Malley <[email protected]> wrote:
> >
> > On Tue, Jun 6, 2017 at 3:02 PM, Dain Sundstrom <[email protected]> wrote:
> >
> >> Is it required that the StringStatistics min and max be the actual min and
> >> max value for the column? I ask for two reasons, I’d like to be able to
> >> “trim” values if the min or max is very large. Also, as a workaround
> >> for the UTF-16be sorting problem (bug?), I’d like to trim values at the
> >> first surrogate pair, so the value is slightly smaller than the min or
> >> larger than the max, and still a valid UTF-8 sequence.
> >>
> >
> > I agree that we want to be able to trim the values. I've seen cases where
> > the String is huge (~100k) and makes the StringStatistics huge. I'd propose
> > that we do something like:
> >
> > message StringStatistics {
> >   optional string minimum = 1;
> >   optional string maximum = 2;
> >   // sum will store the total length of all strings in a stripe
> >   optional sint64 sum = 3;
> >   // if set, the minimum will not be set and the lowerBound <= all values
> >   optional string lowerBound = 4;
> >   // if set, the maximum will not be set and the upperBound >= all values
> >   optional string upperBound = 5;
> > }
> >
> > We shouldn't have any UTF16 in ORC. Is there a case where we compare
> > strings that way? In particular, the StringStatistics uses Text, which uses
> > UTF-8 as its encoding.
> >
> > .. Owen
> >
> >
> >> Thoughts?
> >>
> >> -dain
> >>
> >>
> >
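
For illustration, here is a minimal sketch of how a writer might derive the proposed lowerBound and upperBound from an oversized minimum or maximum. It is not ORC or Hive code: the class name, method names, and the maxBytes limit are all hypothetical, and it assumes well-formed input strings (no unpaired surrogates). It cuts on a code point boundary so the result stays valid UTF-8, and it compares by unsigned UTF-8 bytes (which matches code point order) rather than by Java's UTF-16 code units.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; none of these names come from ORC or Hive.
final class StringBoundsSketch {

    // Number of bytes needed to encode one code point in UTF-8.
    static int utf8Length(int cp) {
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }

    // Longest prefix of s, cut on a code point boundary, whose UTF-8
    // encoding fits in maxBytes.
    static String truncateUtf8(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int len = utf8Length(cp);
            if (bytes + len > maxBytes) break;
            bytes += len;
            i += Character.charCount(cp);
        }
        return s.substring(0, i);
    }

    // A prefix of the minimum sorts at or before it, so it is <= all values.
    static String lowerBound(String min, int maxBytes) {
        return truncateUtf8(min, maxBytes);
    }

    // Truncate the maximum, then bump the last code point so the result
    // sorts after every string that starts with the original prefix.
    // The bump can add one byte when it crosses an encoding-length boundary.
    // Returns null when no bound can be formed (caller should then omit it).
    static String upperBound(String max, int maxBytes) {
        String prefix = truncateUtf8(max, maxBytes);
        if (prefix.length() == max.length()) return max; // already fits
        int end = prefix.length();
        while (end > 0) {
            int cp = prefix.codePointBefore(end);
            int start = end - Character.charCount(cp);
            if (cp < Character.MAX_CODE_POINT) {
                int next = cp + 1;
                // Skip the surrogate range, which has no UTF-8 encoding.
                if (next >= 0xD800 && next <= 0xDFFF) next = 0xE000;
                return prefix.substring(0, start)
                        + new String(Character.toChars(next));
            }
            end = start; // last code point was U+10FFFF; drop it and retry
        }
        return null;
    }

    // Compare by unsigned UTF-8 bytes, not by UTF-16 code units.
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (d != 0) return d;
        }
        return x.length - y.length;
    }
}
```

With a 3-byte limit, lowerBound("abcdef", 3) gives "abc" and upperBound("abcdef", 3) gives "abd". A reader using such bounds could only use them to skip data, never to answer min/max queries exactly, since they are no longer actual column values.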
