One more wrinkle on this that I just discovered is that I inadvertently fixed the problem for strings (but not decimals) as part of HIVE-12055, which was released in Hive 2.1. So that means we currently have two versions out there:

    WriterVersion     String charset    Decimal charset
    < HIVE-12055      JVM default       JVM default
    >= HIVE-12055     UTF-8             JVM default

So I'm going ahead with the BLOOM_FILTER_UTF8 stream (and optionally the old BLOOM_FILTER stream), but I'll make the reader recognize WriterVersion >= HIVE-12055 and trust BLOOM_FILTER for string bloom filters, but not for the decimal ones.

.. Owen
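To make the resulting reader behavior concrete, here is a minimal, self-contained sketch of the rule above: trust the old BLOOM_FILTER stream only where the writer version guarantees the values were already hashed as UTF-8. All of the names (ColumnKind, WriterVersion, trustOldBloomFilter) are invented for illustration and are not the actual ORC reader API.

    // Illustrative sketch only -- these types are stand-ins, not the real ORC classes.
    public class BloomFilterStreamChoice {

      enum ColumnKind { STRING, CHAR, VARCHAR, DECIMAL, OTHER }

      /** Writer versions in the order the fixes shipped (illustrative subset). */
      enum WriterVersion { ORIGINAL, HIVE_12055 }

      /**
       * Can the old BLOOM_FILTER stream be trusted for this column, given the
       * version of the writer that produced the file?
       */
      static boolean trustOldBloomFilter(ColumnKind column, WriterVersion writer) {
        switch (column) {
          case STRING:
          case CHAR:
          case VARCHAR:
            // HIVE-12055 made the writer hash string values as UTF-8 bytes,
            // so files written at or after that version are safe.
            return writer.compareTo(WriterVersion.HIVE_12055) >= 0;
          case DECIMAL:
            // Decimal values were always hashed with the JVM default charset,
            // so the old stream is never portable for them.
            return false;
          default:
            // Other column types never had a charset problem.
            return true;
        }
      }
    }

A reader that also understands the new stream would check for BLOOM_FILTER_UTF8 first and consult a rule like this only when falling back to BLOOM_FILTER on older files.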
On Thu, Sep 8, 2016 at 7:25 PM, Dain Sundstrom <d...@iq80.com> wrote:

> Sounds good to me.
>
> Should we add a version field to the BLOOM_FILTER_UTF8 to deal with any
> future problems?
>
> One other thought, in the protobuf definition I think it would be more
> efficient to have the bitset encoded as a byte[] to avoid the boxed long
> array.
>
> -dain
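As a rough illustration of the byte[] point (not a proposal for the actual protobuf field layout), the in-memory long[] bit set could be flattened into a byte[] with a fixed byte order and carried in a single bytes field, instead of a repeated 64-bit field that surfaces as a boxed List<Long> in the generated Java:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    // Sketch only: flatten a bloom filter's long[] bit set into bytes (and back)
    // so it could be carried in one protobuf "bytes" field. The byte order is
    // pinned so Java and C++ readers interpret it identically.
    public final class BitsetBytes {

      static byte[] toBytes(long[] bitset) {
        ByteBuffer buf = ByteBuffer.allocate(bitset.length * Long.BYTES)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        for (long word : bitset) {
          buf.putLong(word);
        }
        return buf.array();
      }

      static long[] fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        long[] bitset = new long[bytes.length / Long.BYTES];
        for (int i = 0; i < bitset.length; i++) {
          bitset[i] = buf.getLong();
        }
        return bitset;
      }
    }

Whatever layout is chosen, the byte order would have to be fixed in the spec so all readers agree on it.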
>
> On Sep 8, 2016, at 3:42 PM, Owen O'Malley <omal...@apache.org> wrote:
> >
> > Dain,
> >    That is a great point. I wasn't thinking about having to implement that
> > in C++, where it would really suck. (It was hard enough dealing with the
> > timezones in C++, so I should know better!) I got the approach from this
> > morning working and pushed it as a branch:
> > https://github.com/omalley/orc/commit/38752621863bf8dc1f05a6e7d34552969395e5f5
> >
> > The heaviest hammer would be to just create a new file version, but that
> > seems like overkill for this. (Although sooner or later we will get there,
> > since we need new encodings for decimal.)
> >
> > Ok, so how about:
> >
> > 1. We create a new stream kind (BLOOM_FILTER_UTF8) that always has UTF-8
> > based bloom filters. It will be used for strings, chars, varchars, and
> > decimals. (Hopefully all charsets use the same bytes for ASCII characters,
> > but I don't want to find the strange exceptions.)
> >
> > 2. We create a new config knob that lets you write both BLOOM_FILTER and
> > BLOOM_FILTER_UTF8 for users while they are transitioning.
> >
> > 3. The reader prefers BLOOM_FILTER_UTF8, but will fall back to BLOOM_FILTER
> > if it is an old file.
> >
> > Thoughts?
> >
> > .. Owen
> >
> > On Thu, Sep 8, 2016 at 11:02 AM, Dain Sundstrom <d...@iq80.com> wrote:
> >
> >>> On Sep 8, 2016, at 9:59 AM, Owen O'Malley <omal...@apache.org> wrote:
> >>>
> >>> Ok, Prasanth found a problem with my proposed approach. In particular,
> >>> the old readers would misinterpret bloom filters from new files. Therefore,
> >>> I'd like to propose a more complicated solution:
> >>> 1. We extend the stripe footer or bloom filter index to record the default
> >>> encoding when we are writing a string or decimal bloom filter.
> >>> 2. When reading a bloom filter, we use the encoding if it is present.
> >>
> >> Does that mean that you always write with the platform encoding? This
> >> would make using bloom filters for reads in other programming languages
> >> difficult, because you would need to do a UTF-8 to some arbitrary character
> >> encoding conversion. This will also make using these bloom filters in
> >> performance-critical sections (join loops) computationally expensive, as
> >> you have to do a transcode.
> >>
> >> Also, I think the spec needs to be clarified. The spec does not state the
> >> character encoding of the bloom filters. I assumed it was UTF-8 to match
> >> the normal string column encoding. It looks like the spec does not
> >> document the meaning of "the version of the writer" and what workarounds
> >> are necessary (or what operating assumptions have been made). Once we have
> >> that, we should document that old readers assume that the platform default
> >> charset is consistent between readers and writers.
> >>
> >> As an alternative, for new files we could add a new stream ID, so the
> >> old readers skip them.
> >>
> >> -dain
> >
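To make the transcoding cost Dain describes concrete, here is a rough sketch of the two probe paths, assuming a reader that already holds its keys as UTF-8 bytes; the BloomFilter interface and testBytes method are hypothetical stand-ins, not actual ORC classes:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    // Sketch only: contrasts probing a filter hashed with the writer's platform
    // charset against probing a UTF-8-defined filter.
    public final class ProbePaths {

      interface BloomFilter {
        boolean testBytes(byte[] value);
      }

      // Old stream: the UTF-8 key has to be decoded and re-encoded in the writer's
      // charset before hashing -- extra allocation and conversion work on every
      // lookup inside a join loop.
      static boolean probeOldFilter(BloomFilter filter, byte[] utf8Key, Charset writerCharset) {
        String decoded = new String(utf8Key, StandardCharsets.UTF_8);
        return filter.testBytes(decoded.getBytes(writerCharset));
      }

      // BLOOM_FILTER_UTF8 stream: the UTF-8 bytes are hashed as-is, with no per-key
      // conversion on the hot path.
      static boolean probeUtf8Filter(BloomFilter filter, byte[] utf8Key) {
        return filter.testBytes(utf8Key);
      }
    }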