> On Sep 8, 2016, at 9:59 AM, Owen O'Malley <omal...@apache.org> wrote:
>
> Ok, Prasanth found a problem with my proposed approach. In particular, the
> old readers would misinterpret bloom filters from new files. Therefore, I'd
> like to propose a more complicated solution:
> 1. We extend the stripe footer or bloom filter index to record the default
> encoding when we are writing a string or decimal bloom filter.
> 2. When reading a bloom filter, we use the encoding if it is present.

Does that mean that you always write with the platform encoding? This would make using bloom filters for reads in other programming languages difficult, because you would need to transcode from UTF_8 to some arbitrary character encoding. It would also make using these bloom filters in performance-critical sections (join loops) computationally expensive, as you have to do a transcode.

Also, I think the spec needs to be clarified. The spec does not state the character encoding of the bloom filters. I assumed it was UTF_8 to match the normal string column encoding. It looks like the spec does not document the meaning of "the version of the writer" and what workarounds are necessary (or what operating assumptions have been made). Once we have that, we should document that old readers assume that the platform default charset is consistent for readers and writers.

As an alternative, for new files we could add a new stream ID, so the old readers skip them.

-dain
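
P.S. To make the encoding concern concrete, here is a minimal sketch (not ORC code; the class and helper names are made up for illustration) showing that the same string produces different bytes under different charsets. Any bloom filter that hashes those bytes would therefore see different keys unless reader and writer agree on the encoding. ISO-8859-1 stands in here for "some platform default"; the actual default depends on the JVM and OS.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from ORC.
public class BloomEncodingSketch {

    // The bytes a bloom filter would hash for a string key under a given charset.
    static byte[] keyBytes(String value, Charset charset) {
        return value.getBytes(charset);
    }

    // True if both charsets produce identical key bytes for this value.
    static boolean sameKey(String value, Charset a, Charset b) {
        return Arrays.equals(keyBytes(value, a), keyBytes(value, b));
    }

    public static void main(String[] args) {
        String ascii = "cafe";
        String accented = "caf\u00e9"; // "cafe" with an acute accent on the e

        // Pure ASCII encodes identically in UTF-8 and ISO-8859-1.
        System.out.println(sameKey(ascii, StandardCharsets.UTF_8,
                                   StandardCharsets.ISO_8859_1));

        // The accented character is two bytes in UTF-8 and one in ISO-8859-1,
        // so a filter written with one encoding is probed with the wrong bytes.
        System.out.println(sameKey(accented, StandardCharsets.UTF_8,
                                   StandardCharsets.ISO_8859_1));

        // What the running JVM would use if no charset were specified.
        System.out.println(Charset.defaultCharset());
    }
}
```

A reader that has the column value as UTF_8 bytes would have to decode and re-encode every probe to match a filter written this way, which is the transcode cost mentioned above.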