+1 to bumping the writer version so that PPD (predicate push down) works correctly for files written by older versions. Alan - PPD will have to look at the writer version to detect old files. Newer files will have ORC-101 as their writer version.
Thanks
Prasanth

On Wed, Sep 7, 2016 at 1:12 PM -0500, "Alan Gates" <alanfga...@gmail.com> wrote:

I think using the default encoding for the old files is the best option, as it will be right 99% of the time. I was wondering how the system would know whether or not this was an old file.

Alan.

> On Sep 7, 2016, at 10:06, Owen O'Malley wrote:
>
> 4 is about when you are using the bloom filter for predicate push down. I'm
> saying old files should use the default encoding when checking the bloom
> filter. The other option is to always have the predicate push down say
> maybe if the file is an old one.
>
> .. Owen
>
> On Wed, Sep 7, 2016 at 9:34 AM, Alan Gates wrote:
>
>> +1 to 1-3. On 4, what do you mean by test? Assume it’s the default
>> encoding and use that? Is there a versioning concept in the bloom filters
>> that will make it easy to determine if this is pre or post ORC-101?
>>
>> Alan.
>>
>>> On Sep 7, 2016, at 08:57, Owen O'Malley wrote:
>>>
>>> All,
>>> Dain Sundstrom pointed out to me in personal email that the ORC bloom
>>> filters are currently using the default character encoding. That makes the
>>> bloom filters non-portable between different computers that use different
>>> default encodings. I've filed ORC-101 to address it, but I want to have a
>>> wider discussion. I'd propose that we:
>>>
>>> 1. create a new WriterVersion for ORC-101.
>>> 2. move the bloom filter code from storage-api into ORC.
>>> 3. consistently use UTF-8 when creating new bloom filters
>>> 4. for ORC files older than ORC-101, test the default encoding instead of
>>> UTF-8
>>>
>>> Thoughts?
>>>
>>> .. Owen
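
To make points 3 and 4 of the proposal concrete, here is a minimal sketch of how a reader might pick the encoding for a bloom filter lookup from the file's writer version. Everything in it is an assumption for illustration: the class `BloomEncodingSketch`, the two-value `WriterVersion` enum, and the methods `charsetFor` and `bloomKeyBytes` are hypothetical names, not the actual ORC API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; these names are not taken from the ORC code base.
final class BloomEncodingSketch {

    // Stand-in for the writer version recorded in the file, in release order.
    enum WriterVersion {
        PRE_ORC_101,  // any writer older than the ORC-101 fix
        ORC_101;      // first writer that always encodes bloom filter strings as UTF-8

        // True if this version is the same as or newer than the given fix.
        boolean includes(WriterVersion fix) {
            return compareTo(fix) >= 0;
        }
    }

    // Files written with ORC-101 or later use UTF-8; older files fall back to
    // the platform default encoding, which is a guess about the writing machine.
    static Charset charsetFor(WriterVersion fileVersion, Charset platformDefault) {
        return fileVersion.includes(WriterVersion.ORC_101)
            ? StandardCharsets.UTF_8
            : platformDefault;
    }

    // Bytes that would be hashed when probing the bloom filter for a string value.
    static byte[] bloomKeyBytes(String value, WriterVersion fileVersion,
                                Charset platformDefault) {
        return value.getBytes(charsetFor(fileVersion, platformDefault));
    }
}
```

A reader would typically pass `Charset.defaultCharset()` as `platformDefault`. The portability problem shows up with any non-ASCII value: `"\u00e9"` encodes to two bytes in UTF-8 but to one byte in ISO-8859-1, so a filter built under one encoding will not match a probe made under the other. The sketch does not cover the alternative Owen mentions, where predicate push down simply answers "maybe" for old files.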