Re: Bloom filter hash broken

Prasanth Jayachandran Wed, 07 Sep 2016 12:09:06 -0700

+1 to bumping up the writer version so that PPD works correctly on older files.
Alan - PPD will have to look at the writer version to detect old files. Newer
files will have ORC-101 as their writer version.


Thanks
Prasanth
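
For illustration only, the writer-version check described above might be
sketched as below. The WriterVersion enum and the charsetFor helper are
hypothetical stand-ins made up for this sketch, not the actual ORC API; the
only assumption taken from the thread is that files written at ORC-101 or
later use UTF-8, while older files fall back to the reader's default encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BloomCharsetChooser {
    // Hypothetical stand-in for the file's writer version, ordered oldest to newest.
    enum WriterVersion { PRE_ORC_101, ORC_101 }

    // Pick the charset used to turn a string literal into bytes before
    // probing the bloom filter during predicate push down.
    static Charset charsetFor(WriterVersion version) {
        if (version.compareTo(WriterVersion.ORC_101) >= 0) {
            // New files: bloom filters were built from UTF-8 bytes.
            return StandardCharsets.UTF_8;
        }
        // Old files: assume the writer's default encoding matches ours.
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        System.out.println(charsetFor(WriterVersion.ORC_101));
        System.out.println(charsetFor(WriterVersion.PRE_ORC_101));
    }
}
```

The fallback for old files is only a guess about what the writer used, which
is why the alternative raised in the thread is to answer "maybe" for them.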

On Wed, Sep 7, 2016 at 1:12 PM -0500, "Alan Gates" wrote:

I think using the default encoding for the old files is the best option, as it 
will be right 99% of the time.  I was wondering how the system would know 
whether or not this was an old file.

Alan.

> On Sep 7, 2016, at 10:06, Owen O'Malley wrote:
> 
> 4 is about when you are using the bloom filter for predicate push down. I'm
> saying old files should use the default encoding when checking the bloom
> filter. The other option is to always have the predicate push down say
> maybe if the file is an old one.
> 
> .. Owen
> 
> On Wed, Sep 7, 2016 at 9:34 AM, Alan Gates wrote:
> 
>> +1 to 1-3.  On 4, what do you mean by test?  Assume it’s the default
>> encoding and use that?  Is there a versioning concept in the bloom filters
>> that will make it easy to determine if this is pre or post ORC-101?
>> 
>> Alan.
>> 
>>> On Sep 7, 2016, at 08:57, Owen O'Malley wrote:
>>> 
>>> All,
>>>  Dain Sundstrom pointed out to me in personal email that the ORC bloom
>>> filters are currently using the default character encoding. That makes the
>>> bloom filters non-portable between different computers that use different
>>> default encodings. I've filed ORC-101 to address it, but I want to have a
>>> wider discussion. I'd propose that we:
>>> 
>>> 1. create a new WriterVersion for ORC-101.
>>> 2. move the bloom filter code from storage-api into ORC.
>>> 3. consistently use UTF-8 when creating new bloom filters
>>> 4. for ORC files older than ORC-101, test the default encoding instead of
>>> UTF-8
>>> 
>>> Thoughts?
>>> 
>>> .. Owen
>> 
>>
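
As a small illustration of the portability problem raised at the start of the
thread, the sketch below encodes the same string under two charsets and
compares the resulting bytes. ISO-8859-1 is used here only as an example of a
non-UTF-8 default encoding, and the class and method names are made up for
this sketch. Any hash computed over the bytes inherits the difference.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingMismatchDemo {
    // True when the string encodes to identical bytes under UTF-8 and
    // ISO-8859-1, i.e. a byte-based hash would agree across both encodings.
    static boolean sameBytes(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(utf8, latin1);
    }

    public static void main(String[] args) {
        // Pure ASCII: both encodings produce the same bytes.
        System.out.println(sameBytes("orc"));       // prints "true"
        // U+00E9 is two bytes in UTF-8 but one byte in ISO-8859-1.
        System.out.println(sameBytes("caf\u00e9")); // prints "false"
    }
}
```

ASCII-only values are unaffected, so a mismatch shows up only for strings
containing characters that the two encodings represent differently.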
