Re: Bloom filter hash broken

Dain Sundstrom Thu, 08 Sep 2016 11:03:06 -0700

> On Sep 8, 2016, at 9:59 AM, Owen O'Malley <[email protected]> wrote:
> 
> Ok, Prasanth found a problem with my proposed approach. In particular, the
> old readers would misinterpret bloom filters from new files. Therefore, I'd
> like to propose a more complicated solution:
> 1. We extend the stripe footer or bloom filter index to record the default
> encoding when we are writing a string or decimal bloom filter.
> 2. When reading a bloom filter, we use the encoding if it is present.


Does that mean that you always write with the platform encoding?  This would 
make using bloom filters for reads in other programming languages difficult 
because you would need to transcode from UTF_8 to some arbitrary character 
encoding.  It would also make using these bloom filters in performance critical 
sections (join loops) computationally expensive, as every probe has to do a 
transcode first.
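
To make the concern concrete, here is a minimal sketch, not ORC code.  It 
assumes the filter hashes the encoded bytes of the string; bloom_key_hash is a 
hypothetical stand-in (SHA-256 here, which is not the hash ORC uses), and 
ISO-8859-1 stands in for an arbitrary writer platform charset.

```python
import hashlib


def bloom_key_hash(value: str, charset: str) -> str:
    # Hypothetical stand-in for the bloom filter hash: it hashes the
    # encoded bytes, so the result depends on the charset used to encode.
    return hashlib.sha256(value.encode(charset)).hexdigest()


value = "caf\u00e9"  # contains one non-ASCII character

utf8_bytes = value.encode("utf-8")         # what a UTF_8 reader holds
latin1_bytes = value.encode("iso-8859-1")  # what the assumed writer hashed

# Writer used its platform charset, reader probes with UTF_8 bytes.
writer_hash = bloom_key_hash(value, "iso-8859-1")
reader_hash = bloom_key_hash(value, "utf-8")

# To probe correctly the reader must transcode each key first.
transcoded = utf8_bytes.decode("utf-8").encode("iso-8859-1")
```

Pure ASCII keys encode identically in both charsets, so the mismatch only 
shows up for non-ASCII data, which is part of what makes it easy to miss.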

Also, I think the spec needs to be clarified.  The spec does not state the 
character encoding of the bloom filters.  I assumed it was UTF_8 to match the 
normal string column encoding.  It looks like the spec also does not document 
the meaning of "the version of the writer" and what workarounds are necessary 
(or what operating assumptions have been made).  Once we have that, we should 
document that old readers assume that the platform default charset is 
consistent for readers and writers. 

As an alternative, for new files we could add a new stream ID, so the old 
readers skip them.
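
A rough sketch of how that would behave, assuming old readers ignore stream 
kinds they do not recognize.  The kind names below, including 
BLOOM_FILTER_UTF8, are hypothetical and only for illustration.

```python
# Hypothetical stream kinds; the name of the new kind is illustrative only.
OLD_READER_KINDS = {"ROW_INDEX", "BLOOM_FILTER"}
NEW_READER_KINDS = OLD_READER_KINDS | {"BLOOM_FILTER_UTF8"}


def streams_read_by(known_kinds, stripe_streams):
    # Assumption: a reader silently skips any stream kind it does not know.
    return [kind for kind in stripe_streams if kind in known_kinds]


# A new writer emits the bloom filter only under the new stream kind.
new_file_streams = ["ROW_INDEX", "BLOOM_FILTER_UTF8"]

old_view = streams_read_by(OLD_READER_KINDS, new_file_streams)
new_view = streams_read_by(NEW_READER_KINDS, new_file_streams)
```

The old reader never sees a bloom filter it could misinterpret; it just loses 
the filtering for new files.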

-dain
