Re: [DISCUSSION] Page Level Bloom Filter

Manhua Tue, 26 Nov 2019 04:31:27 -0800

Hi Vishal,
   I have a question. To support huge binary/varchar/complex data, will
the number of rows in a page be larger or smaller than 32000?
  Thanks.


On 2019/11/26 11:49:54, Kumar Vishal <[email protected]> wrote: 
> Hi Manhua,
> 
> I agree with Ravindra and Vimal that adding a page-level bloom will not
> improve query performance much, because it will not reduce the amount of
> data read from disk.
> It will only reduce some processing time (decompressing pages and
> applying the filter on those pages).
> Keeping the bloom information in the file footer will help reduce both
> IO and processing.
> 
> I agree with you that finding the distinct count for a column in a
> blocklet will be complex, because a blocklet is cut based on size, not on
> rows. Pages also have a size-based configuration, which is false by default
> now, but we are planning to make it true to support huge
> binary/varchar/complex data.
> 
> Sol1. We can ask the user to pass the cardinality of the column for which
> they want to generate the bloom.
> Sol2. Once the blocklet cut is done while writing the carbondata file, we
> can calculate the cardinality of the column for which the bloom is
> requested. If the size is too large, we can drop the bloom for that blocklet.
> 
> I agree with Ravi that keeping different FPPs (false positive
> probabilities) for the executor and the driver will help reduce the size.
> 
> -Regards
> Kumar Vishal
> 
> 
> On Tue, Nov 26, 2019 at 2:22 PM Manhua <[email protected]> wrote:
> 
> > Hi Vimal,
> >    Regarding your concern: if you have tried the bloom datamap, you may
> > know how difficult it is to configure the bloom parameters. You never
> > know how many (distinct) elements will be added to the bloom filter because
> > a blocklet is configured by size. The more bytes a row has, the fewer rows
> > are added to the blocklet. At block level, this also depends on the
> > block size configuration. Also, please mind the size of the bloom filter.
> >
> >
> > On 2019/11/26 08:24:33, Vimal Das Kammath <[email protected]> wrote:
> > > I agree with Ravindra that having a bloom filter at page level would not
> > > save any IO. Having a bloom filter at file level makes sense, as it could
> > > help to prune files on the driver side. But I am concerned about the
> > > number of false positives that would result if we keep a bloom filter for
> > > an entire file. I think we need to experiment to find the ideal
> > > parameters (bloom size and number of hash functions) that would work
> > > effectively for a file-level bloom filter.
> > >
> > > Regards,
> > > Vimal
> > >
> > > On Tue, Nov 26, 2019 at 12:30 PM ravipesala <[email protected]> wrote:
> > >
> > > > Hi Manhua,
> > > >
> > > > The main problem with this approach is that we cannot save any IO, as
> > > > our IO unit is the blocklet, not the page. Once the data is already in
> > > > memory, I really don’t think we can gain performance with a bloom at
> > > > page level. I feel the solution would be efficient only if IO is saved
> > > > somewhere.
> > > >
> > > > Our min/max index is efficient because it can prune files on the
> > > > driver side and prune blocklets and pages on the executor side. It
> > > > actually saves a lot of IO.
> > > >
> > > > Supporting bloom at the carbondata file and index level is a better
> > > > approach than supporting it only at page level. My intention is that it
> > > > should behave just like the min/max index, so that we can prune the
> > > > data at multiple levels.
> > > >
> > > > On the driver side, at block level, we can have a bloom with a lower
> > > > probability percentage and fewer hash functions to control its size,
> > > > since we load it into memory. At blocklet level we can increase the
> > > > probability and the number of hashes a little for better pruning, and
> > > > at page level we can increase the probability further for a much
> > > > better pruning ability.
> > > >
> > > >
> > > > Regards,
> > > > Ravindra.
> > > >
> > > >
> > > >
> > > > --
> > > > Sent from:
> > > > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > > >
> > >
> >
>
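
The sizing trade-off discussed in this thread (bloom size versus cardinality and FPP, Sol2's "drop the bloom if it is too large", and a different FPP per pruning level) can be sketched with the standard bloom filter formulas. This is only an illustration, not CarbonData code: the function names, the per-level FPP values, and the byte budget are all hypothetical, and 32000 is used merely as an assumed upper bound on distinct values in one page.

```python
import math


def bloom_size_bits(n_distinct: int, fpp: float) -> int:
    """Optimal bit count m = -n * ln(p) / (ln 2)^2 for n distinct values."""
    if n_distinct <= 0:
        raise ValueError("n_distinct must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")
    return math.ceil(-n_distinct * math.log(fpp) / (math.log(2) ** 2))


def bloom_num_hashes(n_distinct: int, size_bits: int) -> int:
    """Optimal hash count k = (m / n) * ln 2, at least 1."""
    return max(1, round(size_bits / n_distinct * math.log(2)))


def plan_bloom(n_distinct: int, fpp: float, max_bytes: int):
    """Return sizing for one bloom, or None when it exceeds the budget.

    Returning None mirrors the idea in Sol2: compute the cardinality after
    the blocklet cut and skip the bloom if it would be too large.
    """
    bits = bloom_size_bits(n_distinct, fpp)
    size_bytes = math.ceil(bits / 8)
    if size_bytes > max_bytes:
        return None
    return {
        "bits": bits,
        "bytes": size_bytes,
        "hashes": bloom_num_hashes(n_distinct, bits),
    }


# Hypothetical per-level FPPs: looser on the driver (held in memory),
# tighter towards the page level. These values are assumptions.
LEVEL_FPP = {"block": 0.1, "blocklet": 0.05, "page": 0.01}

if __name__ == "__main__":
    for level, fpp in LEVEL_FPP.items():
        # 32000 distinct values and a 1 MiB budget are illustrative only.
        print(level, plan_bloom(32000, fpp, max_bytes=1 << 20))
```

For a fixed number of distinct values, a looser FPP gives a smaller filter with fewer hash functions, which is why a coarse bloom on the driver and a stricter one on the executor can keep the in-memory footprint down.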
