Hi Vishal, I have a question. To support huge binary/varchar/complex data, will the number of rows in a page be larger or smaller than 32000? Thanks.
On 2019/11/26 11:49:54, Kumar Vishal <[email protected]> wrote:
> Hi Manhua,
>
> I agree with Ravindra and Vimal that adding a page-level bloom will not
> improve query performance much, because it will not reduce the amount of
> data read from disk.
> It will only reduce some processing time (decompressing pages and
> applying the filter on those pages).
> Keeping the bloom information in the file footer will help reduce both
> IO and processing.
>
> I agree with you that finding the distinct count for a column in a
> blocklet will be complex, because a blocklet is cut by size, not by rows.
> Pages also have a size-based configuration, which is false by default now,
> but we are planning to make it true to support huge binary/varchar/complex
> data.
>
> Sol1. We can ask the user to pass the cardinality of the column for which
> he wants to generate the bloom.
> Sol2. Once the blocklet cut is done while writing the carbondata file, we
> can calculate the cardinality of the column for which he wants to generate
> the bloom. If the size is too large, we can drop the bloom for that
> blocklet.
>
> I agree with Ravi that keeping different FPPs for the executor and driver
> will help reduce the size.
>
> -Regards
> Kumar Vishal
>
> On Tue, Nov 26, 2019 at 2:22 PM Manhua <[email protected]> wrote:
>
> > Hi Vimal,
> > As for your concern: if you have tried the bloom datamap, you may
> > know how difficult it is to configure the bloom parameters. You never
> > know how many (distinct) elements will be added to the bloom filter,
> > because a blocklet is configured by size. The more bytes a row has, the
> > fewer rows are added to the blocklet. At block level, this is related to
> > the block size configuration too. Also, please mind the size of the
> > bloom filter.
> >
> > On 2019/11/26 08:24:33, Vimal Das Kammath <[email protected]>
> > wrote:
> > > I agree with Ravindra that having a bloom filter at page level would
> > > not save any IO.
> > > Having a bloom filter at file level makes sense, as it could help to
> > > prune files at the driver side. But I am concerned about the number of
> > > false positives that would result if we keep a bloom filter at the
> > > level of an entire file. I think we need to experiment to find the
> > > ideal parameters (bloom size and number of hash functions) that would
> > > work effectively for a file-level bloom filter.
> > >
> > > Regards,
> > > Vimal
> > >
> > > On Tue, Nov 26, 2019 at 12:30 PM ravipesala <[email protected]>
> > > wrote:
> > >
> > > > Hi Manhua,
> > > >
> > > > The main problem with this approach is that we cannot save any IO,
> > > > as our IO unit is the blocklet, not the page. Once the data is
> > > > already in memory, I really don’t think we can gain performance with
> > > > a bloom at page level. I feel the solution would be efficient only
> > > > if IO is saved somewhere.
> > > >
> > > > Our min/max index is efficient because it can prune the files at the
> > > > driver side and prune the blocklets and pages at the executor side.
> > > > It actually saves lots of IO.
> > > >
> > > > Supporting bloom at the carbondata file and index level is a better
> > > > approach than supporting it only at page level. My intention is that
> > > > it should behave just the same as the min/max index, so that we can
> > > > prune the data at multiple levels.
> > > >
> > > > On the driver side, at block level, we can have a bloom with a lower
> > > > probability percentage and fewer hash functions to control the size,
> > > > as we load it into memory. At blocklet level we can increase the
> > > > probability and hashes a little more for better pruning, and at page
> > > > level we can increase the probability further to get much better
> > > > pruning ability.
> > > >
> > > > Regards,
> > > > Ravindra.
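To make the sizing discussion in the thread concrete, here is a minimal sketch using the standard Bloom filter formulas (bits m = -n ln p / (ln 2)^2, hash count k = (m/n) ln 2). It is not CarbonData code. The function names, the per-level FPP values, and the 64 KB size cap are assumptions chosen only for illustration; the 32000 figure is the rows-per-page number from the question above, used as an upper bound on distinct values in a page.

```python
import math


def bloom_size(n_distinct: int, fpp: float) -> tuple[int, int]:
    """Return (bits, hash_count) for a standard Bloom filter.

    n_distinct: expected number of distinct values inserted.
    fpp: target false positive probability, strictly between 0 and 1.
    """
    if n_distinct <= 0:
        raise ValueError("n_distinct must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be in (0, 1)")
    bits = math.ceil(-n_distinct * math.log(fpp) / (math.log(2) ** 2))
    hashes = max(1, round(bits / n_distinct * math.log(2)))
    return bits, hashes


def should_keep_bloom(n_distinct: int, fpp: float, max_bytes: int) -> bool:
    """Sketch of Sol2: once the cardinality is known after the blocklet cut,
    keep the bloom only if it fits in a size budget (max_bytes is hypothetical)."""
    bits, _ = bloom_size(n_distinct, fpp)
    return math.ceil(bits / 8) <= max_bytes


if __name__ == "__main__":
    # Hypothetical FPP per level: looser where the filter is held in driver
    # memory, tighter closer to the data. These values are assumptions.
    levels = {"block (driver)": 0.1, "blocklet": 0.05, "page": 0.01}
    for name, fpp in levels.items():
        bits, hashes = bloom_size(32000, fpp)
        print(f"{name}: fpp={fpp} bytes={math.ceil(bits / 8)} hashes={hashes}")
    # Drop the bloom when it would exceed an assumed 64 KB cap.
    print(should_keep_bloom(32000, 0.01, 64 * 1024))
```

The point of the sketch is that the filter size scales linearly with the number of distinct values, so a size-based blocklet or page cut makes the right size hard to fix in advance, while a looser FPP at the driver level shrinks both the bit array and the hash count.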
