To my understanding, if we do this, only the IO for the *filter columns*' 
column pages is saved, and only when minmax/pagebloom decides we can *skip* 
scanning those pages.  
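
To make the point concrete, here is a rough sketch of the two-step read as I 
picture it. This is not CarbonData code: PageMeta, page_may_match, 
plan_page_reads and bytes_saved are hypothetical names, the per-page fields 
only stand in for what datachunk3 would carry, and a plain Python set stands 
in for the page-level bloom filter (so it has no false positives, unlike a 
real one).

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class PageMeta:
    """Hypothetical per-page metadata, standing in for datachunk3 content."""
    offset: int                       # byte offset of the page in the column chunk
    length: int                       # byte length of the stored page
    min_value: int                    # page-level min of the filter column
    max_value: int                    # page-level max of the filter column
    bloom: Optional[Set[int]] = None  # stand-in for a page-level bloom filter


def page_may_match(meta: PageMeta, value: int) -> bool:
    """Return False only when the page certainly does not contain `value`."""
    if value < meta.min_value or value > meta.max_value:
        return False  # pruned by minmax
    if meta.bloom is not None and value not in meta.bloom:
        return False  # pruned by the (stand-in) page bloom
    return True


def plan_page_reads(pages: List[PageMeta], value: int) -> List[PageMeta]:
    """Step 2 of the split IO: keep only the pages that survive pruning."""
    return [p for p in pages if page_may_match(p, value)]


def bytes_saved(pages: List[PageMeta], value: int) -> int:
    """Page bytes of the filter column that no longer need to be read."""
    kept = plan_page_reads(pages, value)
    return sum(p.length for p in pages) - sum(p.length for p in kept)


# Step 1 would read only the metadata; these three pages are made-up values.
pages = [
    PageMeta(offset=0, length=100, min_value=0, max_value=9),
    PageMeta(offset=100, length=200, min_value=10, max_value=19),
    PageMeta(offset=300, length=300, min_value=5, max_value=25, bloom={5, 12, 25}),
]
```

With these made-up pages, an equality filter on 15 keeps only the second 
page: the first is pruned by minmax and the third by the bloom stand-in. The 
saving applies only to the filter column's pages, and only when the pruning 
step actually rules pages out.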

On 2019/11/12 03:37:08, Jacky Li <[email protected]> wrote: 
> 
> 
> On 2019/11/05 02:30:30, Manhua Jiang <[email protected]> wrote: 
> > Hi Jacky,
> >   If we create the bloom filter at blocklet level, it may be too similar 
> > to the bloom datamap and would face the same problems the bloom datamap 
> > faces, except that the pruning runs on the executor side.
> >   Page level is preferred since the page size is KNOWN, which frees us 
> > from working out how many bits the bloom filter's bitmap needs; only the 
> > FPP needs to be set.
> >   I checked, and the problem you mentioned does exist. It is also a 
> > problem when pruning pages by page minmax. Although minmax may determine 
> > that a page does not need to be scanned, the current query logic has 
> > already loaded both the datachunk3 and the column pages, so the IO for 
> > the column page is wasted. Should we change this first? Is it worth 
> > separating one IO operation into two? 
> 
> In my opinion, yes. We should use the datachunk3 to check whether the 
> column pages are needed before reading them. This can reduce IO 
> dramatically for some use cases, for example high-selectivity filter queries.
> 
> > 
> > Anyone interested in this part is welcome to share ideas as well.
> > 
> > Thanks.
> > Manhua
> > 
> > On 2019/11/04 09:15:35, Jacky Li <[email protected]> wrote: 
> > > Hi Manhua,
> > > 
> > > +1 for this feature.
> > > 
> > > One question:
> > > Since one column chunk in one blocklet is carbon's minimum IO unit, why 
> > > not create the bloom filter at blocklet level? If it is at page level, 
> > > we still need to read the page data into memory, so the saving is only 
> > > in decompression.
> > > 
> > > 
> > > Regards,
> > > Jacky
> > > 
> > > 
> > 
> 
