To my understanding, if we do this, only the IO for the *filter columns*' column pages is saved, and only on the condition that minmax/pagebloom decides we can *skip* scanning those pages.
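To make the point concrete, here is a minimal sketch in Python (not CarbonData code) of splitting the read into two steps: look at the per-page min/max first, then read only the pages that may match. `PageMeta`, `CountingReader`, `pages_to_scan` and `read_matching_pages` are hypothetical names for illustration, and the sketch assumes the page metadata is already in memory, so the cost of reading datachunk3 itself is not counted.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PageMeta:
    # Hypothetical stand-in for the per-page statistics stored in datachunk3.
    min_value: int
    max_value: int
    offset: int  # byte offset of the page inside the column chunk
    length: int  # byte length of the page


class CountingReader:
    """Wraps an in-memory buffer and records how many bytes were read."""

    def __init__(self, data: bytes) -> None:
        self._data = data
        self.bytes_read = 0

    def read(self, offset: int, length: int) -> bytes:
        self.bytes_read += length
        return self._data[offset:offset + length]


def pages_to_scan(metas: List[PageMeta], value: int) -> List[int]:
    # Step 1: prune with min/max only; no column page is touched here.
    return [i for i, m in enumerate(metas) if m.min_value <= value <= m.max_value]


def read_matching_pages(reader: CountingReader, metas: List[PageMeta],
                        value: int) -> Dict[int, bytes]:
    # Step 2: issue IO only for the pages that survived pruning.
    return {i: reader.read(metas[i].offset, metas[i].length)
            for i in pages_to_scan(metas, value)}


if __name__ == "__main__":
    metas = [PageMeta(0, 9, 0, 100), PageMeta(10, 19, 100, 100),
             PageMeta(20, 29, 200, 100)]
    reader = CountingReader(bytes(300))
    pages = read_matching_pages(reader, metas, 15)
    print(sorted(pages), reader.bytes_read)
```

In this toy layout only the middle page is read for the filter value 15, so 100 of the 300 bytes are fetched. The saving applies to the filter column's pages and disappears when the min/max ranges overlap the filter value on every page.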

On 2019/11/12 03:37:08, Jacky Li <[email protected]> wrote:
>
>
> On 2019/11/05 02:30:30, Manhua Jiang <[email protected]> wrote:
> > Hi Jacky,
> > If we create bloom filter in blocklet level, maybe too similar to bloom
> > datamap and have to face the same problems bloom datamap facing, except the
> > pruning is running in executor side.
> > Page level is preferred since page size is KNOWN and this let us get rid
> > of considering how many bit should we need in the bitmap of bloom filter,
> > only the FPP needed to be set.
> > I checked the problem you mentioned actually exists. This also a problem
> > when pruning pages by page minmax. Although minmax may believes this page
> > does not need to scan, current query logic already loaded both the
> > datachunk3 and column pages. The IO for column page is wasted. Should we
> > change this first? Is this worth for us to separate one IO operation into
> > two?
>
> In my opinion, I think yes. We should leverage the datachunk3 and check
> whether the column pages are needed before reading. This can reduce the IO
> dramatically for some use case, for example, high selectivity filter query.
>
> >
> > Anyone interesting in this part is welcomed to share you ideas also.
> >
> > Thanks.
> > Manhua
> >
> > On 2019/11/04 09:15:35, Jacky Li <[email protected]> wrote:
> > > Hi Manhua,
> > >
> > > +1 for this feature.
> > >
> > > One question:
> > > Since one column chunk in one blocklet is carbon's minimum IO unit, why not
> > > create bloom filter in blocklet level? If it is page level, we still need to
> > > read page data into memory, the saving is only for decompression.
> > >
> > >
> > > Regards,
> > > Jacky
> > >
> > >
> > >
> > > --
> > > Sent from:
> > > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > >
