Thanks for sharing with us! It's really a good investigation and proposal. Looking forward to a design draft. To make the discussion a bit more concrete, I've added two rough Python sketches of the two ideas below the quoted thread.
On Thu, Jul 7, 2016 at 10:16 AM, Lili Ma <[email protected]> wrote:
> How about we work out a draft design describing how to implement data
> skipping technology for HAWQ?
>
> Thanks
> Lili
>
> On Wed, Jul 6, 2016 at 7:23 PM, Gmail <[email protected]> wrote:
>
> > BTW, could you create some related issues in JIRA?
> >
> > Thanks
> > xunzhang
> >
> > Sent from my iPhone
> >
> > > On Jul 2, 2016, at 23:19, Ming Li <[email protected]> wrote:
> > >
> > > Data skipping technology can avoid a great deal of unnecessary IO, so it
> > > can greatly improve performance for IO-intensive queries. Besides
> > > eliminating queries on unnecessary table partitions according to the
> > > partition key range, I think more options are available now:
> > >
> > > (1) The Parquet / ORC formats provide lightweight metadata such as
> > > Min/Max/Bloom filters for each block; such metadata can be exploited when
> > > the predicate/filter info can be fetched before executing the scan.
> > >
> > > However, in HAWQ today all Parquet data needs to be scanned into memory
> > > before predicates/filters are processed. We don't generate the metadata
> > > when INSERTing into a Parquet table, and the scan executor doesn't
> > > utilize the metadata either. Maybe some scan APIs need to be refactored
> > > so that we can get the predicate/filter info before executing the base
> > > relation scan.
> > >
> > > (2) Based on the technology in (1), especially the Bloom filter, more
> > > optimizer techniques can be explored further. E.g. Impala implemented
> > > runtime filtering (
> > > https://www.cloudera.com/documentation/enterprise/latest/topics/impala_runtime_filtering.html
> > > ), which can be used for
> > > - dynamic partition pruning
> > > - converting a join predicate into a base relation predicate
> > >
> > > It tells the executor to wait for a moment (the interval can be set via
> > > a GUC) before executing the base relation scan. If the interesting
> > > values (e.g. when the column in the join predicate has only a very small
> > > value set) arrive in time, it can use them to filter the scan; if they
> > > don't arrive in time, it scans without the filter, which doesn't affect
> > > result correctness.
> > >
> > > Unlike (1), this technique cannot be applied in every case; it only
> > > outperforms in some cases. So it just adds more query plan
> > > choices/paths, and the optimizer needs to calculate the cost based on
> > > statistics and apply the filter only when it lowers the cost.
> > >
> > > All in all, maybe more techniques like these are adoptable for HAWQ now.
> > > Let's start thinking about performance-related technology; moreover, we
> > > need to investigate how these techniques can be implemented in HAWQ.
> > >
> > > Any ideas or suggestions are welcome. Thanks.
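
Regarding point (1) in Ming Li's mail above, here is a minimal Python sketch of the block-skipping idea, just for illustration. The names (BlockStats, blocks_to_scan) are hypothetical and don't correspond to any existing HAWQ or Parquet API; the point is only that per-block Min/Max metadata, written at INSERT time, lets the scan decide before touching the data which blocks can possibly match a range predicate.

# Rough sketch (not existing HAWQ code): per-block Min/Max statistics let
# the scan skip blocks whose value range cannot satisfy a range predicate
# such as "c BETWEEN lo AND hi".
from collections import namedtuple

# Hypothetical per-block metadata for one column.
BlockStats = namedtuple("BlockStats", ["block_id", "min_val", "max_val"])

def blocks_to_scan(stats, lo, hi):
    """Return ids of blocks whose [min_val, max_val] range overlaps [lo, hi].
    All other blocks are skipped, so their pages are never read from disk."""
    return [s.block_id for s in stats if s.max_val >= lo and s.min_val <= hi]

# Example: for "WHERE c BETWEEN 95 AND 110" only block 2 has to be scanned.
stats = [BlockStats(0, 0, 40), BlockStats(1, 41, 80), BlockStats(2, 81, 120)]
print(blocks_to_scan(stats, 95, 110))   # -> [2]

A per-block Bloom filter would play the same role for equality/IN predicates where Min/Max ranges are too coarse to prune anything.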

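And for the runtime-filtering idea in (2), a similarly rough sketch, assuming a hypothetical wait_ms setting in place of a real GUC and a plain set of join keys standing in for the Bloom filter pushed from the join's build side. The key property Ming described is preserved: if the filter misses the deadline, the scan simply runs unfiltered, so timing affects performance but never correctness.

# Rough sketch (not Impala or HAWQ code) of runtime filtering: the scan
# waits a bounded time for the join keys produced by the other side of the
# join; if the filter arrives it prunes rows early, otherwise the scan
# proceeds unfiltered.
import queue

def scan_with_runtime_filter(rows, filter_queue, wait_ms, join_col):
    # Block for at most wait_ms milliseconds waiting for the join-key set
    # (or a Bloom filter over it) published by the join's build side.
    try:
        join_keys = filter_queue.get(timeout=wait_ms / 1000.0)
    except queue.Empty:
        join_keys = None  # filter missed the deadline: fall back to a full scan

    for row in rows:
        # With no filter, every row is returned; the join above the scan
        # still discards non-matching rows, so the result is unchanged.
        if join_keys is None or row[join_col] in join_keys:
            yield row

# Example: the filter {7, 42} arrives in time, so only matching rows survive
# the scan; if the queue had stayed empty, all three rows would be returned.
q = queue.Queue()
q.put({7, 42})
rows = [{"id": 7}, {"id": 13}, {"id": 42}]
print(list(scan_with_runtime_filter(rows, q, wait_ms=100, join_col="id")))

Whether to add such a wait at all would be a cost-based decision, as Ming noted, since the extra latency only pays off when the build side's key set is small and arrives quickly.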