Re: [Propose] More data skipping technology for IO intensive performance enhancement

Lei Chang Wed, 13 Jul 2016 18:04:39 -0700

Looks like we have consensus on this enhancement. I just created a JIRA to
track it: https://issues.apache.org/jira/browse/HAWQ-923

I also added this to the performance enhancements on the roadmap page.

Cheers
Lei


On Mon, Jul 11, 2016 at 2:00 PM, Ming Li <[email protected]> wrote:

> It seems that dynamic partition pruning in Impala is different from the DPE
> (dynamic partition elimination) in HAWQ. Below is the feature description
> from the Impala roadmap (http://impala.io/overview.html).
>
>
>    - Dynamic partition pruning - to perform data elimination of queries
>    where the partition filters are in dimension tables instead of the fact
>    tables
>
>
> On Fri, Jul 8, 2016 at 9:56 PM, Ruilong Huo <[email protected]> wrote:
>
> > Strongly agree with Ming's proposal.
> >
> > We do have DPE (dynamic partition elimination) in HAWQ, but it is a kind of
> > high-level skipping that is conducted in the planning phase.
> > If fine-grained filtering can be done at runtime in the execution phase,
> > there might be more performance gain for I/O-intensive workloads.
> >
> > Looking forward to seeing a plan for it soon :)
> >
> > Best regards,
> > Ruilong Huo
> >
> > On Fri, Jul 8, 2016 at 7:02 AM, Ivan Weng <[email protected]> wrote:
> >
> > > Thanks Ming, data skipping technology is really what HAWQ needs.
> > > Hope to see a design and maybe a prototype soon.
> > >
> > > On Thu, Jul 7, 2016 at 10:33 AM, Wen Lin <[email protected]> wrote:
> > >
> > > > Thanks for sharing with us!
> > > > It's really a good investigation and proposal.
> > > > Looking forward to a design draft.
> > > >
> > > > On Thu, Jul 7, 2016 at 10:16 AM, Lili Ma <[email protected]> wrote:
> > > >
> > > > > How about we work out a draft design describing how to implement
> > > > > data skipping technology for HAWQ?
> > > > >
> > > > >
> > > > > Thanks
> > > > > Lili
> > > > >
> > > > > On Wed, Jul 6, 2016 at 7:23 PM, Gmail <[email protected]> wrote:
> > > > >
> > > > > > BTW, could you create some related issues in JIRA?
> > > > > >
> > > > > > Thanks
> > > > > > xunzhang
> > > > > >
> > > > > > > Sent from my iPhone
> > > > > >
> > > > > > > > On Jul 2, 2016, at 23:19, Ming Li <[email protected]> wrote:
> > > > > > >
> > > > > > > > Data skipping technology can avoid a great deal of unnecessary I/O,
> > > > > > > > so it can greatly improve performance for I/O-intensive queries.
> > > > > > > > Besides eliminating scans of unnecessary table partitions according
> > > > > > > > to the partition key range, I think more options are available now:
> > > > > > >
> > > > > > > > (1) The Parquet / ORC formats introduce lightweight metadata such as
> > > > > > > > Min/Max/Bloom filter for each block. Such metadata can be exploited
> > > > > > > > when the predicate/filter info can be fetched before executing the
> > > > > > > > scan.
> > > > > > >
> > > > > > > > However, in HAWQ today all data in a Parquet table needs to be
> > > > > > > > scanned into memory before the predicate/filter is processed. We
> > > > > > > > don't generate the metadata when we INSERT into a Parquet table, and
> > > > > > > > the scan executor doesn't use the metadata either. Maybe some scan
> > > > > > > > APIs need to be refactored so that we can get the predicate/filter
> > > > > > > > info before executing the base relation scan.
> > > > > > >
> > > > > > > > (2) Based on the technology in (1), especially the Bloom filter, more
> > > > > > > > optimizer techniques can be explored further. E.g. Impala implemented
> > > > > > > > runtime filtering
> > > > > > > > (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_runtime_filtering.html),
> > > > > > > > which can be used for
> > > > > > > > - dynamic partition pruning
> > > > > > > > - converting a join predicate to a base relation predicate
> > > > > > >
> > > > > > > > It tells the executor to wait for a moment (the interval can be set
> > > > > > > > in a GUC) before executing the base relation scan. If the values of
> > > > > > > > interest (e.g. the column in the join predicate has only a very small
> > > > > > > > set of values) arrive in time, it can use these values to filter the
> > > > > > > > scan. If they don't arrive in time, it scans without the filter,
> > > > > > > > which doesn't affect result correctness.
> > > > > > >
> > > > > > > > Unlike the technology in (1), this technology cannot be used in every
> > > > > > > > case; it only pays off in some cases. So it just adds some more query
> > > > > > > > plan choices/paths, and the optimizer needs to calculate the cost
> > > > > > > > based on statistics and apply it when it lowers the cost.
> > > > > > >
> > > > > > > > All in all, more similar technologies may be adoptable for HAWQ now.
> > > > > > > > Let's start to think about performance-related technology; moreover,
> > > > > > > > we need to investigate how these technologies can be implemented in
> > > > > > > > HAWQ.
> > > > > > >
> > > > > > > > Any ideas or suggestions are welcome. Thanks.
> > > > > >
> > > > >
> > > >
> > >
> >
>
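For readers following the thread, here is a minimal sketch of the min/max idea from point (1) of the original proposal. It is not HAWQ, Parquet, or ORC code: the Block class, make_block, and scan_range are hypothetical names made up for illustration, and real formats keep these statistics in file metadata rather than in Python objects. It only shows how a scan that knows the predicate up front can skip blocks whose [min, max] range cannot match.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical storage block carrying per-block min/max metadata."""
    values: List[int]
    min_value: int
    max_value: int


def make_block(values) -> Block:
    # The metadata is computed once, at write (INSERT) time.
    values = list(values)
    return Block(values, min(values), max(values))


def scan_range(blocks: List[Block], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return (rows with lo <= v <= hi, number of blocks actually read)."""
    rows, blocks_read = [], 0
    for block in blocks:
        # Skip the block if its [min, max] range cannot overlap [lo, hi].
        if block.max_value < lo or block.min_value > hi:
            continue
        blocks_read += 1
        rows.extend(v for v in block.values if lo <= v <= hi)
    return rows, blocks_read


# Three blocks of 100 values each; the predicate only touches the second one.
blocks = [make_block(range(0, 100)),
          make_block(range(100, 200)),
          make_block(range(200, 300))]
rows, blocks_read = scan_range(blocks, 120, 130)
```

The skip test is only possible because the predicate bounds are known before the scan starts, which is why the proposal asks for the scan API to receive the predicate/filter info first.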

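The wait-then-scan behaviour described in point (2) of the original proposal can be sketched in the same spirit. This is an assumption-laden toy, not Impala's or HAWQ's implementation: RuntimeFilter, scan_fact, and hash_join are hypothetical names, the wait interval stands in for the GUC mentioned in the thread, and a plain set of join keys stands in for a Bloom filter.

```python
import threading


class RuntimeFilter:
    """Hypothetical holder for join-key values published by the build side."""

    def __init__(self):
        self._ready = threading.Event()
        self._keys = None

    def publish(self, keys):
        self._keys = frozenset(keys)
        self._ready.set()

    def wait(self, timeout_s):
        # Return the key set if it arrived within timeout_s, otherwise None.
        return self._keys if self._ready.wait(timeout_s) else None


def scan_fact(fact_rows, runtime_filter, wait_s):
    """Scan the fact table, using the runtime filter only if it arrived in time."""
    keys = runtime_filter.wait(wait_s)
    if keys is None:
        return list(fact_rows), False  # filter was late: scan everything
    return [row for row in fact_rows if row[0] in keys], True


def hash_join(fact_rows, dim_keys):
    # The join predicate is always applied, with or without the early filter.
    return [row for row in fact_rows if row[0] in dim_keys]


fact = [(k, f"row{k}") for k in range(10)]
dim_keys = {2, 5}

on_time = RuntimeFilter()
on_time.publish(dim_keys)                       # filter arrives before the scan
scanned_fast, used_fast = scan_fact(fact, on_time, 0.05)

late = RuntimeFilter()                          # never published: wait times out
scanned_slow, used_slow = scan_fact(fact, late, 0.01)

joined_fast = hash_join(scanned_fast, dim_keys)
joined_slow = hash_join(scanned_slow, dim_keys)
```

Because the join still applies its own predicate, the late case returns the same joined rows as the on-time case; the filter only changes how many rows the scan hands upward, which is the correctness argument made in the thread.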