It looks like we have consensus on this enhancement. I just opened a JIRA to track it: https://issues.apache.org/jira/browse/HAWQ-923
I have also added it to the performance enhancements on the roadmap page.

Cheers
Lei

On Mon, Jul 11, 2016 at 2:00 PM, Ming Li <[email protected]> wrote:

> It seems that dynamic partition pruning in Impala is different from the DPE
> (dynamic partition elimination) in HAWQ. Below is the feature description
> from the Impala roadmap (http://impala.io/overview.html):
>
>    - Dynamic partition pruning - to perform data elimination of queries
>      where the partition filters are in dimension tables instead of the
>      fact tables
>
> On Fri, Jul 8, 2016 at 9:56 PM, Ruilong Huo <[email protected]> wrote:
>
> > Strongly agree with Ming's proposal.
> >
> > We do have DPE (dynamic partition elimination) in HAWQ, but it is a kind
> > of high-level skipping that is conducted in the planning phase.
> > If fine-grained filtering can be done at runtime in the execution phase,
> > there might be more performance gain for I/O-intensive workloads.
> >
> > Looking forward to seeing a plan for it soon :)
> >
> > Best regards,
> > Ruilong Huo
> >
> > On Fri, Jul 8, 2016 at 7:02 AM, Ivan Weng <[email protected]> wrote:
> >
> > > Thanks Ming, data skipping technology is really what HAWQ needs.
> > > Hope to see the design and maybe a prototype soon.
> > >
> > > On Thu, Jul 7, 2016 at 10:33 AM, Wen Lin <[email protected]> wrote:
> > >
> > > > Thanks for sharing with us!
> > > > It's a really good investigation and proposal.
> > > > Looking forward to a design draft.
> > > >
> > > > On Thu, Jul 7, 2016 at 10:16 AM, Lili Ma <[email protected]> wrote:
> > > >
> > > > > How about we work out a draft design describing how to implement
> > > > > data skipping technology for HAWQ?
> > > > >
> > > > > Thanks
> > > > > Lili
> > > > >
> > > > > On Wed, Jul 6, 2016 at 7:23 PM, Gmail <[email protected]> wrote:
> > > > >
> > > > > > BTW, could you create some related issues in JIRA?
> > > > > >
> > > > > > Thanks
> > > > > > xunzhang
> > > > > >
> > > > > > Sent from my iPhone
> > > > > >
> > > > > > On Jul 2, 2016, at 23:19, Ming Li <[email protected]> wrote:
> > > > > >
> > > > > > > Data skipping technology can avoid a great deal of unnecessary
> > > > > > > I/O, so it can greatly improve performance for I/O-intensive
> > > > > > > queries. Besides eliminating scans of unnecessary table
> > > > > > > partitions according to the partition key range, I think more
> > > > > > > options are available now:
> > > > > > >
> > > > > > > (1) The Parquet / ORC formats introduce lightweight metadata
> > > > > > > such as Min/Max/Bloom filter for each block. Such metadata can
> > > > > > > be exploited when the predicate/filter info can be fetched
> > > > > > > before executing the scan.
> > > > > > >
> > > > > > > However, in HAWQ today all data in a Parquet table needs to be
> > > > > > > scanned into memory before the predicate/filter is processed.
> > > > > > > We don't generate the metadata when INSERTing into a Parquet
> > > > > > > table, and the scan executor doesn't utilize the metadata
> > > > > > > either. Maybe some scan APIs need to be refactored so that we
> > > > > > > can get the predicate/filter info before executing the base
> > > > > > > relation scan.
> > > > > > >
> > > > > > > (2) Based on the technology in (1), especially with Bloom
> > > > > > > filters, more optimizer techniques can be explored further.
> > > > > > > E.g. Impala implemented runtime filtering
> > > > > > > (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_runtime_filtering.html),
> > > > > > > which can be used for
> > > > > > > - dynamic partition pruning
> > > > > > > - converting a join predicate to a base relation predicate
> > > > > > >
> > > > > > > It tells the executor to wait for a moment (the interval can be
> > > > > > > set in a GUC) before executing the base relation scan. If the
> > > > > > > values of interest (e.g. the column in the join predicate has
> > > > > > > only a very small set of values) arrive in time, it can use
> > > > > > > these values to filter the scan; if they don't arrive in time,
> > > > > > > it scans without the filter, which doesn't affect result
> > > > > > > correctness.
> > > > > > >
> > > > > > > Unlike the technology in (1), this technology cannot be used in
> > > > > > > every case; it only performs better in some cases. So it just
> > > > > > > adds some more query plan choices/paths, and the optimizer
> > > > > > > needs to calculate the cost based on statistics and apply it
> > > > > > > only when it lowers the cost.
> > > > > > >
> > > > > > > All in all, maybe more similar technologies can be adopted for
> > > > > > > HAWQ now. Let's start to think about performance-related
> > > > > > > technology; moreover, we need to investigate how these
> > > > > > > technologies can be implemented in HAWQ.
> > > > > > >
> > > > > > > Any ideas or suggestions are welcome. Thanks.
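
To make option (1) from the original proposal concrete, here is a minimal sketch of min/max-based block skipping. It is not HAWQ or Parquet code: `Block`, `make_block`, and `scan_equal` are hypothetical names, and a Python list stands in for an on-disk block. It only illustrates that a scan which knows the predicate up front can skip any block whose min/max range cannot contain the value.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical stand-in for a storage block with min/max metadata."""
    rows: List[int]
    min_val: int
    max_val: int


def make_block(rows: List[int]) -> Block:
    # Metadata is computed once at write time (the INSERT path in the proposal).
    return Block(rows=list(rows), min_val=min(rows), max_val=max(rows))


def scan_equal(blocks: List[Block], target: int) -> Tuple[List[int], int]:
    """Return the rows equal to target and the number of blocks actually read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        # Skip the block if the target lies outside its [min, max] range.
        if target < block.min_val or target > block.max_val:
            continue
        blocks_read += 1
        matches.extend(v for v in block.rows if v == target)
    return matches, blocks_read


blocks = [make_block([1, 2, 3]), make_block([10, 11, 12]), make_block([20, 25, 30])]
print(scan_equal(blocks, 11))
```

The skipping only pays off when the data is at least loosely clustered on the filtered column; with randomly distributed values most blocks would have overlapping ranges and few could be skipped.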

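
The runtime filtering idea in option (2) of the original proposal can be sketched roughly as follows. This is an illustration under assumptions, not Impala's or HAWQ's implementation: `scan_with_runtime_filter` and `hash_join` are hypothetical names, a `queue.Queue` stands in for whatever channel delivers the join-side values, and `wait_seconds` plays the role of the configurable wait interval. The point it shows is that the scan applies the filter if it arrives in time and falls back to a full scan otherwise, while the join itself still enforces the predicate, so the result is the same either way.

```python
import queue


def scan_with_runtime_filter(rows, key_of, filter_queue, wait_seconds):
    """Scan rows, applying a runtime filter only if it arrives within the wait."""
    try:
        # Wait briefly for the set of join keys from the other side of the join.
        allowed = filter_queue.get(timeout=wait_seconds)
    except queue.Empty:
        # Filter did not arrive in time: scan everything.
        return list(rows)
    return [r for r in rows if key_of(r) in allowed]


def hash_join(fact_rows, dim_keys, key_of):
    # The join always applies the predicate, so a missed filter costs only I/O.
    return [r for r in fact_rows if key_of(r) in dim_keys]


def key_of(row):
    return row[0]


fact = [(1, "a"), (2, "b"), (3, "c")]
dim_keys = {2}

arrived = queue.Queue()
arrived.put(dim_keys)
filtered = scan_with_runtime_filter(fact, key_of, arrived, 0.01)
unfiltered = scan_with_runtime_filter(fact, key_of, queue.Queue(), 0.01)

# Same join result with or without the filter; only the scanned row count differs.
print(len(filtered), len(unfiltered))
print(hash_join(filtered, dim_keys, key_of) == hash_join(unfiltered, dim_keys, key_of))
```

Whether the wait is worthwhile depends on how selective the filter is, which matches the point in the proposal that the optimizer would need to choose this path based on cost.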