Re: how to represent and optimize a query over partitioned storage

Alessandro Solimando Mon, 13 Dec 2021 02:01:13 -0800

Thank you Jacques and Jing Zhang for the code pointers!

On Mon, 13 Dec 2021 at 03:58, Jing Zhang <[email protected]> wrote:


> Hi,
> partition pruner is a common requirement.
> You could provide a rule which push partition evaluated by filter condition
> into a TableScan.
> There are two ways to ensure the converted result would be chosen in the
> final plan.
> 1. If use CBO planner, you could override the `getRowCount ` by merging all
> the row count of matched partitions into total row count. After apply the
> rule, the total row count must be less than before.
> 2. You could use HBO planner which I thought is proper for this case,
> because we could always get benifite  from partition pruner.
>
> Apache Flink also provides Partitioner pruner, you could find it's
> implementation here [1]. Hope it helps.
>
> [1]
>
> https://github.com/apache/flink/blob/master/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/plan/rules/logical/PushPartitionIntoTableSourceScanRule.java
>
> Best,
> Jing Zhang
>
> Jacques Nadeau <[email protected]> 于2021年12月13日周一 01:57写道：
>
> > I don't have a great pointer to an intelligent arbitrary filter coster.
> The
> > default one in Calcite isn't great (last I checked) as it considers more
> > filters to be equivalent to more reduction. This means it dramatically
> > overestimates set reduction. In Dremio, in simple cases we had to apply
> an
> > upper limit on reduction.
> >
> > For partitions specifically, we actually apply the condition to the
> > underlying partition details to get an accurate new cost. You can see the
> > base class of the pruning here:
> >
> >
> >
> https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/exec/planner/logical/partition/PruneScanRuleBase.java
> >
> > In this code, the problem is decomposed into first doing search argument
> > pruning (sarg) followed by arbitrary expression pruning (using an
> > interpreter).
> >
> > Sorry I don't have a great pointer to a high-quality selectivity
> estimator
> > in OSS that uses more advanced stats. Maybe someone else can point to
> one.
> >
> > On Sat, Dec 11, 2021 at 11:43 PM Alessandro Solimando <
> > [email protected]> wrote:
> >
> > > Do you have any code pointer for achieving that, Jacques?
> > >
> > > My main concern is how to estimate the new cost. Do you leverage the
> > > estimation of predicate selectivity over the partitioning expression
> > maybe?
> > >
> > > Il Dom 12 Dic 2021, 05:48 Jacques Nadeau <[email protected]> ha
> > scritto:
> > >
> > > > What we have done in the past is push filters into a scan and alter
> the
> > > > costing (and estimated row count). In cases where the filter or
> > portions
> > > of
> > > > the filter can be applied against partitioning columns, you prune
> > > > partitions and use a new row count estimate/cost estimate based on
> the
> > > > reduced partition set.
> > > >
> > > >
> > > > On Fri, Dec 10, 2021 at 10:25 AM Maxim Gramin <
> [email protected]
> > >
> > > > wrote:
> > > >
> > > > > I assume that some of the filter conditions (which are involved in
> > the
> > > > > choice of partitions ) may by pushdown'ed to TableScan
> > > > >
> > > > > On Fri, Dec 10, 2021 at 7:29 PM Константин Новиков
> > > > > <[email protected]> wrote:
> > > > >
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Given some partitioned storage, we can omit the scan of some
> > > partitions
> > > > > > when a filter is present. How can the lower cost of the scan be
> > > > > > represented? As far as I can tell the current approach only
> allows
> > > > > > providing a single cost for the TableScan and Filter can only add
> > to
> > > > > > that. Should my implementation provide a rule that combines
> > > > > > Filter+TableScan?
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: how to represent and optimize a query over partitioned storage

Reply via email to