[ https://issues.apache.org/jira/browse/DRILL-6442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Volodymyr Vysotskyi updated DRILL-6442:
---------------------------------------
Labels: ready-to-commit (was: )
> Adjust Hbase disk cost & row count estimation when filter push down is applied
> ------------------------------------------------------------------------------
>
> Key: DRILL-6442
> URL: https://issues.apache.org/jira/browse/DRILL-6442
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.13.0
> Reporter: Arina Ielchiieva
> Assignee: Arina Ielchiieva
> Priority: Major
> Labels: ready-to-commit
> Fix For: 1.14.0
>
>
> Disk cost for an HBase scan is calculated based on the scan size in bytes:
> {noformat}
> float diskCost = scanSizeInBytes * ((columns == null || columns.isEmpty()) ?
> 1 : columns.size() / statsCalculator.getColsPerRow());
> {noformat}
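> For illustration, a minimal standalone computation of this formula with hypothetical numbers (the values and the explicit float cast are mine, not Drill's actual code):
> {noformat}
> // Hypothetical inputs: a 1,000,000-byte scan projecting 2 of an
> // estimated 10 columns per row.
> long scanSizeInBytes = 1_000_000L;
> int projectedColumnCount = 2;
> int colsPerRow = 10;
> // Cast to float so the column ratio is not truncated by integer division.
> float diskCost = scanSizeInBytes * ((float) projectedColumnCount / colsPerRow);
> // diskCost == 200,000: proportional to the fraction of columns read.
> {noformat}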
> The scan size in bytes is estimated by {{TableStatsCalculator}} with the help
> of sampling.
> When we estimate the size for the first time (before applying filter push
> down), sampling uses random rows. When estimating after filter push down,
> sampling uses only rows that qualify the filter condition. As a result, the
> average row size can turn out higher after filter push down than before.
> Unfortunately, since the disk cost depends on these estimates, a plan with
> filter push down can end up with a higher cost than one without it.
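> To make the inversion concrete, a hypothetical example (all numbers are invented for illustration):
> {noformat}
> // Before push down: a random sample gives 100 bytes average row size
> // over the default 1,000,000 estimated rows.
> long scanSizeNoFilter = 1_000_000L * 100;   // 100,000,000 bytes
> // After push down: only qualifying rows are sampled, and they happen
> // to be wider, say 180 bytes on average, over 600,000 estimated rows.
> long scanSizeWithFilter = 600_000L * 180;   // 108,000,000 bytes
> // The filtered plan now looks MORE expensive on disk cost, so the
> // planner may prefer the plan without filter push down.
> {noformat}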
> Possible enhancements:
> 1. Currently the default row count is 1 million, but if sampling returns
> fewer rows than expected, the query cannot return more rows than that number.
> We can use this observed count instead of the default row count to achieve
> better cost estimations.
> 2. When filter push down is applied, the row count is reduced by half to
> ensure the plan with filter push down has a lower cost. The same should be
> done for the disk cost as well (a sketch of both adjustments follows this list).
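> A minimal sketch of both adjustments together (the variable names, the early-exhaustion flag, and the 0.5 factor are illustrative assumptions, not Drill's exact implementation):
> {noformat}
> final long DEFAULT_ROW_COUNT = 1_000_000L;
> long sampledRows = 250_000L;      // sampling stopped early: the table
> boolean sampleExhausted = true;   // has no more rows than this
> boolean filterPushedDown = true;
> float diskCost = 50_000_000f;
>
> // 1. Prefer the observed upper bound over the default row count.
> long rowCount = sampleExhausted ? sampledRows : DEFAULT_ROW_COUNT;
>
> // 2. Halve the disk cost together with the row count, so a plan with
> //    filter push down is consistently cheaper than one without.
> if (filterPushedDown) {
>   rowCount /= 2;
>   diskCost /= 2;
> }
> {noformat}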
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)