[jira] [Updated] (DRILL-2553) Cost calculation fails to properly choose single file scan in favor of a multi-file scan when files are small

Aman Sinha (JIRA) Tue, 07 Jul 2015 05:59:33 -0700

     [ 
https://issues.apache.org/jira/browse/DRILL-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Aman Sinha updated DRILL-2553:
------------------------------
    Fix Version/s:     (was: 1.2.0)
                   1.3.0

> Cost calculation fails to properly choose single file scan in favor of a 
> multi-file scan when files are small
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-2553
>                 URL: https://issues.apache.org/jira/browse/DRILL-2553
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 0.8.0
>            Reporter: Jason Altekruse
>            Assignee: Aman Sinha
>             Fix For: 1.3.0
>
>
> There is a failing test case in the patch for constant folding that should be 
> checked in soon. The test attempts to prune out one directory of a scan after 
> a constant expression returning the name of a directory is folded, but the 
> files being read from both directories are very small. Our current method of 
> calculating cost makes the pruned and unpruned plans report the same cost. 
> This could be fixed in a few different places. EasyGroupScan.getScanStats(), 
> which is used here, could factor the file count into its calculation of the 
> total row count. We could also move to a two-part metric that tracks the 
> number of files in addition to the estimated row count. 
> This would require some changes in the cost calculation of the scan rels 
> themselves, which use the information from the scan stats. In general, I think 
> we should consider solving this as high up as possible, because we want cost 
> estimates to be as accurate as possible even when the information provided by 
> storage plugins is not completely accurate. For example, even disregarding 
> the row count reported by EasyGroupScan, the rel nodes have knowledge of the 
> number of partitions. It seems like at this level we should be able to avoid 
> picking the plan that has a superset of the partitions of the other possible 
> plan.
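
The two-part metric suggested in the description might look roughly like the sketch below. It is only an illustration: ScanCostSketch, ScanEstimate, CHEAPEST_FIRST and cheaper() are hypothetical names, not Drill classes or methods, and the row counts are made up. The sketch assumes both plans report the same estimated row count and breaks the tie in favor of the plan that reads fewer files.

```java
import java.util.Comparator;

// Hypothetical sketch: none of these types are Drill classes.
final class ScanCostSketch {

    /** Two-part scan estimate: estimated row count plus number of files read. */
    static final class ScanEstimate {
        final double rowCount;
        final int fileCount;

        ScanEstimate(double rowCount, int fileCount) {
            this.rowCount = rowCount;
            this.fileCount = fileCount;
        }
    }

    /** Orders by row count first; ties are broken in favor of fewer files. */
    static final Comparator<ScanEstimate> CHEAPEST_FIRST =
        Comparator.comparingDouble((ScanEstimate e) -> e.rowCount)
                  .thenComparingInt(e -> e.fileCount);

    /** Returns the cheaper of two estimates under CHEAPEST_FIRST. */
    static ScanEstimate cheaper(ScanEstimate a, ScanEstimate b) {
        return CHEAPEST_FIRST.compare(a, b) <= 0 ? a : b;
    }

    public static void main(String[] args) {
        // Assumed numbers: both plans report the same estimated row count.
        ScanEstimate unpruned = new ScanEstimate(100.0, 2);
        ScanEstimate pruned = new ScanEstimate(100.0, 1);

        ScanEstimate chosen = cheaper(unpruned, pruned);
        System.out.println("files in chosen plan: " + chosen.fileCount);
    }
}
```

With a row count alone the two estimates above compare as equal; the secondary file count is what lets the pruned plan win. A real fix would still have to decide where such a comparison belongs (the scan stats, the scan rels, or the planner cost), which the description leaves open.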



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
