[jira] [Updated] (DRILL-2553) Cost calculation fails to properly choose single file scan in favor of a multi-file scan when files are small

Jason Altekruse (JIRA) Tue, 24 Mar 2015 18:56:04 -0700

     [ 
https://issues.apache.org/jira/browse/DRILL-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jason Altekruse updated DRILL-2553:
-----------------------------------
    Description: There is a failing test case in the patch for constant folding 
that should be checked in soon. The test attempts to prune out one directory of 
a scan after a constant expression returning the name of a directory is folded, 
but the files being read from both directories are very small. Our current 
method of calculating cost makes the pruned and unpruned plans report the same 
cost. This could be fixed in a few different places. 
EasyGroupScan.getScanStats(), which is used here, could factor the file count 
into its calculation of the total row count. We could also move to a two-part 
metric that tracks the number of files in addition to the estimated row count. 
This would require some changes in the cost calculation of the scan rels 
themselves, which use the information from the scan stats. I think in general 
we should consider solving this as high up as possible, since we want cost 
estimates to be as good as possible even when the information provided by 
storage plugins is not completely accurate. For example, even disregarding the 
row count reported by EasyGroupScan, the rel nodes know the number of 
partitions. At this level we should be able to avoid picking the plan that 
reads a superset of the partitions of the other possible plan.  (was: 
There is a failing test case in the patch for constant folding that should be 
checked in soon. The test attempts to prune out one directory of a scan after a 
constant expression returning the name of a directory is folded, but the files 
being read from both directories are very small. Our current method of 
calculating cost makes the pruned and unpruned plans report the same cost. This 
could be fixed in a few different places. EasyGroupScan.getScanStats(), which 
is used here, could factor the file count into its calculation of the total row 
count. We could also move to a two-part metric that tracks the number of files 
in addition to the estimated row count. This would require some changes in the 
cost calculation of the scan rels themselves, which use the information from 
the scan stats. I think in general we should consider solving this as high up 
as possible, since we want cost estimates to be as good as possible even when 
the information provided by storage plugins is not completely accurate.)
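
The two-part metric idea from the description can be illustrated with a 
minimal Java sketch. This is not Drill code: ScanEstimate, its fields, and 
cheaper() are hypothetical names invented for illustration, and the sketch 
assumes both plans end up with the same estimated row count, so the file count 
is only used to break the tie.

```java
import java.util.Comparator;

/** Hypothetical two-part scan estimate: estimated rows plus number of files read. */
final class ScanEstimate {
    final double rowCount;
    final int fileCount;

    ScanEstimate(double rowCount, int fileCount) {
        this.rowCount = rowCount;
        this.fileCount = fileCount;
    }

    /**
     * Orders estimates by row count first. When the row counts are equal,
     * the estimate that opens fewer files sorts first (is treated as cheaper).
     */
    static final Comparator<ScanEstimate> CHEAPEST_FIRST =
        Comparator.comparingDouble((ScanEstimate e) -> e.rowCount)
                  .thenComparingInt(e -> e.fileCount);

    /** Returns the cheaper of two estimates; a full tie goes to the first argument. */
    static ScanEstimate cheaper(ScanEstimate a, ScanEstimate b) {
        return CHEAPEST_FIRST.compare(a, b) <= 0 ? a : b;
    }

    public static void main(String[] args) {
        // Assumption: the files are so small that both plans get the same row estimate.
        ScanEstimate pruned = new ScanEstimate(1.0, 1);   // one directory, one file
        ScanEstimate unpruned = new ScanEstimate(1.0, 2); // both directories
        ScanEstimate chosen = cheaper(unpruned, pruned);
        System.out.println("files in chosen plan: " + chosen.fileCount);
    }
}
```

With a row-count-only comparison the two estimates above would be 
indistinguishable; the secondary key is what lets the pruned plan win.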

> Cost calculation fails to properly choose single file scan in favor of a 
> multi-file scan when files are small
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-2553
>                 URL: https://issues.apache.org/jira/browse/DRILL-2553
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 0.8.0
>            Reporter: Jason Altekruse
>            Assignee: Aman Sinha
>
> There is a failing test case in the patch for constant folding that should be 
> checked in soon. The test attempts to prune out one directory of a scan after 
> a constant expression returning the name of a directory is folded, but the 
> files being read from both directories are very small. Our current method of 
> calculating cost makes the pruned and unpruned plans report the same cost. 
> This could be fixed in a few different places. 
> EasyGroupScan.getScanStats(), which is used here, could factor the file count 
> into its calculation of the total row count. We could also move to a two-part 
> metric that tracks the number of files in addition to the estimated row 
> count. This would require some changes in the cost calculation of the scan 
> rels themselves, which use the information from the scan stats. I think in 
> general we should consider solving this as high up as possible, since we want 
> cost estimates to be as good as possible even when the information provided 
> by storage plugins is not completely accurate. For example, even disregarding 
> the row count reported by EasyGroupScan, the rel nodes know the number of 
> partitions. At this level we should be able to avoid picking the plan that 
> reads a superset of the partitions of the other possible plan.
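
The partition-superset check suggested at the end of the description could 
look roughly like the Java sketch below. It is an illustration only, not the 
planner's actual logic: PartitionDominance, readsStrictSubset(), choose(), and 
the directory names "1994" and "1995" are all hypothetical. The assumption is 
that each candidate plan can be summarized by the set of partitions it reads 
plus a reported cost.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper that compares two plans by the partitions they read. */
final class PartitionDominance {

    /** True when candidate reads a strict subset of the partitions that other reads. */
    static boolean readsStrictSubset(Set<String> candidate, Set<String> other) {
        return other.containsAll(candidate) && candidate.size() < other.size();
    }

    /**
     * Chooses between two plans. A plan whose partitions are a strict subset of
     * the other's is preferred regardless of reported cost; otherwise the lower
     * reported cost wins, with ties going to plan A.
     */
    static Set<String> choose(Set<String> planA, double costA,
                              Set<String> planB, double costB) {
        if (readsStrictSubset(planA, planB)) {
            return planA;
        }
        if (readsStrictSubset(planB, planA)) {
            return planB;
        }
        return costA <= costB ? planA : planB;
    }

    public static void main(String[] args) {
        // Hypothetical directory names standing in for the two test directories.
        Set<String> unpruned = new HashSet<>(Arrays.asList("1994", "1995"));
        Set<String> pruned = new HashSet<>(Arrays.asList("1995"));

        // Same reported cost for both plans, as in the failing test.
        Set<String> chosen = choose(unpruned, 1.0, pruned, 1.0);
        System.out.println("chosen partitions: " + chosen);
    }
}
```

Letting the subset check override the reported cost is a deliberate choice in 
this sketch, in line with the point that the rel nodes should not have to 
trust inaccurate statistics from a storage plugin.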



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
