Jason Altekruse created DRILL-2553:
--------------------------------------
Summary: Cost calculation fails to properly choose single file
scan in favor of a multi-file scan when files are small
Key: DRILL-2553
URL: https://issues.apache.org/jira/browse/DRILL-2553
Project: Apache Drill
Issue Type: Bug
Components: Query Planning & Optimization
Affects Versions: 0.8.0
Reporter: Jason Altekruse
Assignee: Aman Sinha
The patch for constant folding, which should be checked in soon, contains a
failing test case. The test attempts to prune one directory out of a scan after
a constant expression returning the name of a directory is folded, but the
files being read from both directories are very small. Our current method of
calculating cost makes the pruned and unpruned plans report the same cost, so
the planner has no basis for preferring the pruned plan. This could be fixed in
a few different locations. EasyGroupScan.getScanStats(), which is used here,
could factor the file count into its calculation of the total row count. We
could also move to a two-part metric that tracks the number of files in
addition to the estimated row count. This would require some changes in the
cost calculation of the scan rels themselves, which use the information from
the scan stats. I think in general we should consider solving this as high up
as possible, since we want cost estimates to be as good as possible even when
the information provided by storage plugins is not completely accurate.
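
As a rough illustration of the two-part metric idea above, here is a minimal
sketch. It does not use Drill's actual classes: ScanEstimate, rowsOnlyCost,
fileAwareCost and the PER_FILE_OVERHEAD_ROWS constant are hypothetical names
and values chosen for the example, and the row counts are made up so that the
rows-only estimates tie, as they do in the failing test.

```java
// Hypothetical sketch, not Drill code: compares a rows-only scan cost with a
// cost that also charges a fixed overhead for every file the scan opens.
class ScanCostSketch {

    /** Hypothetical two-part estimate: estimated rows plus files to open. */
    static final class ScanEstimate {
        final long rowCount;
        final int fileCount;

        ScanEstimate(long rowCount, int fileCount) {
            this.rowCount = rowCount;
            this.fileCount = fileCount;
        }
    }

    // Assumed per-file overhead, expressed in "row equivalents". The value is
    // arbitrary and would need tuning against real workloads.
    static final double PER_FILE_OVERHEAD_ROWS = 100.0;

    /** Cost based on the row estimate alone. */
    static double rowsOnlyCost(ScanEstimate e) {
        return (double) e.rowCount;
    }

    /** Cost that adds a fixed charge for each file in the scan. */
    static double fileAwareCost(ScanEstimate e) {
        return e.rowCount + PER_FILE_OVERHEAD_ROWS * e.fileCount;
    }

    public static void main(String[] args) {
        // Made-up numbers: both plans report the same tiny row estimate,
        // but the unpruned plan reads one more file.
        ScanEstimate pruned = new ScanEstimate(1, 1);
        ScanEstimate unpruned = new ScanEstimate(1, 2);

        // The rows-only costs tie, so neither plan looks cheaper.
        System.out.println(rowsOnlyCost(pruned) == rowsOnlyCost(unpruned));

        // The file-aware cost breaks the tie in favor of the pruned plan.
        System.out.println(fileAwareCost(pruned) < fileAwareCost(unpruned));
    }
}
```

Folding the file count into a single number like this is closer to the
getScanStats() option. Carrying rowCount and fileCount separately up to the
scan rels would correspond to the two-part metric, at the price of changing
how those rels compute cost.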