[
https://issues.apache.org/jira/browse/IMPALA-7608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sahil Takiar updated IMPALA-7608:
---------------------------------
Parent: IMPALA-10083
Issue Type: Sub-task (was: Improvement)
> Estimate row count from file size when no stats available
> ---------------------------------------------------------
>
> Key: IMPALA-7608
> URL: https://issues.apache.org/jira/browse/IMPALA-7608
> Project: IMPALA
> Issue Type: Sub-task
> Components: Frontend
> Affects Versions: Impala 3.0
> Reporter: Paul Rogers
> Assignee: Fang-Yu Rao
> Priority: Major
> Fix For: Impala 3.3.0, Impala 3.4.0
>
>
> Impala makes heavy use of stats, which is a good thing. Stats feed into query
> planning where they allow the planner to choose among a fixed set of
> alternatives such as: do I put t1 on the build or probe side of a join?
> Because the planner decisions tend to be discrete, we only need enough
> information to decide whether to do A or B (or, more generally, to choose
> among a set of choices A, B, C, ... N).
> Often data sizes are vastly different on different paths. Stats help refine
> these numbers, but much of the information just needs to be in the ballpark:
> is table t1 larger or smaller than t2? Often, one table is much larger than
> the other, so even a rough size estimate will force the right decision (put
> the smaller table on the build side of a join).
> Today, if Impala has no stats, it refuses to even consider table size.
> Consider the following unit test:
> {noformat}
> runTest("SELECT a FROM functional.tinytable;", -1);
> {noformat}
> This plans the given query, then verifies that the planner's estimated result
> cardinality matches the number given. In this case, {{tinytable}} has no
> stats, so the cardinality is unknown, which the planner reports as -1.
> The table turns out to have 3 rows. Suppose I join it to a hypothetical
> {{hugetable}} of 1 million rows. Without even a guess at cardinality, Impala
> can't choose a good plan.
> The suggestion is to use table size to estimate row cardinality. Come up with
> some assumed row width, say 100 bytes. Then, estimate row count as {{file size /
> est. row width}}. This gives a ballpark number that would be plenty good for
> the planner to choose the proper plan much of the time.
> Since this is such an easy estimate to make, and will address the occasional
> case in which stats are not available, it seems a shame to not take advantage
> of this information.
> In terms of implementation, {{HdfsScanNode.computeCardinalities()}} already
> uses some extrapolation, if enabled. It can be extended to do the last-ditch
> extrapolation suggested above if, after the current techniques, the
> cardinality is still undefined.
> If we apply this simple fix in a prototype build, the new test result is
> closer to reality:
> {noformat}
> runTest("SELECT a FROM functional.tinytable;", 1);
> {noformat}
> Given that the fix is so simple, is there any reason not to use the file
> size when it is available? Is 100 a reasonable assumed row width? Should this
> functionality always be on, not just when enabled using the back-end config?
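For illustration only, the fallback described above could be sketched roughly as follows. This is not the actual {{HdfsScanNode}} code: the class name, method names, and the round-up rule are assumptions made for this sketch, and the file sizes in {{main}} are hypothetical. The 100-byte row width is the figure floated in the description, and -1 stands for "cardinality unknown" as in the test above.

```java
class RowCountEstimate {
    // Assumed average row width in bytes; 100 is the value suggested in the issue.
    static final long ASSUMED_ROW_WIDTH_BYTES = 100;

    // Rough row count from total file size. Returns -1 when no estimate is possible.
    static long estimateFromFileSize(long totalFileBytes, long rowWidthBytes) {
        if (totalFileBytes < 0 || rowWidthBytes <= 0) return -1;
        // Round up so that any non-empty file yields at least one row
        // (a choice made for this sketch, not taken from the issue).
        return totalFileBytes / rowWidthBytes
                + (totalFileBytes % rowWidthBytes != 0 ? 1 : 0);
    }

    // Last-ditch fallback: used only when stats and extrapolation produced nothing.
    static long cardinalityOrFallback(long knownCardinality, long totalFileBytes) {
        if (knownCardinality >= 0) return knownCardinality;
        return estimateFromFileSize(totalFileBytes, ASSUMED_ROW_WIDTH_BYTES);
    }

    public static void main(String[] args) {
        long tiny = cardinalityOrFallback(-1, 38L);           // hypothetical 38-byte file
        long huge = cardinalityOrFallback(-1, 100_000_000L);  // hypothetical 100 MB of files
        System.out.println("tiny=" + tiny + " huge=" + huge);
    }
}
```

Even with a crude width guess, the two estimates differ by orders of magnitude, which is all the planner needs to pick the build side. A known cardinality is passed through unchanged, so the estimate never overrides real stats.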