[jira] [Created] (IMPALA-7608) Estimate row count from file size when no stats available

Paul Rogers (JIRA) Fri, 21 Sep 2018 15:29:25 -0700

Paul Rogers created IMPALA-7608:
-----------------------------------

             Summary: Estimate row count from file size when no stats available
                 Key: IMPALA-7608
                 URL: https://issues.apache.org/jira/browse/IMPALA-7608
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
    Affects Versions: Impala 3.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers



Impala makes heavy use of stats, which is a good thing. Stats feed into query 
planning where they allow the planner to choose among a fixed set of 
alternatives such as: do I put t1 on the build or probe side of a join?

Because the planner decisions tend to be discrete, we only need enough 
information to decide whether to do A or B (or, more generally, to choose among 
a set of choices A, B, C, ... N).

Often data sizes are vastly different on different paths. Stats help refine 
these numbers, but much of the information just needs to be in the ballpark: 
is table t1 larger or smaller than t2? Often, one table is much larger than the 
other, so even a rough size estimate will force the right decision (put the 
smaller table on the build side of a join).
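
To illustrate why a ballpark figure is enough, here is a minimal sketch of a discrete build-side choice. It is not Impala's planner code; the class {{JoinSideChooser}}, the enum {{Side}} and the method {{chooseBuildSide()}} are hypothetical names made up for this example.

```java
// Illustrative sketch only; none of these names exist in Impala.
final class JoinSideChooser {
  enum Side { LEFT, RIGHT }

  /** Puts the input with the smaller (estimated) row count on the build side. */
  static Side chooseBuildSide(long leftRows, long rightRows) {
    return leftRows <= rightRows ? Side.LEFT : Side.RIGHT;
  }

  public static void main(String[] args) {
    // Exact counts: 3 rows vs. 1 million rows.
    System.out.println(chooseBuildSide(3, 1_000_000));
    // Rough estimates that are off by a wide margin lead to the same choice.
    System.out.println(chooseBuildSide(1, 10_000_000));
  }
}
```

Because the decision only depends on which side is smaller, an estimate that is off by a large factor still selects the same side whenever the true sizes differ by more than that factor.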

Today, if Impala has no stats, it refuses to even consider table size. Consider 
the following unit test:

{noformat}
    runTest("SELECT a FROM functional.tinytable;", -1);
{noformat}

This plans the given query, then verifies that the planned result cardinality 
matches the number given. In this case, {{tinytable}} has no stats, so we don't 
know the cardinality and the planner reports -1 (unknown). OK...

The table turns out to have 3 rows. Suppose I join this to a hypothetical 
{{hugetable}} of 1 million rows. Without even a guess at cardinality, Impala 
can't choose a good plan.

The suggestion is to use table size to estimate row cardinality. Come up with 
some assumed row width, say 100 bytes. Then, estimate row count as 
{{file size / est. row width}}. This gives a ballpark number that would be 
plenty good for the planner to choose the proper plan much of the time.
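
As a rough sketch of the estimate described above (this is not Impala's actual code): the class {{RowCountEstimate}} and the method {{estimateNumRows()}} are hypothetical names, and both the clamp to a minimum of one row for a non-empty table and the -1 return for a missing file size are assumptions made for illustration.

```java
// Illustrative sketch only; not taken from the Impala code base.
final class RowCountEstimate {
  // Assumed average row width in bytes (the value floated in this ticket).
  static final long ASSUMED_ROW_WIDTH_BYTES = 100;

  /**
   * Returns a last-ditch row-count estimate: file size / assumed row width.
   * Returns -1 (unknown) when the file size is not available.
   */
  static long estimateNumRows(long totalFileBytes, long rowWidthBytes) {
    if (totalFileBytes < 0 || rowWidthBytes <= 0) return -1;
    if (totalFileBytes == 0) return 0;
    // Assumption: a non-empty table is never estimated at 0 rows.
    return Math.max(1, totalFileBytes / rowWidthBytes);
  }

  public static void main(String[] args) {
    // A tiny file of a few dozen bytes estimates to a single row.
    System.out.println(estimateNumRows(38, ASSUMED_ROW_WIDTH_BYTES));
    // A 100 MB file estimates to about a million rows.
    System.out.println(estimateNumRows(100_000_000L, ASSUMED_ROW_WIDTH_BYTES));
  }
}
```

The estimate is deliberately crude: it ignores file format, compression and actual column types, and only aims to separate small tables from large ones.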

Since this is such an easy estimate to make, and will address the occasional 
case in which stats are not available, it seems a shame to not take advantage 
of this information.

In terms of implementation, {{FeFsTable.Utils.getExtrapolatedNumRows()}} 
already does some extrapolation, if enabled. It can be extended to do the 
last-ditch extrapolation suggested above.

If we apply this simple fix in a prototype build, the new test result is closer 
to reality:

{noformat}
    runTest("SELECT a FROM functional.tinytable;", 1);
{noformat}

Given that the fix is so simple, is there any reason not to use the file size, 
when available? Is 100 bytes a reasonable assumed row width? Should this 
functionality always be on, not just when enabled using the back-end config?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
