Not sure what you mean by “depict”.

If you want a description of the statistics, read their documentation. 
Hopefully it’s clear which statistics are exact and which are not.

Obviously we’d all like statistics to be exact. But for many statistics, it’s 
impossible to get an exact answer unless you actually execute the query. The 
row count is one example. An exact value for such a statistic only becomes 
available after execution, so it is useless for purposes of query optimization.
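
To make the exact-versus-approximate distinction concrete, here is a minimal 
sketch of the case raised in the quoted message below: a column of 1000 rows 
that holds only the values 1 and 1000. The class and method names are 
hypothetical, and rangeSuggestsKey is only a stand-in for the range branch of 
the quoted Hive method, not Hive's or Calcite's actual code.

```java
import java.util.Arrays;

public class PrimaryKeyHeuristicDemo {

  // Hypothetical stand-in for the range branch of the quoted method:
  // flag the column as a key when (max - min + 1) equals the row count.
  public static boolean rangeSuggestsKey(long numRows, long minValue, long maxValue) {
    return numRows == (maxValue - minValue) + 1;
  }

  // Ground truth. Note that it has to scan the data to get the answer.
  public static long exactDistinctCount(long[] column) {
    return Arrays.stream(column).distinct().count();
  }

  public static void main(String[] args) {
    // 1000 rows that alternate between the values 1 and 1000.
    long[] column = new long[1000];
    for (int i = 0; i < column.length; i++) {
      column[i] = (i % 2 == 0) ? 1L : 1000L;
    }
    long min = Arrays.stream(column).min().getAsLong();
    long max = Arrays.stream(column).max().getAsLong();

    System.out.println("heuristic says key: "
        + rangeSuggestsKey(column.length, min, max));
    System.out.println("exact distinct count: " + exactDistinctCount(column));
  }
}
```

The range check reports a key even though the column has only two distinct 
values, so a flag derived this way is a guess rather than an exact statistic, 
and it seems safer not to let a planner treat it as a guarantee.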



> On May 7, 2018, at 2:23 AM, [email protected] wrote:
> 
> Can you depict which column statistics should be exact and which can be 
> approximate to get a decent plan?
> 
> On 2018/05/04 05:31:30, [email protected] <[email protected]> 
> wrote: 
>> Yes, ColStatistics is in Hive and it holds all statistics about the columns. 
>> 
>> On 2018/05/03 16:26:02, Julian Hyde <[email protected]> wrote: 
>>> It depends on the statistic. Most of them are approximate.
>>> 
>>> It’s the "garbage in, garbage out" principle. An exact statistic may be of 
>>> a bit more (or a lot more) use to the consumer of the statistic, but is 
>>> more effort for the producer of the statistic.
>>> 
>>> RelMdMaxRowCount is one of the few exact ones. If RelMdMaxRowCount says 10, 
>>> the relation might return 0 rows or 9 rows or 10 rows but never 11 rows.
>>> 
>>> RelMdPredicates is also exact (albeit not numeric). RelMdUniqueKeys is 
>>> exact (which is to say, if it returns a key, that key is definitely unique; 
>>> there may be some unique keys that it does not know about).
>>> 
>>> I don’t know what ColStatistics is. Is it a Hive thing? I surmise that it 
>>> is based on RelMdRowCount, which is approximate.
>>> 
>>> Julian
>>> 
>>> 
>>>> On May 3, 2018, at 5:41 AM, Valli Annamalai <[email protected]> 
>>>> wrote:
>>>> 
>>>> In Hive, column statistics like countDistinct, isPrimaryKey, etc. need
>>>> to be set. While doing so, the following Hive function sets primary key
>>>> to true based on an assumption.
>>>> 
>>>> 
>>>>   public static void inferAndSetPrimaryKey(long numRows,
>>>>       List<ColStatistics> colStats) {
>>>>     if (colStats != null) {
>>>>       for (ColStatistics cs : colStats) {
>>>>         if (cs != null && cs.getCountDistint() >= numRows) {
>>>>           cs.setPrimaryKey(true);
>>>>         } else if (cs != null && cs.getRange() != null
>>>>             && cs.getRange().minValue != null
>>>>             && cs.getRange().maxValue != null) {
>>>>           if (numRows ==
>>>>               ((cs.getRange().maxValue.longValue()
>>>>                   - cs.getRange().minValue.longValue()) + 1)) {
>>>>             cs.setPrimaryKey(true);
>>>>           }
>>>>         }
>>>>       }
>>>>     }
>>>>   }
>>>> 
>>>> Now suppose the column contains only two distinct values, 1 and 1000,
>>>> and numRows is 1000. The function then sets primary key to true, which is
>>>> wrong. During planning, if an aggregation is the next node, that node
>>>> could be skipped on the assumption that a primary key column holds only
>>>> unique values.
>>>> 
>>>> If Hive sets the primary key based on the assumption above, and Calcite
>>>> also plans with that assumption, then the result will be wrong. How can
>>>> this be solved?
>>>> 
>>>> Similarly, for count distinct, is it okay to give approximate values to
>>>> Calcite?
>>> 
>>> 
>> 
