Re: Are all the statistics given to calcite, need to be exact or approximate?

aishwaryaanns Mon, 07 May 2018 07:17:40 -0700

Can you describe which column statistics need to be exact and which can be 
approximate to get a decent plan?


On 2018/05/04 05:31:30, [email protected] <[email protected]> 
wrote: 
> Yes, ColStatistics is in Hive and it holds all statistics about the columns. 
> 
> On 2018/05/03 16:26:02, Julian Hyde <[email protected]> wrote: 
> > It depends on the statistic. Most of them are approximate.
> > 
> > It’s the "garbage in, garbage out" principle. An exact statistic may be of 
> > a bit more (or a lot more) use to the consumer of the statistic, but is 
> > more effort for the producer of the statistic.
> > 
> > RelMdMaxRowCount is one of the few exact ones. If RelMdMaxRowCount says 10, 
> > the relation might return 0 rows or 9 rows or 10 rows but never 11 rows.
> > 
> > RelMdPredicates is also exact (albeit not numeric). RelMdUniqueKeys is 
> > exact (which is to say, if it returns a key, it is definitely unique; there 
> > may be some unique keys that it does not know about).
> > 
> > I don’t know what ColStatistics is. Is it a Hive thing? I surmise that it 
> > is based on RelMdRowCount, which is approximate.
> > 
> > Julian
> > 
> > 
> > > On May 3, 2018, at 5:41 AM, Valli Annamalai <[email protected]> 
> > > wrote:
> > > 
> > > In Hive, column statistics such as countDistinct, isPrimaryKey, etc. need
> > > to be set. While doing so, the following Hive function sets primary key
> > > to true based on an assumption.
> > > 
> > > 
> > >    public static void inferAndSetPrimaryKey(long numRows,
> > >        List<ColStatistics> colStats) {
> > >      if (colStats != null) {
> > >        for (ColStatistics cs : colStats) {
> > >          if (cs != null && cs.getCountDistint() >= numRows) {
> > >            cs.setPrimaryKey(true);
> > >          } else if (cs != null && cs.getRange() != null
> > >              && cs.getRange().minValue != null
> > >              && cs.getRange().maxValue != null) {
> > >            if (numRows == ((cs.getRange().maxValue.longValue()
> > >                - cs.getRange().minValue.longValue()) + 1)) {
> > >              cs.setPrimaryKey(true);
> > >            }
> > >          }
> > >        }
> > >      }
> > >    }
> > > 
> > > If this is the case, suppose the column contains only two distinct values,
> > > 1 and 1000, and numRows is 1000. The range check passes (1000 - 1 + 1 ==
> > > 1000), yet marking the column as a primary key would be wrong. During
> > > planning, if an aggregation is the next node, the planner could skip it on
> > > the assumption that a primary key column holds only unique values.
> > > 
> > > If Hive sets the primary key based on the assumption above and Calcite
> > > proceeds on the same assumption, then the result will also be wrong. How
> > > can this be solved?
> > > 
> > > Similarly, for count distinct, is it okay to give approximate values to
> > > Calcite?
> > 
> > 
>
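
The concern raised in the original question can be reproduced with a minimal
Java sketch. It does not use Hive or Calcite classes: rangeSuggestsPrimaryKey
is a hypothetical stand-in for the range branch of inferAndSetPrimaryKey
quoted above, and isActuallyUnique and sampleColumn are illustrative helpers.
The column holds only the values 1 and 1000 spread over 1000 rows, as in the
example from the thread.

```java
import java.util.HashSet;
import java.util.Set;

final class PrimaryKeyHeuristicDemo {

    // Hypothetical stand-in for the range branch of the quoted Hive code:
    // treat the column as a primary key when numRows == max - min + 1.
    static boolean rangeSuggestsPrimaryKey(long numRows, long minValue, long maxValue) {
        return numRows == (maxValue - minValue) + 1;
    }

    // Ground truth: the column is unique only if no value repeats.
    static boolean isActuallyUnique(long[] column) {
        Set<Long> seen = new HashSet<>();
        for (long value : column) {
            if (!seen.add(value)) {
                return false; // duplicate found
            }
        }
        return true;
    }

    // 1000 rows that alternate between the two values 1 and 1000.
    static long[] sampleColumn() {
        long[] column = new long[1000];
        for (int i = 0; i < column.length; i++) {
            column[i] = (i % 2 == 0) ? 1L : 1000L;
        }
        return column;
    }

    public static void main(String[] args) {
        long[] column = sampleColumn();

        // Derive the min/max range statistics from the data.
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long value : column) {
            min = Math.min(min, value);
            max = Math.max(max, value);
        }

        System.out.println("range heuristic says primary key: "
            + rangeSuggestsPrimaryKey(column.length, min, max));
        System.out.println("column is actually unique: "
            + isActuallyUnique(column));
    }
}
```

Here the heuristic reports true while the uniqueness check reports false. A
planner that treats the inferred flag as an exact uniqueness guarantee, in the
sense described for RelMdUniqueKeys above, could therefore drop an aggregation
that is still needed.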
