[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Andrew Wong (Code Review) Thu, 22 Aug 2019 23:53:07 -0700

Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )


Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 2:

> Patch Set 2:
>
> > Actually, I have considered putting those statistics into an existing
> > message such as GetSchema, which is used when opening a table. But since
> > the table statistics may contain more and more entries, it may be better
> > for this to be an independent RPC. What's more, because these statistics
> > may change while a client is writing, I think it would be useful to
> > support an RPC for the client to fetch the updated statistics. For query
> > planning the statistics are used only once, and putting them into an
> > existing RPC saves one round trip, but that RPC would become heavy if we
> > want to fetch statistics changes in other use cases.
>
> How are the stats useful for a writing client (i.e. an Impala or Spark 
> executor) vs. the query planner?
>
> What are the other use cases you're thinking about? If they're hypothetical, 
> we could add the stats to GetTableLocations now, and also expose them via new 
> RPC later, when those other use cases materialize.
>
> >  > If doing this, let's make sure to make it optional, since the
> >  > statistics could become larger later (eg with things like
> >  > histograms, NDVs per column, etc)
> >
> > Will make it optional if we do that :)
>
> Agreed.

Here are some stats that systems use:
- size in bytes
- number of rows
- per-column stats
  - distinct count
  - min/max/histogram

The per-column stats seem like they wouldn't be a good fit for the GetSchema
endpoint, since values such as min/max and histogram bounds are drawn from
the rows themselves and may end up returning user data.
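
To make that distinction concrete, here is a rough Python sketch of the stats
listed above. The names (TableStats, ColumnStats, exposes_user_data) are made
up for illustration and are not Kudu or Spark APIs; the only point is that
min/max/histogram values are copied from rows, while sizes and counts are not.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ColumnStats:
    """Hypothetical per-column stats; not a Kudu or Spark type."""
    distinct_count: Optional[int] = None
    min_value: Optional[Any] = None
    max_value: Optional[Any] = None
    histogram_bounds: List[Any] = field(default_factory=list)

    def exposes_user_data(self) -> bool:
        # min/max and histogram bucket bounds are values taken from rows;
        # a distinct count is only a number.
        return (self.min_value is not None
                or self.max_value is not None
                or bool(self.histogram_bounds))


@dataclass
class TableStats:
    """Hypothetical table-level stats; every field is optional."""
    size_bytes: Optional[int] = None
    row_count: Optional[int] = None
    columns: Dict[str, ColumnStats] = field(default_factory=dict)

    def exposes_user_data(self) -> bool:
        return any(c.exposes_user_data() for c in self.columns.values())


coarse = TableStats(size_bytes=1 << 20, row_count=1000)
detailed = TableStats(
    size_bytes=1 << 20,
    row_count=1000,
    columns={"name": ColumnStats(distinct_count=900,
                                 min_value="aaron", max_value="zoe")},
)
print(coarse.exposes_user_data())
print(detailed.exposes_user_data())
```

In this sketch only the second object carries values that came out of the
table's rows, which is the part I'd be wary of returning from a metadata call.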

So I'm hesitant about adding stats in with GetSchema; from an authorization 
point of view, getting the number of rows in a table, for instance, currently 
requires some scan privileges. If we tie it into the GetSchema RPC, this is no 
longer the case.
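
As a toy illustration of the concern, and of one possible way around it, the
sketch below gates stats on a separate privilege even when they ride along
with a schema response. Everything here (the Privilege enum, the get_schema
function, the dict-shaped table and response) is hypothetical and does not
reflect Kudu's actual RPC handlers or privilege model.

```python
from enum import Enum, auto
from typing import Any, Dict, Set


class Privilege(Enum):
    METADATA = auto()  # hypothetical: enough to read the schema
    SCAN = auto()      # hypothetical: required to read row data


def get_schema(user_privs: Set[Privilege],
               table: Dict[str, Any],
               include_stats: bool = False) -> Dict[str, Any]:
    """Toy schema handler: stats are optional and need SCAN on top of METADATA."""
    if Privilege.METADATA not in user_privs:
        raise PermissionError("METADATA privilege required")
    resp: Dict[str, Any] = {"schema": table["schema"]}
    # Without this second check, anyone who can read the schema would also
    # learn the row count, which today takes a scan to find out.
    if include_stats and Privilege.SCAN in user_privs:
        resp["row_count"] = table["row_count"]
    return resp


table = {"schema": ["key INT64", "val STRING"], "row_count": 42}
metadata_only = {Privilege.METADATA}
reader = {Privilege.METADATA, Privilege.SCAN}

print(get_schema(metadata_only, table, include_stats=True))
print(get_schema(reader, table, include_stats=True))
```

A separate stats RPC would avoid mixing the two checks in one handler, at the
cost of the extra round trip discussed above.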

Some info about Spark:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-CatalogStatistics.html
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-ColumnStat.html


--
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 2
Gerrit-Owner: ZhangYao <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Grant Henke <[email protected]>
Gerrit-Reviewer: Hao Hao <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: ZhangYao <[email protected]>
Gerrit-Comment-Date: Fri, 23 Aug 2019 06:52:43 +0000
Gerrit-HasComments: No
