Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )
Change subject: KUDU-2921: Exposing the table statistics to spark relation. ...................................................................... Patch Set 2: > Patch Set 2: > > > Actually, I have considered to put those statistics into existing message > > such as getShema which will be used when opening table. But as for the > > table statistics may contain more and more entries it may be better to be a > > independent rpc. What's more, because this statistics may change when > > client trying to write, so I think it will be useful to support a rpc for > > client to get the statistics change. Although for query plan it will only > > use once and save one round trip if we put it into existing rpc but it will > > be heavy if we want to get statistic change in other use cases. > > How are the stats useful for a writing client (i.e. an Impala or Spark > executor) vs. the query planner? > > What are the other use cases you're thinking about? If they're hypothetical, > we could add the stats to GetTableLocations now, and also expose them via new > RPC later, when those other use cases materialize. > > > > If doing this, let's make sure to make it optional, since the > > > statistics could become larger later (eg with things like > > > histograms, NDVs per column, etc) > > > > Will make it optional if do that :) > > Agreed. Here are some stats that systems use: - size in bytes - number of rows - per-column stats - distinct count - min/max/histogram The per-column stats seem like they wouldn't be a good fit for the GetSchema endpoint, since they may end up returning user data. So I'm hesitant about adding stats in with GetSchema; from an authorization point of view, getting the number of rows in a table, for instance, currently requires some scan privileges. If we tie it into the GetSchema RPC, this is no longer the case. Some info about Spark: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-CatalogStatistics.html https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-ColumnStat.html -- To view, visit http://gerrit.cloudera.org:8080/14107 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175 Gerrit-Change-Number: 14107 Gerrit-PatchSet: 2 Gerrit-Owner: ZhangYao <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Grant Henke <[email protected]> Gerrit-Reviewer: Hao Hao <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Reviewer: ZhangYao <[email protected]> Gerrit-Comment-Date: Fri, 23 Aug 2019 06:52:43 +0000 Gerrit-HasComments: No
