ZhangYao has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )
Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 2:

> > Actually, I have considered putting those statistics into an
> > existing message such as GetSchema, which is used when opening a
> > table. But since the table statistics may gain more and more
> > entries, it may be better as an independent RPC. What's more,
> > because these statistics may change while a client is writing, I
> > think it would be useful to support an RPC for the client to get
> > statistics changes. For query planning they will only be used once,
> > and putting them into an existing RPC would save one round trip,
> > but that would be heavy if we want to get statistics changes in
> > other use cases.
>
> How are the stats useful for a writing client (i.e. an Impala or
> Spark executor) vs. the query planner?
>
> What are the other use cases you're thinking about? If they're
> hypothetical, we could add the stats to GetTableLocations now, and
> also expose them via a new RPC later, when those other use cases
> materialize.
>
> > > If doing this, let's make sure to make it optional, since the
> > > statistics could become larger later (e.g. with things like
> > > histograms, NDVs per column, etc.)
> >
> > Will make it optional if we do that :)
>
> Agreed.

Yeah, agree with you :). They are hypothetical. If we put the stats
into GetTableLocations, I want to make sure it is called before the
Spark planner uses sizeInBytes. Currently in KuduRelation, I find that
GetTableLocations is called in buildScan, i.e. when we actually scan
data in Kudu, but sizeInBytes is used before the real scan. So if we
put the stats into GetTableLocations, we may end up with an empty
sizeInBytes. Also, the client has a location cache, so if we use the
same client to read between writes, it may not update the statistics.
But I still need to make sure about that :)


Patch Set 2:

> Here are some stats that systems use:
> - size in bytes
> - number of rows
> - per-column stats
>   - distinct count
>   - min/max/histogram
>
> The per-column stats seem like they wouldn't be a good fit for the
> GetSchema endpoint, since they may end up returning user data.
>
> So I'm hesitant about adding stats in with GetSchema; from an
> authorization point of view, getting the number of rows in a table,
> for instance, currently requires some scan privileges. If we tie it
> into the GetSchema RPC, this is no longer the case.
>
> Some info about Spark:
> https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-CatalogStatistics.html
> https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-ColumnStat.html

Agree with you, got it. I also think it is not suitable to put the
stats into GetSchema. :)


--
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 2
Gerrit-Owner: ZhangYao <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Grant Henke <[email protected]>
Gerrit-Reviewer: Hao Hao <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: ZhangYao <[email protected]>
Gerrit-Comment-Date: Sat, 24 Aug 2019 17:55:52 +0000
Gerrit-HasComments: No
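
The location-cache staleness concern raised in the thread can be sketched in plain Java. This is a hypothetical model only (none of these class or method names are real Kudu or Spark APIs): it shows why statistics piggybacked on a cached GetTableLocations response can go stale between writes, while a dedicated GetTableStatistics-style RPC pays one extra round trip but always reflects the latest state.

```java
// Hypothetical sketch: models the trade-off discussed above, not real Kudu code.
public class StatsStalenessSketch {
    // Server-side "truth": the on-disk size grows as the client writes.
    static class TabletServer {
        long onDiskSizeBytes = 0;
    }

    // Client with a location cache that also carries piggybacked stats.
    static class Client {
        private final TabletServer server;
        private Long cachedSizeBytes = null; // filled on first lookup

        Client(TabletServer server) { this.server = server; }

        // Stats piggybacked on GetTableLocations: cached alongside the
        // locations, so repeated lookups keep returning the first value seen.
        long sizeFromLocationCache() {
            if (cachedSizeBytes == null) {
                cachedSizeBytes = server.onDiskSizeBytes;
            }
            return cachedSizeBytes;
        }

        // Dedicated statistics RPC: an extra round trip, but never stale.
        long getTableStatistics() { return server.onDiskSizeBytes; }

        void write(long bytes) { server.onDiskSizeBytes += bytes; }
    }

    public static void main(String[] args) {
        TabletServer server = new TabletServer();
        Client client = new Client(server);
        client.write(100);
        System.out.println(client.sizeFromLocationCache()); // 100
        client.write(50);
        System.out.println(client.sizeFromLocationCache()); // still 100: stale cache
        System.out.println(client.getTableStatistics());    // 150: fresh
    }
}
```

This mirrors the point about reading between writes with the same client: unless the location cache is invalidated, a planner consulting the piggybacked value would see the pre-write size.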
