ZhangYao has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 2:

> > Actually, I have considered to put those statistics into existing
 > message such as getShema which will be used when opening table. But
 > as for the table statistics may contain more and more entries it
 > may be better to be a independent rpc. What's more, because this
 > statistics may change when client trying to write, so I think it
 > will be useful to support a rpc for client to get the statistics
 > change. Although for query plan it will only use once and save one
 > round trip if we put it into existing rpc but it will be heavy if
 > we want to get statistic change in other use cases.
 >
 > How are the stats useful for a writing client (i.e. an Impala or
 > Spark executor) vs. the query planner?
 >
 > What are the other use cases you're thinking about? If they're
 > hypothetical, we could add the stats to GetTableLocations now, and
 > also expose them via new RPC later, when those other use cases
 > materialize.
 >
 > >  > If doing this, let's make sure to make it optional, since the
 > >  > statistics could become larger later (eg with things like
 > >  > histograms, NDVs per column, etc)
 > >
 > > Will make it optional if do that :)
 >
 > Agreed.

Yeah, they're hypothetical. If we want it put into

 > > Actually, I have considered to put those statistics into existing
 > message such as getShema which will be used when opening table. But
 > as for the table statistics may contain more and more entries it
 > may be better to be a independent rpc. What's more, because this
 > statistics may change when client trying to write, so I think it
 > will be useful to support a rpc for client to get the statistics
 > change. Although for query plan it will only use once and save one
 > round trip if we put it into existing rpc but it will be heavy if
 > we want to get statistic change in other use cases.
 >
 > How are the stats useful for a writing client (i.e. an Impala or
 > Spark executor) vs. the query planner?
 > 
 > What are the other use cases you're thinking about? If they're
 > hypothetical, we could add the stats to GetTableLocations now, and
 > also expose them via new RPC later, when those other use cases
 > materialize.
 >
 > >  > If doing this, let's make sure to make it optional, since the
 > >  > statistics could become larger later (eg with things like
 > >  > histograms, NDVs per column, etc)
 > >
 > > Will make it optional if do that :)
 >
 > Agreed.

Yeah, agree with you :). They are hypothetical. If put it into 
GetTableLocations I want to make sure that it will be called before spark plan 
use sizeInBytes. Currently in KuduRelation, I find GetTableLocations is called 
in buildScan when really scan data in kudu, but sizeInBytes is used before real 
scan. So maybe put it into GetTableLocations we will get empty sizeInBytes. And 
for the same client it has LocationCache, so if we use the same client to read 
between write, it may not update the statistics. But I still need to make sure 
about that :)

 > > Patch Set 2:
 > >
 > > > Actually, I have considered to put those statistics into
 > existing message such as getShema which will be used when opening
 > table. But as for the table statistics may contain more and more
 > entries it may be better to be a independent rpc. What's more,
 > because this statistics may change when client trying to write, so
 > I think it will be useful to support a rpc for client to get the
 > statistics change. Although for query plan it will only use once
 > and save one round trip if we put it into existing rpc but it will
 > be heavy if we want to get statistic change in other use cases.
 > >
 > > How are the stats useful for a writing client (i.e. an Impala or
 > Spark executor) vs. the query planner?
 > >
 > > What are the other use cases you're thinking about? If they're
 > hypothetical, we could add the stats to GetTableLocations now, and
 > also expose them via new RPC later, when those other use cases
 > materialize.
 > >
 > > >  > If doing this, let's make sure to make it optional, since
 > the
 > > >  > statistics could become larger later (eg with things like
 > > >  > histograms, NDVs per column, etc)
 > > >
 > > > Will make it optional if do that :)
 > >
 > > Agreed.
 >
 > Here are some stats that systems use:
 > - size in bytes
 > - number of rows
 > - per-column stats
 > - distinct count
 > - min/max/histogram
 >
 > The per-column stats seem like they wouldn't be a good fit for the
 > GetSchema endpoint, since they may end up returning user data.
 >
 > So I'm hesitant about adding stats in with GetSchema; from an
 > authorization point of view, getting the number of rows in a table,
 > for instance, currently requires some scan privileges. If we tie it
 > into the GetSchema RPC, this is no longer the case.
 >
 > Some info about Spark:
 > https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-CatalogStatistics.html
 > https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-ColumnStat.html

Agree with you and get that, I do think is not suitable to put it into 
GetSchema. :)


--
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 2
Gerrit-Owner: ZhangYao <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Grant Henke <[email protected]>
Gerrit-Reviewer: Hao Hao <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: ZhangYao <[email protected]>
Gerrit-Comment-Date: Sat, 24 Aug 2019 17:55:52 +0000
Gerrit-HasComments: No

Reply via email to