ZhangYao has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 2:

(11 comments)

> Did you consider adding the statistics to the existing
 > GetTableLocations RPC response instead of creating a brand new RPC?
 > Query planners and job drivers already use it to open tables, so
 > it'd mean one less round trip to get the information.
Actually, I have considered to put those statistics into existing message such 
as getShema which will be used when opening table. But as for the table 
statistics may contain more and more entries it may be better to be a 
independent rpc. What's more, because this statistics may change when client 
trying to write, so I think it will be useful to support a rpc for client to 
get the statistics change. Although for query plan it will only use once and 
save one round trip if we put it into existing rpc but it will be heavy if we 
want to get statistic change in other use cases.

 > > Patch Set 1:
 > >
 > > Did you consider adding the statistics to the existing
 > GetTableLocations RPC response instead of creating a brand new RPC?
 > Query planners and job drivers already use it to open tables, so
 > it'd mean one less round trip to get the information.
 >
 > If doing this, let's make sure to make it optional, since the
 > statistics could become larger later (eg with things like
 > histograms, NDVs per column, etc)

Will make it optional if do that :)

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
File java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java:

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@816
PS1, Line 816:     GetTableStatisticsRequest rpc = new 
GetTableStatisticsRequest(this.masterTable,
> Are you planning to expose this in the c++ client too?
Yes, will do this :) maybe next part.


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java
File 
java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java:

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java@11
PS1, Line 11: class GetTableStatisticsRequest extends 
KuduRpc<GetTableStatisticsResponse> {
> This can be package private.
Done


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java
File 
java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java:

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java@8
PS1, Line 8:   private final long onDiskSize;
> If you see my other comment on defining a KuduTableStatistics object, then
Done


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java:

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java@227
PS1, Line 227:   public KuduTableStatistics getTableStatistics(String name) 
throws KuduException {
> Does it make sense for this method to be on the KuduTable object? The api w
I have implemented this method in KuduTable. Done.


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java
File java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java:

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java@811
PS1, Line 811:
> Is there a more deterministic way to ensure this happens? Using sleep often
Currently I don't find a way to trigger the heartbeat between tserver and 
master from java client side(This may need new rpc to implement the control). 
Here I use update_tablet_stats_interval_ms and heartbeat_interval_ms to control 
the heartbeat frequency and wait  enought time for the statistics to be 
collected on master. If the heartbeat and statistics work correctly it supposed 
to be right.
Whatever it still flaky somehow. Anyone have other advices?


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java@814
PS1, Line 814:     Thread.sleep(200 * 6);
> Can you test some edge cases? Maybe call this with 0 rows, and insert rows
0 rows has been test in this test. As for test inserting row increasingly, it 
can do but the statistics are suppose to lagging so maybe test the 
monotonically increasing of statistics and the final accuracy :)


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File 
java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@263
PS1, Line 263:   private val sizeFactor: Float = 1.0F
> What's the purpose of the size factor here?
Actually, because our on_disk_size is the size after compression so the 
estimation of size will be too small.  As for spark plan, sizeInBytes is always 
used to help selecting the suitable join method and according to its comment it 
is better to overestimate size than underestimate. This sizeFactor is the param 
to adjust the  estimation but its current value(1.0) is experimental.


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@267
PS1, Line 267:   override def sizeInBytes: Long = (math.max(onDiskSize, 0) * 
sizeFactor).toLong
> Shouldn't this size take the projection and predicates into consideration?
1. It supposed to do that but currently we don't have column statistics. We 
only have table statistics so it's hard to take the userSchema into 
consideration. Maybe we can roughly cut the size with how many columns we 
selected but it may underestimate the relation size. Here I use the total 
on_disk_size is conservative but safe.

If we use live row size to estimate the sizeInByte we can have more accurate 
estimation because we can cut those columns we doesn't select. However, we can 
not know the size of binary/string column(It has max_cell_size_bytes as 64kb in 
default but it's too large to do the estimation).

2. Will do it on test, Done :)


http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.h@638
PS1, Line 638:   Status GetTableStatistics(const GetTableStatisticsRequestPB* 
req,
> Can you add a test for this?
This may be tested in master_service.cc because the test in master_service.cc 
will call this functions.


http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.cc@2906
PS1, Line 2906:       table->set_name(table_name);
> I'm not sure these required permissions are correct for protecting this end
Actually, this metrics doesn't contained in tablemetadata and  I think it may 
just like metadata somehow so I use this  authorization. I invited hao hao to 
review this :)


http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/master_service.cc
File src/kudu/master/master_service.cc:

http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/master_service.cc@415
PS1, Line 415: void MasterServiceImpl::GetTableStatistics(const 
GetTableStatisticsRequestPB* req,
> Can you add a test for this? Maybe in master-test.cc
Done



--
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 2
Gerrit-Owner: ZhangYao <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Grant Henke <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: ZhangYao <[email protected]>
Gerrit-Comment-Date: Thu, 22 Aug 2019 08:01:11 +0000
Gerrit-HasComments: Yes

Reply via email to