ZhangYao has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )
Change subject: KUDU-2921: Exposing the table statistics to spark relation. ...................................................................... Patch Set 2: (11 comments) > Did you consider adding the statistics to the existing > GetTableLocations RPC response instead of creating a brand new RPC? > Query planners and job drivers already use it to open tables, so > it'd mean one less round trip to get the information. Actually, I have considered to put those statistics into existing message such as getShema which will be used when opening table. But as for the table statistics may contain more and more entries it may be better to be a independent rpc. What's more, because this statistics may change when client trying to write, so I think it will be useful to support a rpc for client to get the statistics change. Although for query plan it will only use once and save one round trip if we put it into existing rpc but it will be heavy if we want to get statistic change in other use cases. > > Patch Set 1: > > > > Did you consider adding the statistics to the existing > GetTableLocations RPC response instead of creating a brand new RPC? > Query planners and job drivers already use it to open tables, so > it'd mean one less round trip to get the information. > > If doing this, let's make sure to make it optional, since the > statistics could become larger later (eg with things like > histograms, NDVs per column, etc) Will make it optional if do that :) http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java File java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java: http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@816 PS1, Line 816: GetTableStatisticsRequest rpc = new GetTableStatisticsRequest(this.masterTable, > Are you planning to expose this in the c++ client too? Yes, will do this :) maybe next part. http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java File java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java: http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java@11 PS1, Line 11: class GetTableStatisticsRequest extends KuduRpc<GetTableStatisticsResponse> { > This can be package private. Done http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java File java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java: http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java@8 PS1, Line 8: private final long onDiskSize; > If you see my other comment on defining a KuduTableStatistics object, then Done http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java File java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java: http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java@227 PS1, Line 227: public KuduTableStatistics getTableStatistics(String name) throws KuduException { > Does it make sense for this method to be on the KuduTable object? The api w I have implemented this method in KuduTable. Done. http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java File java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java: http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java@811 PS1, Line 811: > Is there a more deterministic way to ensure this happens? Using sleep often Currently I don't find a way to trigger the heartbeat between tserver and master from java client side(This may need new rpc to implement the control). Here I use update_tablet_stats_interval_ms and heartbeat_interval_ms to control the heartbeat frequency and wait enought time for the statistics to be collected on master. If the heartbeat and statistics work correctly it supposed to be right. Whatever it still flaky somehow. Anyone have other advices? http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java@814 PS1, Line 814: Thread.sleep(200 * 6); > Can you test some edge cases? Maybe call this with 0 rows, and insert rows 0 rows has been test in this test. As for test inserting row increasingly, it can do but the statistics are suppose to lagging so maybe test the monotonically increasing of statistics and the final accuracy :) http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala: http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@263 PS1, Line 263: private val sizeFactor: Float = 1.0F > What's the purpose of the size factor here? Actually, because our on_disk_size is the size after compression so the estimation of size will be too small. As for spark plan, sizeInBytes is always used to help selecting the suitable join method and according to its comment it is better to overestimate size than underestimate. This sizeFactor is the param to adjust the estimation but its current value(1.0) is experimental. http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@267 PS1, Line 267: override def sizeInBytes: Long = (math.max(onDiskSize, 0) * sizeFactor).toLong > Shouldn't this size take the projection and predicates into consideration? 1. It supposed to do that but currently we don't have column statistics. We only have table statistics so it's hard to take the userSchema into consideration. Maybe we can roughly cut the size with how many columns we selected but it may underestimate the relation size. Here I use the total on_disk_size is conservative but safe. If we use live row size to estimate the sizeInByte we can have more accurate estimation because we can cut those columns we doesn't select. However, we can not know the size of binary/string column(It has max_cell_size_bytes as 64kb in default but it's too large to do the estimation). 2. Will do it on test, Done :) http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.h File src/kudu/master/catalog_manager.h: http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.h@638 PS1, Line 638: Status GetTableStatistics(const GetTableStatisticsRequestPB* req, > Can you add a test for this? This may be tested in master_service.cc because the test in master_service.cc will call this functions. http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.cc@2906 PS1, Line 2906: table->set_name(table_name); > I'm not sure these required permissions are correct for protecting this end Actually, this metrics doesn't contained in tablemetadata and I think it may just like metadata somehow so I use this authorization. I invited hao hao to review this :) http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/master_service.cc File src/kudu/master/master_service.cc: http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/master_service.cc@415 PS1, Line 415: void MasterServiceImpl::GetTableStatistics(const GetTableStatisticsRequestPB* req, > Can you add a test for this? Maybe in master-test.cc Done -- To view, visit http://gerrit.cloudera.org:8080/14107 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175 Gerrit-Change-Number: 14107 Gerrit-PatchSet: 2 Gerrit-Owner: ZhangYao <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Grant Henke <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Reviewer: ZhangYao <[email protected]> Gerrit-Comment-Date: Thu, 22 Aug 2019 08:01:11 +0000 Gerrit-HasComments: Yes
