Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )
Change subject: KUDU-2921: Exposing the table statistics to spark relation. ...................................................................... Patch Set 1: (10 comments) http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java File java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java: http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@816 PS1, Line 816: public Deferred<GetTableStatisticsResponse> getTableStatistics(String name) { Are you planning to expose this in the c++ client too? http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java File java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java: http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java@11 PS1, Line 11: public class GetTableStatisticsRequest extends KuduRpc<GetTableStatisticsResponse> { This can be package private. http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java File java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java: http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java@8 PS1, Line 8: public class GetTableStatisticsResponse extends KuduRpcResponse { If you see my other comment on defining a KuduTableStatistics object, then this could be package private. http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java File java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java: http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java@227 PS1, Line 227: public GetTableStatisticsResponse getTableStatistics(String name) throws KuduException { Does it make sense for this method to be on the KuduTable object? The api would then be something like `table.getStatistics`. I think the return object here should be a `KuduTableStatistics` object. It will look similar to the GetTableStatisticsResponse object, but decouples the RPC concerns from the domain objects. This pattern can be seen with `KuduClient.openTable` which returns KuduTable instead of GetKuduSchemaResponse. http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java File java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java: http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java@811 PS1, Line 811: Thread.sleep(500 * 9); Is there a more deterministic way to ensure this happens? Using sleep often leads to flaky tests. http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java@814 PS1, Line 814: statistics = table.getTableStatistics(); Can you test some edge cases? Maybe call this with 0 rows, and insert rows again and check that the value has increased? http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala: http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@263 PS1, Line 263: private val sizeFactor: Float = 1.0F What's the purpose of the size factor here? http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@267 PS1, Line 267: override def sizeInBytes: Long = (math.max(onDiskSize, 0) * sizeFactor).toLong Shouldn't this size take the projection and predicates into consideration? I suppose this would be the worst case size of the relation as is. Is there a way we can test that spark is receiving and using this value correctly here? http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.h File src/kudu/master/catalog_manager.h: http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.h@638 PS1, Line 638: Status GetTableStatistics(const GetTableStatisticsRequestPB* req, Can you add a test for this? http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/master_service.cc File src/kudu/master/master_service.cc: http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/master_service.cc@415 PS1, Line 415: void MasterServiceImpl::GetTableStatistics(const GetTableStatisticsRequestPB* req, Can you add a test for this? Maybe in master-test.cc -- To view, visit http://gerrit.cloudera.org:8080/14107 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175 Gerrit-Change-Number: 14107 Gerrit-PatchSet: 1 Gerrit-Owner: ZhangYao <[email protected]> Gerrit-Reviewer: Grant Henke <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Tue, 20 Aug 2019 15:55:33 +0000 Gerrit-HasComments: Yes
