I agree with Jingsong. There are use cases to get partitions and partition stats in a single call to reduce the IO cost. For example, extending Catalog#listPartitions to Catalog#listPartitionsWithStats, and extending Catalog#listPartitionsByFilter to Catalog#listPartitionsWithStatsByFilter. This allows the planner to choose the cheapest one to request partitions and stats. But of course, this can be introduced in the future.
Besides, I found that the FLIP doesn't cover compatibility properly. Introducing two new methods in Catalog will break all third-party catalog implementations. On the other hand, the planner should have a way to identify whether the Catalog supports bulk get. Best, Jark On Wed, 20 Jul 2022 at 16:43, Jingsong Li <jingsongl...@gmail.com> wrote: > Thanks for your reply. > > - Consider bulkGetPartitionStatistics, partition statistics are > already in HiveMetastoreClient.listPartitions. But on our side, we > need Catalog.getPartitions first, and then > Catalog.bulkGetPartitionStatistics. > > - Consider bulkGetPartitionColumnStatistics, yes, as you said, we need it. > > But to unify them, the current FLIP is acceptable. > > Best, > Jingsong > > On Wed, Jul 20, 2022 at 3:57 PM Jing Ge <j...@ververica.com> wrote: > > > > Hi Jingsong, > > > > Thanks for clarifying it. Are you suggesting a new method or changing the > > name of the methods described in the FLIP? > > Please see my answers and further questions below. > > > > Best regards, > > Jing > > > > On Wed, Jul 20, 2022 at 4:28 AM Jingsong Li <jingsongl...@gmail.com> > wrote: > > > > > Hi Jing, > > > > > > I understand that the statistics for partitions are currently only > > > used by Hive, so we can look at the Hive implementation: > > > > > > See HiveCatalog.getPartitionStatistics. > > > To get the statistics, we actually get them from the > > > org.apache.hadoop.hive.metastore.api.Partition object. > > > > > > > Correct. Both new methods will do the same thing but in bulk mode. > > > > > > > > > > According to HiveMetastore's API, partition-related operations > > > actually get the partition as well as the statistics information. > > > > > > > I am not really sure which API it is. Methods in HiveMetaStoreClient that > > are dealing with partitions will either return partitions or > statisticObjs, > > e.g. listPartitions(...) or getPartitionColumnStatistics(...) > > > > > > > > > > So if the current partition statistics are just for Hive, can we > > > consider unifying it with Hive? > > > > > > > Yes, we are on the same page. This FLIP is trying to unify it with > > HiveMetaStoreClient. > > > > > > > > > > For example, in PushPartitionIntoTableSourceScanRule, just use > > > `listPartitionWithStats`, and adjust table statistics from partitions. > > > > > > > Does the bulkGetPartitionStatistics work in this case too? Or, do you > mean > > you need both partitions and related statistics returned by a new method > > called `listPartitionWithStats`? > > > > > > > Best, > > > Jingsong > > > > > > On Tue, Jul 19, 2022 at 8:44 PM Jing Ge <j...@ververica.com> wrote: > > > > > > > > Thanks Jingsong for the suggestion. > > > > > > > > Do you mean using a different naming convention? There is a thought > and > > > > description in the FLIP about using "list" or "bulkGet": > > > > > > > > - bulkGetPartitionStatistics(...) has been chosen over > > > > listPartitionStatistics(...), because, comparing to database and > > > partition > > > > that are static and can be listed, statistics are more dynamic and > > > will > > > > need more computation logic to create, therefore using "get" is > > > > semantically more feasible than list. The "bulk" gives users the > hint > > > that > > > > this method will work in the bulk mode and return a collection of > > > instances. > > > > > > > > > > > > As a reference, we can see that no method in MetaStoreClient, that > > > > calculates statistics, uses the "list" naming convention. > > > > > > > > Best regards, > > > > Jing > > > > > > > > On Fri, Jul 15, 2022 at 5:38 AM Jingsong Li <jingsongl...@gmail.com> > > > wrote: > > > > > > > > > Thanks for starting this discussion. > > > > > > > > > > Have we considered introducing a listPartitionWithStats() in > Catalog? > > > > > > > > > > Best, > > > > > Jingsong > > > > > > > > > > On Fri, Jul 15, 2022 at 10:08 AM Jark Wu <imj...@gmail.com> wrote: > > > > > > > > > > > > Hi Jing, > > > > > > > > > > > > Thanks for starting this discussion. The bulk fetch is a great > > > > > improvement > > > > > > for the optimizer. > > > > > > The FLIP looks good to me. > > > > > > > > > > > > Best, > > > > > > Jark > > > > > > > > > > > > On Fri, 8 Jul 2022 at 17:36, Jing Ge <j...@ververica.com> wrote: > > > > > > > > > > > > > Hi devs, > > > > > > > > > > > > > > After having multiple discussions with Jark and Goldfrey, I'd > like > > > to > > > > > start > > > > > > > a discussion on the mailing list w.r.t. FLIP-247[1], which will > > > > > > > significantly improve the performance by providing the bulk > fetch > > > > > > > capability for table and column statistics. > > > > > > > > > > > > > > Currently the statistics information about tables can only be > > > fetched > > > > > from > > > > > > > the catalog by each given partition iteratively. Since getting > > > > > statistics > > > > > > > information from catalogs is a very heavy operation, in order > to > > > > > improve > > > > > > > the query performance, we’d better provide functionality to > fetch > > > the > > > > > > > statistics information of a table for all given partitions in > one > > > shot. > > > > > > > > > > > > > > Based on the manual performance test, for 2000 partitions, the > cost > > > > > will be > > > > > > > improved from 10s to 2s. The improvement result is 500%. > > > > > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-247%3A+Bulk+fetch+of+table+and+column+statistics+for+given+partitions > > > > > > > > > > > > > > Best regards, > > > > > > > Jing > > > > > > > > > > > > > > > >