I agree with Jingsong.

There are use cases to get partitions and partition stats in a single call
to reduce the IO cost.
For example, extending Catalog#listPartitions to
Catalog#listPartitionsWithStats,
and extending Catalog#listPartitionsByFilter to
Catalog#listPartitionsWithStatsByFilter.
This allows the planner to choose the cheapest one to request partitions
and stats.
But of course, this can be introduced in the future.

Besides, I found that the FLIP doesn't cover compatibility properly.
Introducing two new methods in Catalog
will break all third-party catalog implementations. On the other hand, the
planner should
have a way to identify whether the Catalog supports bulk get.

Best,
Jark

On Wed, 20 Jul 2022 at 16:43, Jingsong Li <jingsongl...@gmail.com> wrote:

> Thanks for your reply.
>
> - Consider bulkGetPartitionStatistics, partition statistics are
> already in HiveMetastoreClient.listPartitions. But on our side, we
> need Catalog.getPartitions first, and then
> Catalog.bulkGetPartitionStatistics.
>
> - Consider bulkGetPartitionColumnStatistics, yes, as you said, we need it.
>
> But to unify them, the current FLIP is acceptable.
>
> Best,
> Jingsong
>
> On Wed, Jul 20, 2022 at 3:57 PM Jing Ge <j...@ververica.com> wrote:
> >
> > Hi Jingsong,
> >
> > Thanks for clarifying it. Are you suggesting a new method or changing the
> > name of the methods described in the FLIP?
> > Please see my answers and further questions below.
> >
> > Best regards,
> > Jing
> >
> > On Wed, Jul 20, 2022 at 4:28 AM Jingsong Li <jingsongl...@gmail.com>
> wrote:
> >
> > > Hi Jing,
> > >
> > > I understand that the statistics for partitions are currently only
> > > used by Hive, so we can look at the Hive implementation:
> > >
> > > See HiveCatalog.getPartitionStatistics.
> > > To get the statistics, we actually get them from the
> > > org.apache.hadoop.hive.metastore.api.Partition object.
> > >
> >
> > Correct. Both new methods will do the same thing but in bulk mode.
> >
> >
> > >
> > > According to HiveMetastore's API, partition-related operations
> > > actually get the partition as well as the statistics information.
> > >
> >
> > I am not really sure which API it is. Methods in HiveMetaStoreClient that
> > are dealing with partitions will either return partitions or
> statisticObjs,
> > e.g. listPartitions(...) or getPartitionColumnStatistics(...)
> >
> >
> > >
> > > So if the current partition statistics are just for Hive, can we
> > > consider unifying it with Hive?
> > >
> >
> > Yes, we are on the same page. This FLIP is trying to unify it with
> > HiveMetaStoreClient.
> >
> >
> > >
> > > For example, in PushPartitionIntoTableSourceScanRule, just use
> > > `listPartitionWithStats`, and adjust table statistics from partitions.
> > >
> >
> > Does the bulkGetPartitionStatistics work in this case too? Or, do you
> mean
> > you need both partitions and related statistics returned by a new method
> > called `listPartitionWithStats`?
> >
> >
> > > Best,
> > > Jingsong
> > >
> > > On Tue, Jul 19, 2022 at 8:44 PM Jing Ge <j...@ververica.com> wrote:
> > > >
> > > > Thanks Jingsong for the suggestion.
> > > >
> > > > Do you mean using a different naming convention? There is a thought
> and
> > > > description in the FLIP about using "list" or "bulkGet":
> > > >
> > > >    - bulkGetPartitionStatistics(...) has been chosen over
> > > >    listPartitionStatistics(...), because, comparing to database and
> > > partition
> > > >    that are static and can be listed, statistics are more dynamic and
> > > will
> > > >    need more computation logic to create, therefore using "get" is
> > > >    semantically more feasible than list. The "bulk" gives users the
> hint
> > > that
> > > >    this method will work in the bulk mode and return a collection of
> > > instances.
> > > >
> > > >
> > > > As a reference, we can see that no method in MetaStoreClient, that
> > > > calculates statistics, uses the "list" naming convention.
> > > >
> > > > Best regards,
> > > > Jing
> > > >
> > > > On Fri, Jul 15, 2022 at 5:38 AM Jingsong Li <jingsongl...@gmail.com>
> > > wrote:
> > > >
> > > > > Thanks for starting this discussion.
> > > > >
> > > > > Have we considered introducing a listPartitionWithStats() in
> Catalog?
> > > > >
> > > > > Best,
> > > > > Jingsong
> > > > >
> > > > > On Fri, Jul 15, 2022 at 10:08 AM Jark Wu <imj...@gmail.com> wrote:
> > > > > >
> > > > > > Hi Jing,
> > > > > >
> > > > > > Thanks for starting this discussion. The bulk fetch is a great
> > > > > improvement
> > > > > > for the optimizer.
> > > > > > The FLIP looks good to me.
> > > > > >
> > > > > > Best,
> > > > > > Jark
> > > > > >
> > > > > > On Fri, 8 Jul 2022 at 17:36, Jing Ge <j...@ververica.com> wrote:
> > > > > >
> > > > > > > Hi devs,
> > > > > > >
> > > > > > > After having multiple discussions with Jark and Goldfrey, I'd
> like
> > > to
> > > > > start
> > > > > > > a discussion on the mailing list w.r.t. FLIP-247[1], which will
> > > > > > > significantly improve the performance by providing the bulk
> fetch
> > > > > > > capability for table and column statistics.
> > > > > > >
> > > > > > > Currently the statistics information about tables can only be
> > > fetched
> > > > > from
> > > > > > > the catalog by each given partition iteratively. Since getting
> > > > > statistics
> > > > > > > information from catalogs is a very heavy operation, in order
> to
> > > > > improve
> > > > > > > the query performance, we’d better provide functionality to
> fetch
> > > the
> > > > > > > statistics information of a table for all given partitions in
> one
> > > shot.
> > > > > > >
> > > > > > > Based on the manual performance test, for 2000 partitions, the
> cost
> > > > > will be
> > > > > > > improved from 10s to 2s. The improvement result is 500%.
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > > > >
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-247%3A+Bulk+fetch+of+table+and+column+statistics+for+given+partitions
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Jing
> > > > > > >
> > > > >
> > >
>

Reply via email to