Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

Jingsong Li Wed, 20 Jul 2022 01:42:57 -0700

Thanks for your reply.

- Consider bulkGetPartitionStatistics, partition statistics are
already in HiveMetastoreClient.listPartitions. But on our side, we
need Catalog.getPartitions first, and then
Catalog.bulkGetPartitionStatistics.


- Consider bulkGetPartitionColumnStatistics, yes, as you said, we need it.

But to unify them, the current FLIP is acceptable.

Best,
Jingsong

On Wed, Jul 20, 2022 at 3:57 PM Jing Ge <[email protected]> wrote:
>
> Hi Jingsong,
>
> Thanks for clarifying it. Are you suggesting a new method or changing the
> name of the methods described in the FLIP?
> Please see my answers and further questions below.
>
> Best regards,
> Jing
>
> On Wed, Jul 20, 2022 at 4:28 AM Jingsong Li <[email protected]> wrote:
>
> > Hi Jing,
> >
> > I understand that the statistics for partitions are currently only
> > used by Hive, so we can look at the Hive implementation:
> >
> > See HiveCatalog.getPartitionStatistics.
> > To get the statistics, we actually get them from the
> > org.apache.hadoop.hive.metastore.api.Partition object.
> >
>
> Correct. Both new methods will do the same thing but in bulk mode.
>
>
> >
> > According to HiveMetastore's API, partition-related operations
> > actually get the partition as well as the statistics information.
> >
>
> I am not really sure which API it is. Methods in HiveMetaStoreClient that
> are dealing with partitions will either return partitions or statisticObjs,
> e.g. listPartitions(...) or getPartitionColumnStatistics(...)
>
>
> >
> > So if the current partition statistics are just for Hive, can we
> > consider unifying it with Hive?
> >
>
> Yes, we are on the same page. This FLIP is trying to unify it with
> HiveMetaStoreClient.
>
>
> >
> > For example, in PushPartitionIntoTableSourceScanRule, just use
> > `listPartitionWithStats`, and adjust table statistics from partitions.
> >
>
> Does the bulkGetPartitionStatistics work in this case too? Or, do you mean
> you need both partitions and related statistics returned by a new method
> called `listPartitionWithStats`?
>
>
> > Best,
> > Jingsong
> >
> > On Tue, Jul 19, 2022 at 8:44 PM Jing Ge <[email protected]> wrote:
> > >
> > > Thanks Jingsong for the suggestion.
> > >
> > > Do you mean using a different naming convention? There is a thought and
> > > description in the FLIP about using "list" or "bulkGet":
> > >
> > >    - bulkGetPartitionStatistics(...) has been chosen over
> > >    listPartitionStatistics(...), because, comparing to database and
> > partition
> > >    that are static and can be listed, statistics are more dynamic and
> > will
> > >    need more computation logic to create, therefore using "get" is
> > >    semantically more feasible than list. The "bulk" gives users the hint
> > that
> > >    this method will work in the bulk mode and return a collection of
> > instances.
> > >
> > >
> > > As a reference, we can see that no method in MetaStoreClient, that
> > > calculates statistics, uses the "list" naming convention.
> > >
> > > Best regards,
> > > Jing
> > >
> > > On Fri, Jul 15, 2022 at 5:38 AM Jingsong Li <[email protected]>
> > wrote:
> > >
> > > > Thanks for starting this discussion.
> > > >
> > > > Have we considered introducing a listPartitionWithStats() in Catalog?
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > On Fri, Jul 15, 2022 at 10:08 AM Jark Wu <[email protected]> wrote:
> > > > >
> > > > > Hi Jing,
> > > > >
> > > > > Thanks for starting this discussion. The bulk fetch is a great
> > > > improvement
> > > > > for the optimizer.
> > > > > The FLIP looks good to me.
> > > > >
> > > > > Best,
> > > > > Jark
> > > > >
> > > > > On Fri, 8 Jul 2022 at 17:36, Jing Ge <[email protected]> wrote:
> > > > >
> > > > > > Hi devs,
> > > > > >
> > > > > > After having multiple discussions with Jark and Goldfrey, I'd like
> > to
> > > > start
> > > > > > a discussion on the mailing list w.r.t. FLIP-247[1], which will
> > > > > > significantly improve the performance by providing the bulk fetch
> > > > > > capability for table and column statistics.
> > > > > >
> > > > > > Currently the statistics information about tables can only be
> > fetched
> > > > from
> > > > > > the catalog by each given partition iteratively. Since getting
> > > > statistics
> > > > > > information from catalogs is a very heavy operation, in order to
> > > > improve
> > > > > > the query performance, we’d better provide functionality to fetch
> > the
> > > > > > statistics information of a table for all given partitions in one
> > shot.
> > > > > >
> > > > > > Based on the manual performance test, for 2000 partitions, the cost
> > > > will be
> > > > > > improved from 10s to 2s. The improvement result is 500%.
> > > > > >
> > > > > > [1]
> > > > > >
> > > > > >
> > > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-247%3A+Bulk+fetch+of+table+and+column+statistics+for+given+partitions
> > > > > >
> > > > > > Best regards,
> > > > > > Jing
> > > > > >
> > > >
> >

Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

Reply via email to