[jira] [Commented] (HIVE-24663) Batch process in ColStatsProcessor

mahesh kumar behera (Jira) Thu, 13 May 2021 00:31:06 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-24663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343755#comment-17343755
 ]


mahesh kumar behera commented on HIVE-24663:
--------------------------------------------

The original slowness is caused by the way column stats are processed at 
HMS. The stats are updated one by one at HMS using JDO connections. This 
results in performance issues because JDO does a lot of conversion. So the 
proper fix is to batch the processing into single SQL statements and 
execute them using direct SQL. 
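
As a rough illustration of the "single SQL statement" idea, here is a minimal sketch that folds many stats rows into one multi-row INSERT and executes it once over plain JDBC. The table DEMO_PART_COL_STATS, its three columns, and the StatRow class are hypothetical and simplified for illustration; they are not the real metastore schema or the actual patch.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

final class BatchedStatsSql {

    /** Hypothetical, simplified stats row; not the real metastore schema. */
    static final class StatRow {
        final long partId;
        final String columnName;
        final long numNulls;

        StatRow(long partId, String columnName, long numNulls) {
            this.partId = partId;
            this.columnName = columnName;
            this.numNulls = numNulls;
        }
    }

    /** Builds one multi-row INSERT with three placeholders per row. */
    static String buildInsertSql(int rowCount) {
        if (rowCount <= 0) {
            throw new IllegalArgumentException("rowCount must be positive");
        }
        String placeholders =
            String.join(", ", Collections.nCopies(rowCount, "(?, ?, ?)"));
        // Table and column names are assumptions made for this sketch.
        return "INSERT INTO DEMO_PART_COL_STATS (PART_ID, COLUMN_NAME, NUM_NULLS) VALUES "
            + placeholders;
    }

    /** Binds every row and executes the statement once, in a single round trip. */
    static int insertAll(Connection conn, List<StatRow> rows) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(buildInsertSql(rows.size()))) {
            int idx = 1;
            for (StatRow r : rows) {
                ps.setLong(idx++, r.partId);
                ps.setString(idx++, r.columnName);
                ps.setLong(idx++, r.numNulls);
            }
            return ps.executeUpdate();
        }
    }
}
```

The point of the sketch is only that the per-row object conversion and per-row round trip disappear; a real change would also have to handle updates to existing rows and database-specific limits on the number of bind parameters.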

> Batch process in ColStatsProcessor
> ----------------------------------
>
>                 Key: HIVE-24663
>                 URL: https://issues.apache.org/jira/browse/HIVE-24663
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: mahesh kumar behera
>            Priority: Major
>              Labels: performance
>
> When a large number of partitions (>20K) are processed, ColStatsProcessor 
> runs into DB issues. 
> {{db.setPartitionColumnStatistics(request);}} gets stuck for hours 
> and in some cases Postgres stops processing. 
> It would be good to introduce small batches for stats gathering in 
> ColStatsProcessor instead of a bulk update.
> Ref: 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java#L181
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java#L199
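
The small-batch idea from the issue description could look roughly like the sketch below: split the full list of per-partition stats into fixed-size chunks and send each chunk separately. The StatsBatcher class, its method names, and the sender callback are assumptions made for illustration; this is not the code in ColStatsProcessor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class StatsBatcher {

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /**
     * Hands each batch to the sender (for example, a call that issues one
     * stats update request per batch) and returns the number of batches sent.
     */
    static <T> int forEachBatch(List<T> items, int batchSize, Consumer<List<T>> sender) {
        int sent = 0;
        for (List<T> batch : partition(items, batchSize)) {
            sender.accept(batch);
            sent++;
        }
        return sent;
    }
}
```

With this shape, 20K partitions and a batch size of 1000 would produce 20 requests instead of one very large one. The right batch size is a tuning question that the thread does not settle.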



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
