[
https://issues.apache.org/jira/browse/HIVE-22411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962271#comment-16962271
]
Steve Loughran commented on HIVE-22411:
---------------------------------------
Why do you need to list every single file under a directory tree just to
update the counter? That is a very expensive operation. On S3 it is
O(files/5000) and you are billed for it; with S3Guard it is slightly faster and
you are billed more for it. In both cases it lines you up for throttling by the
service.
Can't Hive count the amount of data during job commit?
> Performance degradation on single row inserts
> ---------------------------------------------
>
> Key: HIVE-22411
> URL: https://issues.apache.org/jira/browse/HIVE-22411
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Reporter: Attila Magyar
> Assignee: Attila Magyar
> Priority: Major
> Fix For: 4.0.0
>
> Attachments: Screen Shot 2019-10-17 at 8.40.50 PM.png
>
>
> Executing single insert statements on a transactional table affects write
> performance on an S3 file system. Each insert creates a new delta directory.
> After each insert, Hive calculates statistics such as the number of files in
> the table and the total size of the table. In order to calculate these, it
> traverses the directory recursively. During the recursion, a separate
> listStatus call is executed for each path. As a result, the more delta
> directories you have, the more time it takes to calculate the statistics.
> Therefore insertion time goes up linearly:
> !Screen Shot 2019-10-17 at 8.40.50 PM.png|width=601,height=436!
> The fix is to use fs.listFiles(path, /* recursive */ true) instead of the
> handcrafted recursive method.
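The linear growth described above comes from issuing one listStatus call per delta directory, whereas fs.listFiles(path, true) performs a single recursive listing that S3 can serve as a paged LIST. A minimal sketch of the call-count difference (a simulation, not Hive code; the class and method names are hypothetical, and the 5000-keys-per-page figure is taken from the O(files/5000) estimate in the comment):

```java
// Simulation: compares how many filesystem/list requests each strategy issues
// for a table with N delta directories (assume one bucket file per delta dir).
public class ListingCallCount {

    // Handcrafted recursion: one listStatus on the table root,
    // plus one listStatus per delta directory.
    static int recursiveListStatusCalls(int deltaDirs) {
        return 1 + deltaDirs;
    }

    // fs.listFiles(path, /* recursive */ true): one recursive listing,
    // paged by the store; ceil(files / pageSize) LIST requests against S3.
    static int singleListFilesCalls(int fileCount, int pageSize) {
        return Math.max(1, (fileCount + pageSize - 1) / pageSize);
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 1_000, 100_000}) {
            System.out.println(n + " delta dirs: handcrafted recursion = "
                + recursiveListStatusCalls(n) + " list calls, listFiles = "
                + singleListFilesCalls(n, 5000) + " LIST requests");
        }
    }
}
```

The request count for the handcrafted recursion grows linearly with the number of delta directories, while the single recursive listing stays nearly constant, which matches the linear insert-time curve in the attached screenshot.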
--
This message was sent by Atlassian Jira
(v8.3.4#803005)